Live
#16 medium infra_monitor
CPU alert on claw-gateway1 — 91.2% (threshold: 90%)
Host: claw-gateway1 CAUSE: CPU exceeded the 90% warning threshold. IMPACT: Performance may degrade if the trend continues. ACTION: Monitor for sustained elevation; investigate if it persists beyond 15 minutes. CPU: 91.2% | Memory: 55.2%
Opened 2026-04-28 00:05 UTC · Resolved 2026-04-28 00:20 UTC
Handoff Notes ← Dashboard
Timeline
WEBHOOK
2026-04-28 00:05 UTC
Alert received from AI Infra Monitor. Host: claw-gateway1, Severity: MEDIUM
STATUS CHANGE
2026-04-28 00:05 UTC
OPEN -> INVESTIGATING (auto - low/medium severity)
CONTEXT AGGREGATED
2026-04-28 00:05 UTC
Sources available: 3/3 — Runbook: ✓ | Past incidents: ✓ | Infra health: ✓
Response Plan
2026-04-28 00:05 UTC

Severity

MEDIUM: Single gateway at 91.2% CPU; degraded performance risk if sustained >15 min.

Root Cause

  • Runaway process consuming CPU on claw-gateway1
  • Traffic spike to gateway (check load balancer routing)

Actions

  1. SSH to claw-gateway1; run top -b -n1 | head -20 to identify top CPU consumer.
  2. If single process: kill gracefully, monitor recovery for 2 min.
  3. If traffic spike: check load balancer; verify no misconfiguration routing excess traffic here.
  4. If CPU remains >85% after 5 min, restart the claw-gateway service: sudo systemctl restart claw-gateway.
  5. If CPU stays >85% post-restart, reboot host: sudo reboot (expect 2–3 min downtime).

Watch

  • CPU trend: alert if it exceeds 95% or stays >80% for >10 min.
  • Memory: confirm it doesn't spike during remediation (currently healthy at 55.2%).

Escalate If

CPU remains >85% after service restart, or if gateway becomes unresponsive.

STATUS CHANGE
2026-04-28 00:15 UTC
Auto-resolver: CPU at 13.9% (below 70% clear threshold) — clean check 1/2
STATUS CHANGE
2026-04-28 00:15 UTC
Auto-resolver: CPU at 13.9% (below 70% clear threshold) — clean check 1/2
STATUS CHANGE
2026-04-28 00:20 UTC
Auto-resolver: CPU at 13.9% (below 70% clear threshold) — clean check 2/2
STATUS CHANGE
2026-04-28 00:20 UTC
AUTO-RESOLVED: CPU sustained below 70% for 2 consecutive checks. Current value: 13.9%
·
HANDOFF
2026-04-28 01:54 UTC
Handoff notes generated: # Shift Handoff Notes - **Incident**: CPU spike on claw-gateway1 reached 91.2% at 00:05 UTC on 2026-04-28; triggered MEDIUM severity alert. Root cause suspected to be runaway process and/or traffic spike to gateway. - **Resolution**: CPU automatically recovered to 13.9% by 00:20 UTC; incident auto-resolved after 2 consecutive clean health checks below 70% threshold. - **Current State**: claw-gateway1 operating normally with CPU at baseline levels. No manual intervention was required. - **Monitor**: Watch for CPU spike recurrence on claw-gateway1 over next 2-4 hours. If spike returns, investigate: (1) top CPU-consuming process via `top`, (2) load balancer routing configuration for traffic anomalies. - **No Action Required**: Incident is resolved and stable. Escalate to infrastructure team only if alert fires again within shift.
·
HANDOFF
2026-05-01 09:14 UTC
Handoff notes generated: # Shift Handoff Notes - **Incident Summary**: CPU spike on claw-gateway1 reached 91.2% at 00:05 UTC on 2026-04-28, triggering MEDIUM severity alert. Suspected root cause: runaway process or traffic spike to gateway. - **Resolution**: CPU auto-resolved at 00:20 UTC after dropping to 13.9% and sustaining below 70% threshold for 2 consecutive checks (15 min total). - **Current State**: claw-gateway1 operating normally with CPU at 13.9%. No manual intervention was required; incident auto-resolved. - **Root Cause Unknown**: The spike occurred and resolved without clear identification of the underlying cause (runaway process or load balancer misconfiguration). Recommend investigating logs/metrics from 00:05–00:15 UTC if spike recurs. - **Watch For**: Monitor claw-gateway1 CPU over next 2–4 hours for recurring spikes at similar thresholds. If CPU >85% again, escalate to infrastructure team for deeper process/traffic analysis.
·
HANDOFF
2026-05-04 08:27 UTC
Handoff notes generated: # Shift Handoff Notes - **Incident**: CPU spike on claw-gateway1 reached 91.2% at 00:05 UTC on 2026-04-28, triggering MEDIUM severity alert. Suspected root cause: runaway process or traffic spike. - **Resolution**: CPU auto-resolved at 00:20 UTC after dropping to 13.9% and remaining below 70% threshold for 2 consecutive checks (15-minute window). - **Current State**: RESOLVED. claw-gateway1 CPU stable at 13.9%. No manual intervention was required; issue self-corrected. - **Watch For**: Monitor claw-gateway1 CPU over next few hours for recurrence. If spike returns, investigate top CPU consumer via `top` command and check load balancer routing configuration for misconfiguration. - **Runbook Available**: Full investigation steps and diagnostics documented in AI Infra Monitor runbook if escalation needed.
·
HANDOFF
2026-05-04 15:34 UTC
Handoff notes generated: # Shift Handoff Notes - **Incident**: CPU spike on claw-gateway1 reached 91.2% (threshold: 90%) at 00:05 UTC on 2026-04-28, triggering MEDIUM severity alert. Suspected root cause: runaway process or traffic spike. - **Resolution**: CPU auto-resolved to 13.9% by 00:20 UTC after 2 consecutive clean checks below 70% threshold. No manual intervention was required. - **Current State**: RESOLVED. claw-gateway1 operating normally with CPU at 13.9%. All context sources (runbook, incident history, infra health) confirmed. - **Next Steps**: Monitor claw-gateway1 for CPU regression over next shift. If spike recurs, SSH to host and run `top -b -n1 | head -20` to identify persistent runaway processes. Check load balancer routing configuration if traffic anomalies detected. - **Watch For**: Sustained CPU >85% lasting >15 min; if occurs, escalate and investigate process-level details before auto-resolution threshold is hit.
·
HANDOFF
2026-05-06 07:00 UTC
Handoff notes generated: # Shift Handoff Notes - **Incident**: CPU spike on claw-gateway1 reached 91.2% at 00:05 UTC on 2026-04-28, triggering MEDIUM severity alert. Suspected root cause: runaway process or traffic spike to gateway. - **Resolution**: CPU auto-resolved to 13.9% by 00:20 UTC after two consecutive checks below 70% threshold. No manual intervention was required. - **Current State**: Incident is RESOLVED. claw-gateway1 is operating normally with CPU sustained well below alert threshold. - **Monitor**: Watch for CPU spikes on claw-gateway1 over the next shift. If alert recurs, investigate load balancer routing configuration and check for runaway processes using `top -b -n1 | head -20`. - **Follow-up**: Root cause analysis incomplete (process/traffic source not definitively identified). Consider deeper investigation if pattern repeats to prevent future incidents.
·
HANDOFF
2026-05-07 11:18 UTC
Handoff notes generated: # Shift Handoff Notes - **Incident**: CPU spike on claw-gateway1 reached 91.2% (threshold: 90%) at 00:05 UTC on 2026-04-28, triggered MEDIUM severity alert. Suspected root cause: runaway process or traffic spike. - **Resolution**: CPU automatically recovered to 13.9% within ~16 minutes; auto-resolver confirmed sustained clearance below 70% threshold across 2 consecutive checks and auto-resolved at 00:20 UTC. - **Current State**: RESOLVED. claw-gateway1 operating normally with CPU well below alert threshold. No manual intervention was required. - **Watch For**: Monitor for recurrence of CPU spikes on claw-gateway1. If spike reoccurs, investigate load balancer routing configuration and check for runaway processes using `top` command on the host. - **Follow-up**: Root cause analysis incomplete—recommend reviewing process logs and load balancer metrics from 00:05–00:20 UTC window to confirm whether spike was process-driven or traffic-driven for pattern prevention.
·
HANDOFF
2026-05-14 11:00 UTC
Handoff notes generated: # Shift Handoff Notes - **Incident**: CPU spike on claw-gateway1 reached 91.2% (threshold: 90%) at 00:05 UTC on 2026-04-28, triggering MEDIUM severity alert. Suspected root cause: runaway process or traffic spike to gateway. - **Resolution**: CPU auto-recovered to 13.9% within 15 minutes; incident auto-resolved at 00:20 UTC after two consecutive checks confirmed sustained drop below 70% threshold. - **Current State**: claw-gateway1 operating normally with CPU well below alert threshold. No manual intervention was required. - **Watch For**: Monitor for recurrence of CPU spikes on claw-gateway1. If spike repeats, investigate load balancer routing configuration and identify any runaway processes using the planned diagnostics (`top -b -n1`). - **Root Cause**: Incomplete investigation — actual cause (process vs. traffic spike) was not definitively determined before auto-resolution. Consider deeper diagnostics if pattern repeats.
·
HANDOFF
2026-05-29 04:57 UTC
Handoff notes generated: # Shift Handoff Notes - **Incident**: CPU spike on claw-gateway1 reached 91.2% (threshold: 90%) at 00:05 UTC on 2026-04-28, triggering MEDIUM severity alert. Suspected root cause: runaway process or traffic spike. - **Resolution**: CPU auto-resolved to 13.9% by 00:20 UTC after 2 consecutive checks below 70% threshold. Incident automatically closed with no manual intervention required. - **Current State**: claw-gateway1 CPU nominal at 13.9%. No ongoing issues detected; gateway operating normally. - **Watch For**: Monitor for CPU spike recurrence on claw-gateway1 over next shift. If spike repeats, SSH in and run `top` to identify specific runaway process or check load balancer routing configuration for misalignment. - **Action Items**: None pending. Runbook and context sources available if incident recurs.
·
HANDOFF
2026-05-31 15:00 UTC
Handoff notes generated: # Shift Handoff Notes - **Incident**: CPU spike on claw-gateway1 reached 91.2% (threshold: 90%) at 00:05 UTC on 2026-04-28, triggering MEDIUM severity alert. Suspected root cause: runaway process or traffic spike. - **Resolution**: CPU auto-resolved to 13.9% by 00:20 UTC after two consecutive clean checks below 70% threshold. No manual intervention was required. - **Current State**: Incident is RESOLVED. claw-gateway1 CPU is stable at 13.9% as of last check. No ongoing issues detected. - **Watch For**: Monitor claw-gateway1 for CPU spikes >90% in the coming shift. If recurrence occurs, SSH to the host and run `top -b -n1 | head -20` to identify the culprit process. Also verify load balancer routing to rule out traffic misconfiguration. - **Root Cause TBD**: The incident self-resolved; underlying cause (runaway process vs. traffic spike) was not definitively identified. Consider adding process-level monitoring to claw-gateway1 for future troubleshooting.
·
HANDOFF
2026-06-01 06:18 UTC
Handoff notes generated: # Shift Handoff Notes - **Incident**: CPU spike on claw-gateway1 reached 91.2% (threshold: 90%) at 00:05 UTC on 2026-04-28, triggering MEDIUM severity alert. Suspected root cause was a runaway process and/or traffic spike. - **Resolution**: CPU auto-resolved at 00:20 UTC after dropping to 13.9% and sustaining below 70% threshold for 2 consecutive checks (5+ min total). - **Current State**: RESOLVED. claw-gateway1 operating normally with CPU at 13.9%. - **Root Cause Not Confirmed**: The incident resolved automatically before manual investigation could identify the specific runaway process or traffic pattern. No process kill or load balancer adjustment was required. - **Watch For**: Monitor claw-gateway1 CPU over next shift for recurrence. If spike returns, SSH in and run `top -b -n1 | head -20` to identify the culprit process before it auto-resolves. Check load balancer routing configuration if pattern repeats.
·
HANDOFF
2026-06-06 10:43 UTC
Handoff notes generated: # Shift Handoff Notes - **Incident**: CPU spike on claw-gateway1 peaked at 91.2% (threshold: 90%) on 2026-04-28 at 00:05 UTC, triggering MEDIUM severity alert. Suspected root cause: runaway process or traffic spike to gateway. - **Resolution**: Incident auto-resolved at 00:20 UTC after CPU dropped to 13.9% and remained below 70% threshold for two consecutive checks (15 minutes sustained). - **Current State**: claw-gateway1 is operating normally with CPU at 13.9%. No manual intervention was required; the spike appeared transient. - **Root Cause Not Confirmed**: The underlying cause (runaway process vs. traffic spike) was not definitively identified before auto-recovery. Recommend reviewing load balancer routing and claw-gateway1 process logs for 2026-04-28 00:05–00:20 UTC if similar spikes recur. - **Watch For**: Monitor claw-gateway1 CPU over the next shift for recurrence. If CPU spikes again, follow the runbook: SSH in, run `top` to identify the offending process, and check load balancer configuration for misrouted traffic.
·
HANDOFF
2026-06-12 10:52 UTC
Handoff notes generated: # Shift Handoff Notes - **Incident**: CPU spike on claw-gateway1 reached 91.2% (threshold: 90%) at 00:05 UTC on 2026-04-28, triggering MEDIUM severity alert. Suspected root cause: runaway process or traffic spike. - **Resolution**: CPU auto-resolved to 13.9% within 15 minutes and sustained below 70% threshold for 2 consecutive checks. Incident auto-closed at 00:20 UTC. - **Current State**: claw-gateway1 healthy with CPU at 13.9%. No manual intervention was required; spike appears to have self-resolved. - **Watch For**: Monitor claw-gateway1 for recurring CPU spikes. If pattern repeats, investigate load balancer routing configuration and review process logs for runaway jobs. Consider running `top` on the host to baseline normal CPU usage. - **Runbook Available**: Detailed AI response plan exists with manual troubleshooting steps (SSH, `top` command, kill procedures) if spike recurs.
·
HANDOFF
2026-06-13 00:45 UTC
Handoff notes generated: # Shift Handoff Notes - **Incident**: CPU spike on claw-gateway1 reached 91.2% (threshold: 90%) at 00:05 UTC on 2026-04-28, triggering MEDIUM severity alert. Suspected root cause: runaway process or traffic spike. - **Resolution**: CPU auto-resolved at 00:20 UTC after dropping to 13.9% and remaining below 70% threshold for 2 consecutive checks (15+ minutes). No manual intervention was required. - **Current State**: Host is healthy with CPU sustained well below alert threshold. Incident marked RESOLVED. - **Next Steps**: Monitor claw-gateway1 for recurrence of high CPU. If spike returns, SSH to host and run `top -b -n1 | head -20` to identify the culprit process. Check load balancer configuration for any routing misconfigurations. - **Watch For**: Any sustained CPU >85% over 15+ minutes; if pattern repeats, escalate to infrastructure team for deeper process/load analysis.
Update Status
Details
ID #16
Severity MEDIUM
Source infra_monitor
Status RESOLVED
Opened 2026-04-28 00:05