Live
#20 medium infra_monitor
CPU alert on claw-gateway1 — 93.0% (threshold: 90%)
Host: claw-gateway1 CAUSE: CPU exceeded the 90% warning threshold. IMPACT: Performance may degrade if the trend continues. ACTION: Monitor for sustained elevation; investigate if it persists beyond 15 minutes. CPU: 93.0% | Memory: 57.8%
Opened 2026-05-03 00:05 UTC · Resolved 2026-05-03 00:20 UTC
Handoff Notes ← Dashboard
Timeline
WEBHOOK
2026-05-03 00:05 UTC
Alert received from AI Infra Monitor. Host: claw-gateway1, Severity: MEDIUM
STATUS CHANGE
2026-05-03 00:05 UTC
OPEN -> INVESTIGATING (auto - low/medium severity)
CONTEXT AGGREGATED
2026-05-03 00:05 UTC
Sources available: 3/3 — Runbook: ✓ | Past incidents: ✓ | Infra health: ✓
Response Plan
2026-05-03 00:05 UTC

Severity

MEDIUM — Gateway CPU at 93%, trending upward; degradation risk if sustained >15 min, but no current traffic impact.

Root Cause

  • Runaway process or memory pressure triggering CPU contention
  • Recent deployment or config change on claw-gateway1

Actions

  1. SSH to claw-gateway1; run top -b -n1 | head -20 to identify top CPU consumer.
  2. Check last 5 min of logs: journalctl -u claw-gateway -n 100 --since "5 min ago" for errors or restarts.
  3. If single process >70% CPU: kill gracefully; if persists, prepare for restart.
  4. Confirm memory pressure not driving CPU (currently 57.8%—healthy).
  5. Monitor CPU drop over next 10 min; if still >85% at 10-min mark, trigger escalation.

Watch

  • CPU trend: should drop below 85% within 10 minutes of investigation.
  • Process list stability: watch for rapid process churn or respawning.

Escalate If

CPU remains ≥85% after 10 minutes of investigation or top process cannot be identified.

STATUS CHANGE
2026-05-03 00:15 UTC
Auto-resolver: CPU at 33.2% (below 70% clear threshold) — clean check 1/2
STATUS CHANGE
2026-05-03 00:15 UTC
Auto-resolver: CPU at 33.2% (below 70% clear threshold) — clean check 1/2
STATUS CHANGE
2026-05-03 00:20 UTC
Auto-resolver: CPU at 33.2% (below 70% clear threshold) — clean check 2/2
STATUS CHANGE
2026-05-03 00:20 UTC
AUTO-RESOLVED: CPU sustained below 70% for 2 consecutive checks. Current value: 33.2%
·
HANDOFF
2026-05-03 15:50 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert • **Incident Summary:** CPU spike on claw-gateway1 reached 93% at 00:05 UTC on 2026-05-03; alert auto-resolved at 00:20 when CPU dropped to 33.2% and remained stable for 2 consecutive checks. • **What Happened:** Medium-severity alert triggered by upward CPU trend. No manual intervention was required; system self-corrected within ~16 minutes, suggesting transient process or temporary load spike rather than persistent issue. • **Current State:** RESOLVED. CPU stable at 33.2% as of last check. No traffic impact was observed during the spike. • **Recommended Follow-up:** Review claw-gateway1 logs and process list during the 00:05–00:20 window to identify what caused the spike (recent deployment, runaway process, memory pressure). Check for any configuration changes or deployment activity on that host around that time. • **Watch For:** Recurrence of CPU spikes on claw-gateway1 over the next 2–4 hours. If spike repeats or sustains >15 min, escalate for deeper investigation into resource contention or process behavior.
·
HANDOFF
2026-05-09 05:18 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert • **Incident:** CPU spike on claw-gateway1 reached 93% at 00:05 UTC on 2026-05-03; alert triggered as MEDIUM severity with degradation risk if sustained. • **Resolution:** Alert auto-resolved at 00:20 UTC when CPU dropped to 33.2% and remained below 70% threshold for 2 consecutive checks (15 min total). No manual intervention was required. • **Current State:** claw-gateway1 operating normally with CPU at 33.2%. No ongoing traffic impact or service degradation observed. • **Root Cause:** Undetermined—likely transient spike from runaway process, memory pressure, or recent deployment. Process has self-resolved; no persistent issues identified. • **Watch For:** Monitor claw-gateway1 CPU over next 2-4 hours for recurrence. If spike repeats, SSH in and run `top -b -n1 | head -20` to identify culprit process, then check recent logs (`journalctl -u claw-gateway -n 100 --since "5 min ago"`) for errors or deployment anomalies.
·
HANDOFF
2026-05-14 07:25 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert • **What Happened:** CPU spike on claw-gateway1 reached 93% at 00:05 UTC on 2026-05-03, triggering a MEDIUM severity alert. Trending upward with potential degradation risk if sustained >15 min, but no traffic impact observed at time of alert. • **Resolution:** Alert auto-resolved at 00:20 UTC after CPU dropped to 33.2% and remained below 70% threshold for 2 consecutive checks (5-minute span). No manual intervention was required. • **Current State:** Incident RESOLVED. claw-gateway1 CPU is stable at 33.2%. No ongoing issues detected. • **Root Cause (Suspected):** Likely a transient runaway process or memory pressure spike; possible trigger from recent deployment or config change. Root cause was not definitively identified before auto-resolution. • **Watch For:** Monitor claw-gateway1 CPU over next shift for recurrence. If spike returns, SSH in and run `top` to identify the consuming process, then check recent logs for deployment/restart events. Escalate if CPU remains >90% for >15 min.
·
HANDOFF
2026-05-23 13:02 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert • **Incident:** CPU spike on claw-gateway1 reached 93% at 00:05 UTC on 2026-05-03, triggered as MEDIUM severity alert. Trending upward with degradation risk if sustained >15 min. • **Resolution:** Alert auto-resolved at 00:20 UTC after CPU dropped to 33.2% and remained below 70% threshold for 2 consecutive checks (15-min window). • **Current State:** RESOLVED. Host is healthy with CPU stable at 33.2%. No ongoing traffic impact observed. • **Root Cause:** Likely runaway process or memory pressure; investigation incomplete as incident self-resolved. No manual intervention required. • **Watch For:** Monitor claw-gateway1 for CPU spikes over next 24 hours. If alert recurs, SSH to host and run `top -b -n1 | head -20` to identify resource consumer. Check recent deployments or config changes on this gateway.
·
HANDOFF
2026-05-29 04:57 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert • **What Happened:** CPU spike on claw-gateway1 reached 93% at 00:05 UTC on 2026-05-03, triggering a MEDIUM severity alert. No traffic impact observed at the time. • **Resolution:** Alert auto-resolved at 00:20 UTC after CPU dropped to 33.2% and remained below 70% threshold for two consecutive checks (5-min interval). No manual intervention required. • **Current State:** claw-gateway1 is healthy; CPU nominal at 33.2%. No runaway processes or service restarts detected. Root cause (runaway process vs. memory pressure) was not explicitly identified before resolution. • **Watch For:** Monitor for recurrence of CPU spikes on claw-gateway1. If spike repeats, check for recent deployments, config changes, or new workload patterns. Have runbook ready with `top` and `journalctl` diagnostics. • **No Action Required:** Incident fully resolved and auto-closed. System stable going into next shift.
·
HANDOFF
2026-05-31 14:59 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert • **Incident:** CPU spike on claw-gateway1 reached 93% at 00:05 UTC on 2026-05-03, triggered MEDIUM severity alert with degradation risk if sustained >15 min. • **Resolution:** Alert auto-resolved at 00:20 UTC when CPU dropped to 33.2% and remained below 70% clear threshold for 2 consecutive checks (15+ min apart). No manual intervention required. • **Root Cause:** Suspected runaway process or memory pressure; likely transient. No traffic impact observed during spike. • **Current State:** Fully resolved. claw-gateway1 CPU stable at 33.2% as of last check. • **Next Steps:** Monitor for recurrence over next shift. If spike returns, SSH to host and run `top -b -n1 | head -20` to identify top CPU consumer, then check recent logs (`journalctl -u claw-gateway -n 100`) for errors or deployments.
·
HANDOFF
2026-05-31 16:47 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert • **Incident:** CPU spike on claw-gateway1 reached 93% at 00:05 UTC on 2026-05-03, triggering a MEDIUM severity alert. No traffic impact observed; degradation risk if sustained >15 min. • **Resolution:** Alert auto-resolved at 00:20 UTC when CPU dropped to 33.2% and remained below 70% threshold for 2 consecutive checks (15-min span). • **Current State:** RESOLVED. claw-gateway1 operating normally with CPU at baseline levels. • **Watch For:** Monitor for recurrence of CPU spikes on claw-gateway1. If repeats, investigate runaway processes, memory pressure, or recent deployments/config changes via `top` and `journalctl` logs. • **No Action Required:** Incident auto-resolved with no manual intervention needed. Consider reviewing alert thresholds if false positives persist.
·
HANDOFF
2026-06-06 10:42 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert • **Incident:** CPU spike on claw-gateway1 reached 93% at 00:05 UTC on 2026-05-03, triggering a MEDIUM severity alert. No traffic impact observed at time of alert. • **Resolution:** Alert auto-resolved at 00:20 UTC after CPU dropped to 33.2% and remained below 70% threshold for 2 consecutive checks. Root cause not explicitly identified; likely transient process or memory pressure. • **Current State:** RESOLVED. CPU stable at 33.2% as of final check. No manual intervention was required. • **Recommended Follow-up:** If spike recurs, SSH to claw-gateway1 and run `top` to identify runaway processes. Check `journalctl` for errors or recent deployments that may have triggered the spike. • **Watch For:** Monitor claw-gateway1 CPU over next 2-4 hours; if spike repeats or sustains >15 min, escalate and investigate recent config changes or deployments.
·
HANDOFF
2026-06-13 02:53 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert • **Incident:** CPU spike on claw-gateway1 reached 93% at 00:05 UTC on 2026-05-03, triggering a MEDIUM severity alert. No traffic impact observed. • **Resolution:** Alert auto-resolved at 00:20 UTC after CPU dropped to 33.2% and remained stable below 70% threshold for two consecutive checks (15-minute span). • **Current State:** RESOLVED. claw-gateway1 CPU nominal at 33.2%; no ongoing degradation or service impact. • **Root Cause:** Likely runaway process or memory pressure; specific cause not identified during auto-resolution window. Consider reviewing deployment changes or process logs on claw-gateway1 if CPU spikes recur. • **Watch For:** Monitor claw-gateway1 for CPU spike recurrence. If alert triggers again, manually SSH and run `top` to identify the offending process before auto-resolution kicks in.
Update Status
Details
ID #20
Severity MEDIUM
Source infra_monitor
Status RESOLVED
Opened 2026-05-03 00:05