Live
#35 medium infra_monitor
CPU alert on claw-gateway1 — 93.4% (threshold: 90%)
Host: claw-gateway1 CAUSE: CPU exceeded the 90% warning threshold. IMPACT: Performance may degrade if the trend continues. ACTION: Monitor for sustained elevation; investigate if it persists beyond 15 minutes. CPU: 93.4% | Memory: 45.4%
Opened 2026-05-18 16:05 UTC · Resolved 2026-05-18 16:20 UTC
Handoff Notes ← Dashboard
Timeline
WEBHOOK
2026-05-18 16:05 UTC
Alert received from AI Infra Monitor. Host: claw-gateway1, Severity: MEDIUM
STATUS CHANGE
2026-05-18 16:05 UTC
OPEN -> INVESTIGATING (auto - low/medium severity)
CONTEXT AGGREGATED
2026-05-18 16:05 UTC
Sources available: 2/3 — Runbook: ✗ | Past incidents: ✓ | Infra health: ✓
Response Plan
2026-05-18 16:05 UTC

Severity

Medium. Single gateway host CPU spiking; service degradation likely if sustained, no outage yet.

Root Cause

  • Runaway process or sudden traffic spike to claw-gateway1
  • Memory pressure forcing swap/GC cycles (currently 45.4%, monitor for creep)

Actions

  1. SSH to claw-gateway1; run top -b -n1 | head -20 to identify top CPU consumer.
  2. Check recent deployments or config changes in last 30 min via deployment log.
  3. If single process >50% CPU: kill or restart service; if distributed, check inbound traffic (query LB metrics).
  4. Verify sister gateways (claw-gateway2, claw-gateway3) CPU normal; if elevated too, escalate to network/traffic team.
  5. Confirm CPU stabilizes below 85% within 5 min of intervention before standing down.

Watch

  • CPU trend: must flatten and drop below 85% within 10 min; if it hits 96%+ escalate immediately.
  • Memory usage: flag if it crosses 70%; swap activity signals deeper issue.

Escalate If

CPU remains >90% after process investigation, or affects >1 gateway host.

STATUS CHANGE
2026-05-18 16:15 UTC
Auto-resolver: CPU at 44.0% (below 70% clear threshold) — clean check 1/2
STATUS CHANGE
2026-05-18 16:15 UTC
Auto-resolver: CPU at 44.0% (below 70% clear threshold) — clean check 1/2
STATUS CHANGE
2026-05-18 16:20 UTC
Auto-resolver: CPU at 44.0% (below 70% clear threshold) — clean check 2/2
STATUS CHANGE
2026-05-18 16:20 UTC
AUTO-RESOLVED: CPU sustained below 70% for 2 consecutive checks. Current value: 44.0%
·
HANDOFF
2026-05-22 02:56 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **What happened:** CPU spike to 93.4% on claw-gateway1 at 16:05 UTC on 2026-05-18; alert auto-resolved at 16:20 after CPU dropped to 44% and remained stable for 2 consecutive checks. - **Root cause:** Likely runaway process or sudden traffic spike; memory pressure (45.4%) was also noted as a contributing factor. - **Actions taken:** System auto-resolved; no manual intervention was required. CPU returned to normal levels within 15 minutes. - **Current state:** RESOLVED. claw-gateway1 CPU stable at 44.0% as of last check. No service outage occurred. - **Watch for:** Monitor claw-gateway1 for CPU creep or recurrence. If spike repeats, SSH in and run `top` to identify the offending process; check deployment logs for recent changes. If memory usage continues climbing above 50%, escalate for deeper investigation.
·
HANDOFF
2026-05-23 17:16 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **What happened:** CPU spike to 93.4% on claw-gateway1 triggered MEDIUM alert at 16:05 UTC on 2026-05-18. Likely cause: runaway process or sudden traffic spike; memory pressure also noted (45.4%). - **What was done:** Alert auto-resolved at 16:20 after CPU dropped to 44.0% and sustained below 70% threshold for two consecutive checks. No manual intervention required. - **Current state:** Host is healthy; CPU stable at 44.0%. No service outage occurred. - **Watch for:** Monitor claw-gateway1 CPU and memory trends over next shift. If spike recurs, SSH in and run `top -b -n1 | head -20` to identify top CPU consumer. Check recent deployments/config changes as potential triggers. - **No runbook available:** Future oncall should document response steps for this recurring alert pattern.
·
HANDOFF
2026-05-29 04:57 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **Incident Summary:** CPU spike to 93.4% on claw-gateway1 triggered MEDIUM alert at 16:05 UTC on 2026-05-18. Likely causes identified: runaway process, sudden traffic spike, or memory pressure. - **Resolution:** Alert auto-resolved at 16:20 UTC after CPU dropped to 44.0% and remained below 70% clear threshold for 2 consecutive checks (~15 min sustained). - **Current State:** RESOLVED. claw-gateway1 CPU stable at 44.0%. No service outage reported. Alert cleared without manual intervention. - **Root Cause Unknown:** Spike was transient; underlying cause (process, traffic, or memory) not identified. Runbook unavailable at time of incident. - **Watch For:** Monitor claw-gateway1 CPU for recurring spikes over next shift. If spike recurs, SSH in and run `top` to identify top CPU consumer, then cross-reference recent deployments/config changes. Check memory levels (was at 45.4%) for creep toward swap/GC cycles.
·
HANDOFF
2026-05-31 14:58 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **Incident:** CPU spike to 93.4% on claw-gateway1 triggered MEDIUM alert at 16:05 UTC on 2026-05-18; auto-resolved at 16:20 after CPU dropped to 44.0% and remained stable for 2 consecutive checks. - **Likely causes identified:** Runaway process, sudden traffic spike, or memory pressure on the gateway host—root cause not definitively determined before auto-recovery. - **Current state:** RESOLVED. CPU sustained at ~44% well below the 70% clear threshold; no ongoing service degradation observed. - **What next shift should monitor:** Watch for CPU creep on claw-gateway1 over the next 24 hours. If spike recurs, SSH in and run `top` to identify the top CPU consumer; also check recent deployments or config changes. If a single process consistently consumes >50% CPU, consider restart or escalation. - **No action required now:** Incident fully auto-resolved; runbook unavailable but infra health and past incident context were sufficient for monitoring.
·
HANDOFF
2026-06-01 01:35 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **Incident:** CPU spike to 93.4% on claw-gateway1 triggered MEDIUM alert at 16:05 UTC on 2026-05-18; auto-resolved at 16:20 after CPU dropped to 44%. - **Root Cause:** Likely runaway process or sudden traffic spike; memory pressure possible. No root cause definitively identified before auto-resolution. - **Current State:** RESOLVED. CPU stable at 44.0% for past 2+ hours; no service outage reported. - **Actions Taken:** Alert auto-resolved via threshold clearing (CPU <70% for 2 consecutive checks). No manual intervention required. - **Watch For:** Monitor claw-gateway1 CPU trending over next shift. If spikes recur, SSH in and run `top` to identify runaway processes or check deployment logs for recent config changes in the 30 min prior to spike.
·
HANDOFF
2026-06-06 10:42 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **Incident:** CPU spike to 93.4% on claw-gateway1 triggered MEDIUM alert at 16:05 UTC on 2026-05-18; auto-resolved at 16:20 after CPU dropped to 44%. - **Root Cause:** Likely runaway process or sudden traffic spike; possible memory pressure. No investigation was performed before auto-resolution. - **Current State:** RESOLVED. CPU sustained below 70% for 2 consecutive checks (final reading: 44.0%). No service outage occurred. - **Watch For:** If CPU spikes recur on claw-gateway1, investigate using `top` to identify runaway processes. Check recent deployments and config changes within 30 minutes of spike. - **Next Steps:** Monitor claw-gateway1 CPU closely over next shift. If spike repeats, gather process list and deployment logs before resolution to determine root cause.
·
HANDOFF
2026-06-06 17:16 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **Incident:** CPU spike to 93.4% on claw-gateway1 triggered MEDIUM alert at 16:05 UTC on 2026-05-18; auto-resolved at 16:20 UTC after CPU dropped to 44%. - **Root Cause:** Likely runaway process or sudden traffic spike; memory pressure was also a suspected factor. No manual intervention was required—alert auto-resolved. - **Current State:** RESOLVED. CPU sustained below 70% for 2 consecutive checks and remains stable at 44%. - **Watch For:** Monitor claw-gateway1 for recurring CPU spikes. If this repeats, investigate recent deployments/config changes and identify top CPU consumer via `top` command. - **Next Steps:** No immediate action needed, but consider adding runbook documentation for claw-gateway1 CPU escalation procedures if unavailable.
·
HANDOFF
2026-06-06 17:16 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **Incident:** CPU spike to 93.4% on claw-gateway1 triggered MEDIUM alert at 16:05 UTC on 2026-05-18; auto-resolved at 16:20 after CPU dropped to 44% and remained stable. - **Root Cause:** Likely runaway process or sudden traffic spike; memory pressure was a secondary concern (was at 45.4% at time of resolution). - **Current State:** RESOLVED. CPU sustained below 70% threshold for 2+ consecutive checks. No manual intervention was required; incident auto-cleared. - **Watch For:** Monitor claw-gateway1 for CPU creep or recurrence of spikes. If spike recurs, SSH in and run `top -b -n1 | head -20` to identify the top CPU consumer. Check deployment logs for recent changes within 30 minutes of any future spike. - **No Action Required:** This was a transient spike that resolved automatically. Escalate only if pattern repeats within next 24 hours.
·
HANDOFF
2026-06-06 17:17 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **Incident:** CPU spike to 93.4% on claw-gateway1 triggered MEDIUM alert at 16:05 UTC on 2026-05-18; auto-resolved at 16:20 after CPU dropped to 44.0% and sustained below 70% threshold for 2 consecutive checks. - **Root Cause:** Likely runaway process or sudden traffic spike; memory pressure was also suspected but not confirmed before auto-resolution. - **Current State:** RESOLVED. CPU currently at 44.0% and stable. No manual intervention was required; alert auto-cleared after threshold criteria met. - **Watch For:** Monitor claw-gateway1 for CPU creep and recurrence. If spike returns, SSH to host and run `top` to identify top CPU consumer. Check recent deployments or config changes from the 16:00–16:05 UTC window. - **No Action Required:** This was a transient spike with no confirmed root cause. If it recurs, escalate for process-level investigation.
·
HANDOFF
2026-06-09 03:33 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **Incident:** CPU spike to 93.4% on claw-gateway1 triggered MEDIUM alert at 16:05 UTC on 2026-05-18; auto-resolved at 16:20 after CPU dropped to 44% and sustained below 70% threshold for 2 consecutive checks. - **Root cause:** Likely runaway process or sudden traffic spike; memory pressure was also a possible factor. No manual intervention was required—alert resolved automatically. - **Current state:** RESOLVED. CPU stable at 44% as of last check. No service outage occurred. - **Next shift should monitor:** Watch claw-gateway1 CPU trends over next 24–48 hours for recurrence. If spikes repeat, investigate deployment logs (last 30 min before incident) and run `top` to identify the top CPU consumer process. - **No action required** unless CPU alert triggers again. Runbook was unavailable at time of incident; consider updating monitoring documentation for gateway hosts.
Update Status
Details
ID #35
Severity MEDIUM
Source infra_monitor
Status RESOLVED
Opened 2026-05-18 16:05