#52 medium infra_monitor
CPU alert on claw-gateway1 — 96.2% (threshold: 90%)
Host: claw-gateway1
CAUSE: CPU exceeded the 90% warning threshold.
IMPACT: Performance may degrade if the trend continues.
ACTION: Monitor for sustained elevation; investigate if it persists beyond 15 minutes.
CPU: 96.2% | Memory: 41.5%
Opened 2026-06-11 00:05 UTC · Resolved 2026-06-11 00:20 UTC
Timeline
WEBHOOK
2026-06-11 00:05 UTC
Alert received from AI Infra Monitor. Host: claw-gateway1, Severity: MEDIUM
STATUS CHANGE
2026-06-11 00:05 UTC
OPEN -> INVESTIGATING (auto - low/medium severity)
CONTEXT AGGREGATED
2026-06-11 00:05 UTC
Sources available: 3/3 — Runbook: ✓ | Past incidents: ✓ | Infra health: ✓
Response Plan
2026-06-11 00:05 UTC

Severity

Medium: gateway node CPU at 96.2%, above the 90% warning threshold; service may degrade if this is sustained beyond 15 minutes.

Root Cause

  • Possible runaway process consuming CPU (memory healthy at 41.5%, disk normal)
  • Possible resource leak or unoptimized query spike on the gateway

Actions

  1. SSH to claw-gateway1; run top -b -n1 | head -20 to identify top CPU consumer
  2. If a single process is above 80% CPU: kill/restart that process; if the load is distributed across many processes, proceed to step 3
  3. Check for deployments or config changes in the last 2 hours via git log (for example, git log --since="2 hours ago")
  4. If CPU remains >80% after 5 minutes, trigger graceful restart: sudo systemctl restart claw-gateway
  5. If CPU still >80% after restart, authorize host reboot (expect 2–3 min downtime); notify stakeholders first
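
The branch in step 2 can be sketched as a small decision function. This is only an illustration under stated assumptions: the Proc type, the triage name, and the sample process names are hypothetical and are not part of the monitor; the 80% limit comes from the plan above.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Proc(NamedTuple):
    """One row of a process listing (hypothetical structure)."""
    pid: int
    name: str
    cpu_percent: float


def triage(procs: Sequence[Proc],
           single_proc_limit: float = 80.0) -> Tuple[str, Optional[Proc]]:
    """Pick the next action from a per-process CPU snapshot.

    Returns ("restart_process", proc) when the busiest process is above
    the limit, otherwise ("check_recent_changes", None), which corresponds
    to moving on to step 3.
    """
    if not procs:
        return ("check_recent_changes", None)
    busiest = max(procs, key=lambda p: p.cpu_percent)
    if busiest.cpu_percent > single_proc_limit:
        return ("restart_process", busiest)
    return ("check_recent_changes", None)


if __name__ == "__main__":
    # Made-up sample data, not taken from claw-gateway1.
    snapshot = [Proc(101, "worker-a", 91.0), Proc(102, "worker-b", 3.2)]
    action, proc = triage(snapshot)
    print(action, proc)
```

Feeding it a parsed snapshot from step 1 would give the responder a single, explicit answer before anything is killed or restarted.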

Watch

  • CPU trend: alert if it stays above 85% for 10+ minutes or spikes above 98%
  • Response latency / error rates on gateway endpoints
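
The CPU watch rule above could be evaluated over timestamped samples roughly as follows. This is a sketch, not the monitor's actual logic: the watch_alert name and the sample format are assumptions, and only the 85%, 10-minute, and 98% values come from the plan.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple


def watch_alert(samples: Sequence[Tuple[datetime, float]],
                sustain_limit: float = 85.0,
                sustain_for: timedelta = timedelta(minutes=10),
                spike_limit: float = 98.0) -> Optional[str]:
    """Scan (timestamp, cpu_percent) samples given in time order.

    Returns "spike" as soon as a sample exceeds spike_limit, "sustained"
    once CPU has stayed above sustain_limit for at least sustain_for,
    and None if neither condition is met.
    """
    run_start = None  # start time of the current run above sustain_limit
    for ts, cpu in samples:
        if cpu > spike_limit:
            return "spike"
        if cpu > sustain_limit:
            if run_start is None:
                run_start = ts
            if ts - run_start >= sustain_for:
                return "sustained"
        else:
            run_start = None  # any dip at or below the limit ends the run
    return None


if __name__ == "__main__":
    t0 = datetime(2026, 6, 11, 0, 5)
    # Hypothetical readings at 5-minute intervals.
    readings = [(t0 + timedelta(minutes=5 * i), cpu)
                for i, cpu in enumerate([86.0, 87.5, 88.0])]
    print(watch_alert(readings))
```

Resetting the run on any dip is a deliberate choice here; a monitor that tolerates brief dips would need a different rule.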

Escalate If

CPU remains >80% after restart attempt or host becomes unresponsive.

STATUS CHANGE
2026-06-11 00:15 UTC
Auto-resolver: CPU at 30.1% (below 70% clear threshold) — clean check 1/2
STATUS CHANGE
2026-06-11 00:20 UTC
Auto-resolver: CPU at 30.1% (below 70% clear threshold) — clean check 2/2
STATUS CHANGE
2026-06-11 00:20 UTC
AUTO-RESOLVED: CPU sustained below 70% for 2 consecutive checks. Current value: 30.1%
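
The auto-resolve behaviour in this timeline (alert at 90%, clear only after 2 consecutive checks below 70%) can be sketched as below. The function name and signature are hypothetical, and resetting the counter on a reading at or above the clear threshold is an assumption drawn from the word "consecutive", not something the log confirms.

```python
from typing import Optional, Sequence


def auto_resolve(readings: Sequence[float],
                 clear_threshold: float = 70.0,
                 required_clean: int = 2) -> Optional[int]:
    """Return the index of the reading at which the incident resolves.

    A reading below clear_threshold counts as a clean check. The incident
    resolves once required_clean clean checks occur in a row; a reading at
    or above the threshold resets the count (assumed behaviour).
    Returns None if the incident never resolves.
    """
    clean = 0
    for i, cpu in enumerate(readings):
        if cpu < clear_threshold:
            clean += 1
            if clean >= required_clean:
                return i
        else:
            clean = 0
    return None


if __name__ == "__main__":
    # Readings from this incident: 96.2% at open, then 30.1% twice.
    print(auto_resolve([96.2, 30.1, 30.1]))  # -> 2
```

Clearing at 70% rather than at the 90% alert threshold gives the check some hysteresis, so a value hovering near 90% does not repeatedly open and resolve the incident.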
Details
ID #52
Severity MEDIUM
Source infra_monitor
Status RESOLVED
Opened 2026-06-11 00:05 UTC