Live
#54 medium infra_monitor
CPU alert on claw-gateway1 — 95.7% (threshold: 90%)
Host: claw-gateway1 CAUSE: CPU exceeded the 90% warning threshold. IMPACT: Performance may degrade if the trend continues. ACTION: Monitor for sustained elevation; investigate if it persists beyond 15 minutes. CPU: 95.7% | Memory: 42.7%
Opened 2026-06-12 00:05 UTC · Resolved 2026-06-12 00:10 UTC
Handoff Notes ← Dashboard
Timeline
WEBHOOK
2026-06-12 00:05 UTC
Alert received from AI Infra Monitor. Host: claw-gateway1, Severity: MEDIUM
STATUS CHANGE
2026-06-12 00:05 UTC
OPEN -> INVESTIGATING (auto - low/medium severity)
CONTEXT AGGREGATED
2026-06-12 00:05 UTC
Sources available: 3/3 — Runbook: ✓ | Past incidents: ✓ | Infra health: ✓
Response Plan
2026-06-12 00:05 UTC

Severity

P1 (Critical). Single gateway at 95.7% CPU with +67% upward trend; performance degradation imminent.

Root Cause

  • Runaway process or resource leak on claw-gateway1
  • Sustained traffic spike or inefficient query pattern

Actions

  1. Acknowledge alert via Telegram (6055821277).
  2. SSH to claw-gateway1; identify top CPU consumer with top or ps aux --sort=-%cpu; kill if runaway.
  3. Restart affected ADOStack service(s); monitor CPU for 2 min.
  4. If CPU remains >80% after restart, trigger escalation via oncall.ado-runner.com API (curl command in runbook).
  5. If CPU still >80% at 5 min mark, execute sudo reboot (accept 2–3 min downtime).

Watch

  • CPU % on claw-gateway1 (target: <70% within 5 min).
  • Error rate / latency on dependent services post-restart.

Escalate If

CPU sustains >80% after service restart or host remains unreachable post-reboot.

STATUS CHANGE
2026-06-12 00:10 UTC
Resolved after verification: live CPU is 11.6%, memory 50.8%, no sustained high-CPU process remains. Duplicate medium alert path has been adjusted to stop opening On-Call incidents for yellow threshold breaches.
Update Status
Details
ID #54
Severity MEDIUM
Source infra_monitor
Status RESOLVED
Opened 2026-06-12 00:05