#42 medium infra_monitor
CPU alert on claw-gateway1 — 96.1% (threshold: 90%)
Host: claw-gateway1
CAUSE: CPU exceeded the 90% warning threshold.
IMPACT: Performance may degrade if the trend continues.
ACTION: Monitor for sustained elevation; investigate if it persists beyond 15 minutes.
CPU: 96.1% | Memory: 40.2%
Opened 2026-05-31 00:00 UTC · Resolved 2026-05-31 00:20 UTC
Timeline
WEBHOOK
2026-05-31 00:00 UTC
Alert received from AI Infra Monitor. Host: claw-gateway1, Severity: MEDIUM
STATUS CHANGE
2026-05-31 00:00 UTC
OPEN -> INVESTIGATING (auto - low/medium severity)
CONTEXT AGGREGATED
2026-05-31 00:00 UTC
Sources available: 3/3 — Runbook: ✓ | Past incidents: ✓ | Infra health: ✓
Response Plan
2026-05-31 00:00 UTC

Severity

Medium: a single gateway at 96.1% CPU; risk of performance degradation if sustained for more than 15 min.

Root Cause

  • Suspected: runaway process consuming CPU on claw-gateway1
  • Suspected: service memory leak or inefficient query

Actions

  1. Acknowledge alert in Telegram (chat ID: 6055821277).
  2. Run `top -b -n 1 | head -20` and `ps aux --sort=-%cpu | head -10` to identify the offending process.
  3. Stop or restart the affected ADOStack service; wait 2–3 min between restarts (max 3 attempts).
  4. If CPU remains >80% after the restarts, escalate via curl to oncall.ado-runner.com or the Telegram action button.
  5. Last resort: sudo reboot (2–3 min downtime expected).
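
Step 2 above can also be scripted. The following is a minimal Python sketch, not part of the runbook: the helper names `parse_ps_aux` and `top_cpu_processes` are hypothetical, and it assumes the standard 11-column `ps aux` layout from procps on Linux.

```python
import subprocess


def parse_ps_aux(output: str, limit: int = 5) -> list[dict]:
    """Parse `ps aux` text and return the `limit` rows with the highest %CPU."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        # ps aux prints 11 columns; COMMAND is last and may contain spaces.
        parts = line.split(None, 10)
        if len(parts) < 11:
            continue  # ignore malformed or truncated rows
        rows.append({
            "user": parts[0],
            "pid": int(parts[1]),
            "cpu": float(parts[2]),
            "mem": float(parts[3]),
            "command": parts[10],
        })
    # Sort here as well, so the result does not depend on ps ordering.
    rows.sort(key=lambda r: r["cpu"], reverse=True)
    return rows[:limit]


def top_cpu_processes(limit: int = 5) -> list[dict]:
    """Run the same ps command as step 2 and parse its output."""
    result = subprocess.run(
        ["ps", "aux", "--sort=-%cpu"],  # --sort is a procps (Linux) option
        capture_output=True, text=True, check=True,
    )
    return parse_ps_aux(result.stdout, limit)
```

Calling `top_cpu_processes()` on the host returns the heaviest processes as dictionaries, which makes it easier to attach the offending PID and command to the incident record before any restart.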

Watch

  • CPU trend over next 15 min (target: drop below 90% within 5 min of restart).
  • Memory for creep (currently 40.2%, stable).

Escalate If

CPU sustained >80% after 3 service restart attempts or does not drop below 90% within 15 minutes.
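
The restart and escalation policy in this plan can be expressed as a small decision function. This is a sketch under stated assumptions: the function names are hypothetical, the 150-second wait is an assumed midpoint of the 2–3 min window, and `read_cpu` and `restart_service` are placeholders for whatever the operator actually uses.

```python
import time
from typing import Callable

MAX_RESTARTS = 3         # "max 3 attempts" in the Actions list
RESTART_WAIT_S = 150     # assumed midpoint of the 2-3 min wait between restarts
POST_RESTART_CPU = 80.0  # escalate if CPU stays above this after the restarts
WARN_CPU = 90.0          # alert threshold
DEADLINE_MIN = 15        # CPU must drop below WARN_CPU within this window


def should_escalate(cpu_percent: float, restarts_done: int,
                    minutes_elapsed: float) -> bool:
    """Return True when either 'Escalate If' condition holds."""
    restarts_exhausted = (restarts_done >= MAX_RESTARTS
                          and cpu_percent > POST_RESTART_CPU)
    deadline_missed = (minutes_elapsed >= DEADLINE_MIN
                       and cpu_percent >= WARN_CPU)
    return restarts_exhausted or deadline_missed


def restart_until_recovered(read_cpu: Callable[[], float],
                            restart_service: Callable[[], None],
                            wait: Callable[[float], None] = time.sleep) -> bool:
    """Restart up to MAX_RESTARTS times; return True once CPU is at or below 80%."""
    for _ in range(MAX_RESTARTS):
        restart_service()
        wait(RESTART_WAIT_S)  # let the service settle before re-checking
        if read_cpu() <= POST_RESTART_CPU:
            return True
    return False  # caller should escalate
```

Passing `wait` as a parameter keeps the loop testable without real delays; in production the default `time.sleep` applies.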

STATUS CHANGE
2026-05-31 00:15 UTC
Auto-resolver: CPU at 31.3% (below 70% clear threshold) — clean check 1/2
STATUS CHANGE
2026-05-31 00:20 UTC
Auto-resolver: CPU at 31.3% (below 70% clear threshold) — clean check 2/2
STATUS CHANGE
2026-05-31 00:20 UTC
AUTO-RESOLVED: CPU sustained below 70% for 2 consecutive checks. Current value: 31.3%
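
The auto-resolver entries above follow a "N consecutive clean checks" rule. The sketch below illustrates that rule in Python; the class name `AutoResolver` is hypothetical, and resetting the streak on a reading at or above the clear threshold is an assumption based on the word "consecutive", not confirmed behavior of the monitor.

```python
from dataclasses import dataclass


@dataclass
class AutoResolver:
    clear_threshold: float = 70.0  # "below 70% clear threshold"
    required_clean: int = 2        # "2 consecutive checks"
    clean_streak: int = 0

    def observe(self, cpu_percent: float) -> str:
        """Record one check and return the resulting status line."""
        if cpu_percent < self.clear_threshold:
            self.clean_streak += 1
        else:
            # Assumption: any reading at or above the threshold resets the streak.
            self.clean_streak = 0
        if self.clean_streak >= self.required_clean:
            return "AUTO-RESOLVED"
        return f"clean check {self.clean_streak}/{self.required_clean}"
```

Feeding the readings from this incident (96.1, then 31.3 twice) yields "clean check 0/2", "clean check 1/2", and "AUTO-RESOLVED", matching the sequence logged at 00:15 and 00:20 UTC.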
·
HANDOFF
2026-06-06 10:43 UTC
Handoff notes generated:

# Shift Handoff Notes: CPU Alert on claw-gateway1

- **What happened:** CPU spike to 96.1% on claw-gateway1 triggered MEDIUM severity alert at 00:00 UTC on 2026-05-31. Likely cause: runaway process or service memory leak on ADOStack.
- **What was done:** Alert auto-resolved after 20 minutes when CPU dropped to 31.3% and remained below 70% threshold for 2 consecutive clean checks. Root cause of initial spike was not explicitly identified before resolution.
- **Current state:** RESOLVED. claw-gateway1 CPU stable at 31.3%. No manual intervention was required; the system self-recovered.
- **Watch for:** Monitor claw-gateway1 CPU over next 2–4 hours for recurrence. If spike returns, manually investigate top processes and ADOStack service logs to identify persistent memory leak or inefficient queries before auto-resolution masks the issue again.
- **No action required:** Alert is closed and infrastructure is healthy. Escalate only if CPU exceeds 90% again within this shift.
·
HANDOFF
2026-06-06 17:16 UTC
Handoff notes generated:

# Shift Handoff Notes: CPU Alert on claw-gateway1

- **What happened:** CPU spiked to 96.1% on claw-gateway1 at 00:00 UTC on 2026-05-31, triggering a MEDIUM severity alert. Suspected root cause was a runaway process or service memory leak.
- **What was done:** Alert auto-resolved at 00:20 UTC after CPU dropped to 31.3% and remained below 70% threshold for two consecutive checks (5-min interval). No manual intervention was required.
- **Current state:** RESOLVED. claw-gateway1 CPU is stable at 31.3%. Gateway is operating normally with no performance degradation observed.
- **Watch for:** Monitor claw-gateway1 CPU over the next 2–4 hours for any recurrence of spikes. If CPU exceeds 80% again, investigate the specific process using `top` and `ps aux` commands; may indicate a persistent memory leak or inefficient service query requiring restart or code fix.
- **Escalation:** If spike recurs within this shift, check ADOStack service logs and consider a controlled restart. Escalate to Platform team if pattern repeats across multiple shifts.
·
HANDOFF
2026-06-06 17:16 UTC
Handoff notes generated:

# Shift Handoff Notes: CPU Alert on claw-gateway1

- **Incident:** CPU on claw-gateway1 spiked to 96.1% on 2026-05-31 at 00:00 UTC, triggering a MEDIUM severity alert; auto-resolved at 00:20 UTC after CPU dropped to 31.3% and stayed below 70% for 2 consecutive checks.
- **Root Cause:** Suspected runaway process or service memory leak causing temporary CPU spike; exact offending process not identified in logs before auto-recovery.
- **Current State:** RESOLVED. CPU stable at 31.3%; gateway operating normally with no active alerts.
- **Watch For:** Monitor claw-gateway1 CPU trends over next 2–4 hours for recurrence. If CPU spikes >90% again, manually run `top` and `ps aux --sort=-%cpu` to identify the specific process; consider service restart or memory leak investigation.
- **Follow-up:** Review ADOStack service logs for errors or inefficient queries if this recurs within the week.
·
HANDOFF
2026-06-06 17:17 UTC
Handoff notes generated:

# Shift Handoff Notes: CPU Alert on claw-gateway1

- **Incident:** MEDIUM severity CPU alert triggered on claw-gateway1 (96.1% at 2026-05-31 00:00 UTC); suspected runaway process or service memory leak.
- **Resolution:** CPU auto-recovered to 31.3% within 20 minutes; alert auto-resolved at 00:20 UTC after passing 2 consecutive clean checks below 70% threshold.
- **Current State:** claw-gateway1 operating normally with CPU stable at 31.3%; no manual intervention was required.
- **Watch For:** Monitor claw-gateway1 for CPU creep or recurring spikes >80%. If spike recurs, identify offending process via `top` and `ps aux` commands, then restart affected ADOStack service (max 3 attempts with 2–3 min between restarts).
- **Runbook Available:** Refer to AI Infra Monitor runbook if similar alerts occur; full context aggregation (runbook, past incidents, infra health) is available in the incident record.
·
HANDOFF
2026-06-09 08:32 UTC
Handoff notes generated:

# Shift Handoff Notes: CPU Alert on claw-gateway1

- **Incident:** CPU on claw-gateway1 spiked to 96.1% on 2026-05-31 at 00:00 UTC, triggering a MEDIUM severity alert; suspected runaway process or service memory leak based on AI analysis.
- **Resolution:** CPU automatically normalized to 31.3% within 20 minutes and remained stable. Auto-resolver confirmed resolution after 2 consecutive clean checks below 70% threshold; incident closed at 00:20 UTC.
- **Current State:** claw-gateway1 operating normally at ~31% CPU. No manual intervention was required.
- **Watch for:** Monitor claw-gateway1 for CPU spikes over the next shift. If CPU exceeds 80%, identify the offending process using `top` and `ps aux --sort=-%cpu` and consider restarting the affected ADOStack service. Escalate if the issue recurs within 24 hours or a pattern emerges.
- **Root Cause:** Not definitively identified. The transient spike may indicate a periodic workload surge, memory leak, or inefficient query. Review application logs and service metrics if the alert repeats.
Details
ID #42
Severity MEDIUM
Source infra_monitor
Status RESOLVED
Opened 2026-05-31 00:00