#39 medium infra_monitor
CPU alert on claw-gateway1 — 90.0% (threshold: 90%)
Host: claw-gateway1 CAUSE: CPU exceeded the 90% warning threshold. IMPACT: Performance may degrade if the trend continues. ACTION: Monitor for sustained elevation; investigate if it persists beyond 15 minutes. CPU: 90.0% | Memory: 44.0%
Opened 2026-05-27 16:50 UTC · Resolved 2026-05-27 17:09 UTC
Timeline
WEBHOOK
2026-05-27 16:50 UTC
Alert received from AI Infra Monitor. Host: claw-gateway1, Severity: MEDIUM
STATUS CHANGE
2026-05-27 16:50 UTC
OPEN -> INVESTIGATING (auto - low/medium severity)
CONTEXT AGGREGATED
2026-05-27 16:50 UTC
Sources available: 3/3 — Runbook: ✓ | Past incidents: ✓ | Infra health: ✓
Response Plan
2026-05-27 16:50 UTC

Severity

Medium: CPU is at the 90% threshold with no observed impact yet; investigate promptly in case the load keeps rising.

Root Cause

  • Runaway process or resource leak on claw-gateway1
  • Traffic spike or batch job consuming CPU

Actions

  1. SSH to claw-gateway1; run top -b -n 1 | head -20 to identify top CPU consumers.
  2. If a single process exceeds 50% CPU, kill or restart it; if the load is distributed, check active connections with netstat -an | grep ESTABLISHED | wc -l.
  3. If CPU remains >80% after 3 minutes, restart ADOStack services: systemctl restart ado-stack.
  4. If CPU still >80% after restart, trigger host reboot (2–3 min downtime expected).
  5. Post-incident: review logs for anomalies; check for cron jobs or scheduled tasks.
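
The escalation ladder in steps 2 to 4 can be sketched as a small decision function. This is an illustrative sketch only, not the monitor's actual logic: the function name `next_action`, its parameters, and the returned strings are hypothetical, and the 50% and 80% cut-offs are taken from the plan above.

```python
from typing import Optional


def next_action(top_process_cpu: float,
                cpu_after_3_min: Optional[float] = None,
                cpu_after_restart: Optional[float] = None) -> str:
    """Map observed CPU readings (in percent) to the next step of the plan.

    All names here are hypothetical; only the thresholds come from the plan.
    """
    # Step 4: the service restart has already been tried.
    if cpu_after_restart is not None:
        if cpu_after_restart > 80.0:
            return "reboot host"
        return "post-incident review"

    # Step 3: three minutes have passed since the first intervention.
    if cpu_after_3_min is not None:
        if cpu_after_3_min > 80.0:
            return "systemctl restart ado-stack"
        return "post-incident review"

    # Step 2: first look at the output of top.
    if top_process_cpu > 50.0:
        return "kill or restart top process"
    return "check established connections"


print(next_action(72.0))                        # single hot process
print(next_action(12.0, cpu_after_3_min=88.0))  # still hot after 3 minutes
```

Keeping the function pure (readings in, action out) means the ladder can be checked without touching the host.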

Watch

  • CPU trend (alert if sustained >85% for 5 min; escalate at >95%).
  • Process list for memory leaks or stuck threads.
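
One possible reading of the CPU watch rule, sketched in code. It assumes one CPU sample per minute (the sampling interval is not stated in this incident), and the function name `classify` and its return values are hypothetical.

```python
from typing import Sequence


def classify(samples: Sequence[float], window: int = 5) -> str:
    """Classify a CPU series (percent, oldest first, assumed one sample per minute).

    Returns "escalate" if the latest sample is above 95%,
    "alert" if the last `window` samples are all above 85%,
    and "ok" otherwise.
    """
    if not samples:
        return "ok"
    if samples[-1] > 95.0:
        return "escalate"
    recent = samples[-window:]
    # Require a full window so a short series cannot count as "sustained".
    if len(recent) == window and all(s > 85.0 for s in recent):
        return "alert"
    return "ok"


print(classify([90.0, 90.0, 90.0, 90.0, 90.0]))  # sustained above 85%
print(classify([90.0, 90.0, 84.0, 90.0, 90.0]))  # one dip breaks the streak
```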

Escalate If

CPU remains >85% after service restart or top process is unidentifiable.

STATUS CHANGE
2026-05-27 17:04 UTC
Auto-resolver: CPU at 33.8% (below 70% clear threshold) — clean check 1/2
STATUS CHANGE
2026-05-27 17:05 UTC
Auto-resolver: CPU at 33.8% (below 70% clear threshold) — clean check 1/2
STATUS CHANGE
2026-05-27 17:09 UTC
Auto-resolver: CPU at 33.8% (below 70% clear threshold) — clean check 2/2
STATUS CHANGE
2026-05-27 17:10 UTC
AUTO-RESOLVED: CPU sustained below 70% for 2 consecutive checks. Current value: 33.8%
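
The clear rule stated here (CPU below 70% for 2 consecutive checks) can be sketched as a small counter. This is a minimal illustration under assumptions, not the auto-resolver's real implementation: the class name `AutoResolver` and its method are hypothetical, and resetting the counter on any reading at or above the threshold is an assumed interpretation of "consecutive".

```python
class AutoResolver:
    """Track consecutive clean checks against a clear threshold (hypothetical sketch)."""

    def __init__(self, clear_threshold: float = 70.0, required_clean: int = 2) -> None:
        self.clear_threshold = clear_threshold
        self.required_clean = required_clean
        self.clean = 0  # number of consecutive checks below the threshold

    def observe(self, cpu: float) -> bool:
        """Record one check; return True once enough consecutive checks are clean."""
        if cpu < self.clear_threshold:
            self.clean += 1
        else:
            # Assumption: any reading at or above the threshold restarts the count.
            self.clean = 0
        return self.clean >= self.required_clean


resolver = AutoResolver()
print(resolver.observe(33.8))  # first clean check
print(resolver.observe(33.8))  # second clean check, incident can resolve
```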
·
HANDOFF
2026-05-29 04:57 UTC
Handoff notes generated:

# Shift Handoff Notes: claw-gateway1 CPU Alert

- **Incident Summary:** CPU alert triggered on claw-gateway1 at 16:50 UTC when usage hit 90% (medium severity). Likely causes identified as runaway process, resource leak, or traffic spike.
- **Resolution:** CPU auto-resolved at 17:10 UTC after dropping to 33.8% and remaining below 70% threshold for 2 consecutive checks (~5 min duration). No manual intervention was required.
- **Current State:** claw-gateway1 operating normally with CPU at 33.8%. Incident marked RESOLVED.
- **Watch for:** Monitor for CPU spikes trending back toward threshold. If recurrence occurs, investigate top CPU consumers with `top` command and check active connections with `netstat` to identify resource-intensive processes or traffic anomalies.
- **No further action required** unless alert re-triggers; escalate to platform team if pattern repeats.
·
HANDOFF
2026-05-31 15:00 UTC
Handoff notes generated:

# Shift Handoff Notes: claw-gateway1 CPU Alert

- **Incident Summary:** CPU alert (90%) triggered on claw-gateway1 at 16:50 UTC on 2026-05-27; medium severity. Auto-resolved at 17:10 UTC after CPU dropped to 33.8% and sustained below 70% threshold for 2 consecutive checks.
- **Root Cause:** Not definitively identified. Suspected runaway process, resource leak, traffic spike, or batch job, but CPU normalized before manual investigation was needed.
- **Resolution:** Alert auto-resolved via monitoring thresholds; no manual intervention required. Host returned to healthy state within ~20 minutes.
- **Current State:** claw-gateway1 operating normally at 33.8% CPU as of last check. Incident marked RESOLVED.
- **Next Steps for On-Shift Team:** Monitor claw-gateway1 for CPU spikes over next 24–48 hours. If alert re-triggers, SSH in and run `top -b -n 1 | head -20` to identify top CPU consumers and `netstat -an | grep ESTABLISHED | wc -l` to check connection count. If sustained >80%, consider restart or escalation.
·
HANDOFF
2026-06-01 04:39 UTC
Handoff notes generated:

# Shift Handoff Notes: claw-gateway1 CPU Alert

- **Incident:** CPU alert (90%) triggered on claw-gateway1 at 16:50 UTC on 2026-05-27 (medium severity); auto-resolved at 17:10 UTC after CPU dropped to 33.8% and sustained below 70% for 2 consecutive checks.
- **Root Cause:** Likely a transient spike: either a runaway process, resource leak, or temporary traffic/batch job spike. No specific process was identified before auto-resolution.
- **Current State:** RESOLVED. CPU currently stable at 33.8% with no ongoing alerts.
- **Watch For:** Monitor claw-gateway1 for CPU creep or recurring spikes >80%. If this recurs, SSH in and run `top -b -n 1 | head -20` to identify the culprit (single process vs. distributed load), then escalate or restart as needed.
- **No Further Action Required:** Incident resolved automatically; escalate only if pattern repeats within next shift.
·
HANDOFF
2026-06-06 10:42 UTC
Handoff notes generated:

# Shift Handoff Notes: claw-gateway1 CPU Alert

- **Incident:** CPU alert triggered on claw-gateway1 at 16:50 UTC (2026-05-27) when usage spiked to 90%; medium severity.
- **Resolution:** Alert auto-resolved at 17:10 UTC after CPU dropped to 33.8% and remained below 70% threshold for 2 consecutive checks (~5 min duration).
- **Current State:** Host is healthy with CPU at normal levels; no manual intervention was required.
- **Likely Cause:** Suspected runaway process, resource leak, or temporary traffic/batch job spike; root cause not definitively identified due to auto-resolution.
- **Watch For:** Monitor claw-gateway1 CPU trends over next shift. If alert recurs, SSH in and run `top -b -n 1 | head -20` to identify top CPU consumers; check active connections with `netstat -an | grep ESTABLISHED | wc -l` if load appears distributed.
·
HANDOFF
2026-06-09 09:32 UTC
Handoff notes generated:

# Shift Handoff Notes: claw-gateway1 CPU Alert

- **Incident:** CPU spike to 90% on claw-gateway1 triggered medium-severity alert on 2026-05-27 at 16:50 UTC; auto-resolved at 17:10 UTC after CPU dropped to 33.8% and sustained below 70% for two consecutive checks.
- **Root Cause:** Not definitively identified; likely triggered by temporary traffic spike, batch job, or transient process, as CPU normalized without manual intervention.
- **Current State:** RESOLVED. claw-gateway1 CPU stable at 33.8% with no active alerts. Host is operating normally.
- **Watch For:** Monitor claw-gateway1 CPU trends over next shift. If alert recurs or CPU shows sustained upward trend >80%, investigate running processes with `top` and check for runaway jobs or traffic spikes. Review connection count with `netstat -an | grep ESTABLISHED | wc -l` if needed.
- **Next Steps:** No immediate action required. Consider reviewing logs from 16:45–17:15 UTC if pattern repeats to identify root cause.
Details
ID #39
Severity MEDIUM
Source infra_monitor
Status RESOLVED
Opened 2026-05-27 16:50