Live
#23 medium infra_monitor
CPU alert on claw-gateway1 — 96.1% (threshold: 90%)
Host: claw-gateway1 CAUSE: CPU exceeded the 90% warning threshold. IMPACT: Performance may degrade if the trend continues. ACTION: Monitor for sustained elevation; investigate if it persists beyond 15 minutes. CPU: 96.1% | Memory: 51.7%
Opened 2026-05-08 00:00 UTC · Resolved 2026-05-08 00:21 UTC
Handoff Notes ← Dashboard
Timeline
WEBHOOK
2026-05-08 00:00 UTC
Alert received from AI Infra Monitor. Host: claw-gateway1, Severity: MEDIUM
STATUS CHANGE
2026-05-08 00:00 UTC
OPEN -> INVESTIGATING (auto - low/medium severity)
CONTEXT AGGREGATED
2026-05-08 00:00 UTC
Sources available: 3/3 — Runbook: ✓ | Past incidents: ✓ | Infra health: ✓
Response Plan
2026-05-08 00:00 UTC

Severity

P1: CPU 96.1% on gateway; potential service degradation if sustained.

Root Cause

  • Runaway process consuming CPU (identify via top)
  • Service memory leak or inefficient query loop

Actions

  1. Run top -b -n 1 | head -20 to identify culprit process.
  2. Stop affected ADOStack service; monitor CPU drop for 2 min.
  3. Restart service (up to 3 attempts); success = CPU <80%.
  4. If CPU persists >80% after restarts, POST escalation to oncall.ado-runner.com.
  5. Request approval; reboot host if escalation authorizes (2–3 min downtime).

Watch

  • CPU trend (target <80% post-restart; if climbing back, escalate immediately).
  • Service error logs for crashes or OOM events during restart cycle.

Escalate If

CPU >80% after 3 service restarts or doesn't drop within 5 minutes of first action.

STATUS CHANGE
2026-05-08 00:16 UTC
Auto-resolver: CPU at 22.2% (below 70% clear threshold) — clean check 1/2
STATUS CHANGE
2026-05-08 00:16 UTC
Auto-resolver: CPU at 22.2% (below 70% clear threshold) — clean check 1/2
STATUS CHANGE
2026-05-08 00:21 UTC
Auto-resolver: CPU at 22.2% (below 70% clear threshold) — clean check 2/2
STATUS CHANGE
2026-05-08 00:21 UTC
Auto-resolver: CPU at 22.2% (below 70% clear threshold) — clean check 2/2
STATUS CHANGE
2026-05-08 00:21 UTC
AUTO-RESOLVED: CPU sustained below 70% for 2 consecutive checks. Current value: 22.2%
STATUS CHANGE
2026-05-08 00:21 UTC
AUTO-RESOLVED: CPU sustained below 70% for 2 consecutive checks. Current value: 22.2%
·
HANDOFF
2026-05-08 04:10 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **Incident Summary:** CPU spike to 96.1% on claw-gateway1 triggered MEDIUM severity alert at 00:00 UTC on 2026-05-08. Auto-resolved at 00:21 UTC when CPU dropped to 22.2% and sustained below 70% threshold for 2 consecutive checks. - **What Occurred:** Likely runaway process or service memory leak caused acute CPU spike. No manual intervention was required; alert auto-cleared without escalation. - **Current State:** CPU stable at 22.2%; claw-gateway1 operating normally. No service degradation reported. - **Watch For:** Monitor claw-gateway1 CPU over next 2-4 hours for recurrence. If CPU spikes above 80% again, escalate to oncall.ado-runner.com and investigate process list via `top` to identify persistent culprit. - **Runbook Available:** Full diagnostic steps documented in linked runbook if recurring alerts occur.
·
HANDOFF
2026-05-14 04:05 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **Incident Summary:** CPU spike to 96.1% on claw-gateway1 triggered a MEDIUM severity alert at 00:00 UTC on 2026-05-08. Alert auto-resolved at 00:21 UTC after CPU dropped to 22.2% and remained stable. - **Root Cause:** Likely a runaway process or service memory leak; specific culprit process was not manually identified before auto-recovery occurred. - **Current State:** RESOLVED. CPU sustained below 70% threshold for 2 consecutive checks and remains at 22.2%. No manual intervention was required. - **Watch For:** Monitor claw-gateway1 CPU over the next shift for recurring spikes. If CPU exceeds 90% again, manually identify the culprit process using `top -b -n 1 | head -20` and investigate for memory leaks or inefficient query loops in ADOStack services. - **Follow-up:** Consider reviewing claw-gateway1 service logs and resource metrics from 00:00–00:21 UTC to determine root cause and prevent recurrence; escalate to oncall if CPU persists above 80% after service restarts.
·
HANDOFF
2026-05-16 14:12 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **Incident:** CPU spike to 96.1% on claw-gateway1 triggered MEDIUM severity alert at 00:00 UTC on 2026-05-08; likely caused by runaway process or service memory leak. - **Resolution:** CPU automatically normalized to 22.2% within 21 minutes; alert auto-resolved after two consecutive checks confirmed sustained drop below 70% threshold. - **Current State:** RESOLVED. Host is healthy with CPU at 22.2% as of 00:21 UTC. No manual intervention was required. - **Watch For:** Monitor claw-gateway1 CPU metrics over the next shift for signs of recurrence. If CPU spikes again, check `top` for runaway processes and consider service restart per runbook; escalate to oncall.ado-runner.com if CPU persists >80% after restarts.
·
HANDOFF
2026-05-29 04:57 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **What Happened:** CPU spiked to 96.1% on claw-gateway1 at 00:00 UTC on 2026-05-08, triggering a MEDIUM severity alert. Likely caused by a runaway process, memory leak, or inefficient query loop. - **Resolution:** Alert auto-resolved at 00:21 UTC after CPU dropped to 22.2% and remained below 70% threshold for two consecutive checks. No manual intervention was required. - **Current State:** RESOLVED. claw-gateway1 is operating normally with CPU sustained at 22.2%. - **Watch For:** Monitor for CPU spike recurrence on claw-gateway1. If CPU exceeds 90% again, investigate with `top` to identify the culprit process, and escalate to oncall.ado-runner.com if restarts don't resolve the issue within 3 attempts. - **Runbook Available:** Documentation and past incident context are available for reference if this alert triggers again.
·
HANDOFF
2026-05-31 14:59 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **What Happened:** CPU spiked to 96.1% on claw-gateway1 at 00:00 UTC on 2026-05-08, triggering a MEDIUM severity alert. Likely caused by a runaway process or service memory leak. - **Resolution:** Alert auto-resolved at 00:21 UTC after CPU dropped to 22.2% and remained below 70% threshold for two consecutive checks. No manual intervention was required. - **Current State:** Incident is RESOLVED. claw-gateway1 is operating normally with CPU at 22.2%. - **Watch For:** Monitor for CPU spikes returning to claw-gateway1. If recurrence occurs, investigate via `top -b -n 1` to identify the culprit process, then stop/restart the affected ADOStack service. Escalate if CPU persists >80% after restarts.
·
HANDOFF
2026-05-31 17:20 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **What Happened:** CPU spiked to 96.1% on claw-gateway1 at 00:00 UTC on 2026-05-08, triggering a MEDIUM severity alert. Likely caused by a runaway process or service memory leak. - **Resolution:** Alert auto-resolved at 00:21 UTC after CPU dropped to 22.2% and remained stable for two consecutive checks (below 70% threshold). No manual intervention was required. - **Current State:** Incident is fully resolved and closed. claw-gateway1 is operating normally with CPU at safe levels. - **Next Steps / Watch For:** Monitor claw-gateway1 for any recurring CPU spikes. If the issue recurs, investigate running processes with `top` and check for memory leaks or inefficient query loops. Escalate to oncall if CPU sustains above 80% after service restart attempts.
·
HANDOFF
2026-06-06 10:43 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **What Happened:** CPU spike to 96.1% on claw-gateway1 triggered MEDIUM severity alert at 00:00 UTC on 2026-05-08. Likely caused by runaway process or service memory leak. - **Resolution:** Incident auto-resolved at 00:21 UTC after CPU dropped to 22.2% and remained stable below 70% threshold for 2 consecutive monitoring checks (~5 min apart). - **Current State:** ✅ RESOLVED. claw-gateway1 CPU nominal at 22.2%; no manual intervention was required. - **Watch For:** Monitor for CPU spike recurrence on claw-gateway1. If CPU exceeds 90% again, use `top -b -n 1` to identify runaway processes, then escalate to oncall.ado-runner.com if restart attempts don't resolve.
·
HANDOFF
2026-06-12 12:10 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **What Happened:** CPU spiked to 96.1% on claw-gateway1 at 00:00 UTC on 2026-05-08, triggering a MEDIUM severity alert. Likely caused by a runaway process or service memory leak. - **Resolution:** Alert auto-resolved at 00:21 UTC after CPU dropped to 22.2% and remained stable below 70% threshold for two consecutive checks. No manual intervention was required. - **Current State:** RESOLVED. claw-gateway1 operating normally with CPU at 22.2%. No active alerts. - **Watch For:** Monitor claw-gateway1 for CPU spikes over the next shift. If CPU returns to >90%, check for runaway processes via `top` and review service logs for memory leaks or inefficient query loops before escalating to on-call.
·
HANDOFF
2026-06-12 23:20 UTC
Handoff notes generated: # Shift Handoff Notes: claw-gateway1 CPU Alert - **Incident:** CPU spiked to 96.1% on claw-gateway1 at 00:00 UTC on 2026-05-08, triggering a MEDIUM severity alert. Likely caused by a runaway process or service memory leak. - **Resolution:** Alert auto-resolved at 00:21 UTC after CPU dropped to 22.2% and remained stable for two consecutive checks below 70% threshold. - **Current State:** RESOLVED. claw-gateway1 operating normally with CPU at 22.2%. No manual intervention was required. - **Watch For:** Monitor claw-gateway1 CPU over the next shift for any recurrence of spikes above 90%. If CPU issues persist, investigate via `top` to identify resource-consuming processes, and consider escalating for deeper service diagnostics.
Update Status
Details
ID #23
Severity MEDIUM
Source infra_monitor
Status RESOLVED
Opened 2026-05-08 00:00