#39 — CPU alert on claw-gateway1 — 90.0% (threshold: 90%)

⬡

WEBHOOK

2026-05-27 16:50 UTC

Alert received from AI Infra Monitor. Host: claw-gateway1, Severity: MEDIUM

△

STATUS CHANGE

2026-05-27 16:50 UTC

OPEN -> INVESTIGATING (auto - low/medium severity)

◎

CONTEXT AGGREGATED

2026-05-27 16:50 UTC

Sources available: 3/3 — Runbook: ✓ | Past incidents: ✓ | Infra health: ✓

✦

Response Plan

2026-05-27 16:50 UTC

Severity

Medium: CPU at threshold; no impact yet, but upward trend requires immediate investigation.

Root Cause

Runaway process or resource leak on claw-gateway1
Traffic spike or batch job consuming CPU

Actions

SSH to claw-gateway1; run `top -b -n 1 | head -20` to identify top CPU consumers.
If single process >50% CPU: kill/restart it; if distributed load: check active connections with `netstat -an | grep ESTABLISHED | wc -l`.
If CPU remains >80% after 3 minutes, restart ADOStack services: `systemctl restart ado-stack`.
If CPU still >80% after restart, trigger host reboot (2–3 min downtime expected).
Post-incident: review logs for anomalies; check for cron jobs or scheduled tasks.
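
The first two actions amount to a small decision: is one process above 50% CPU, or is the load spread out? A minimal Python sketch of that check, assuming the text passed in is the captured output of `top -b -n 1` with the default procps column layout (a header row containing `PID`, `%CPU`, and `COMMAND`). The function name `classify_cpu_load` and its return labels are hypothetical and not part of the monitoring stack.

```python
def classify_cpu_load(top_output, single_process_threshold=50.0):
    """Classify captured `top -b -n 1` text as single-process or distributed load.

    Returns (label, hogs), where hogs is a list of (pid, command, cpu_percent)
    for every process above the threshold. The 50% default mirrors the
    response plan; the labels are illustrative only.
    """
    lines = top_output.splitlines()

    # Locate the process-table header row (the one listing PID and %CPU).
    header_idx = None
    for i, line in enumerate(lines):
        tokens = line.split()
        if "PID" in tokens and "%CPU" in tokens:
            header_idx = i
            break
    if header_idx is None:
        raise ValueError("no process table header found in top output")

    columns = lines[header_idx].split()
    pid_col = columns.index("PID")
    cpu_col = columns.index("%CPU")
    cmd_col = columns.index("COMMAND")

    hogs = []
    for line in lines[header_idx + 1:]:
        # Limit the split so a COMMAND containing spaces stays in one field.
        fields = line.split(None, len(columns) - 1)
        if len(fields) < len(columns):
            continue  # blank or truncated row
        try:
            cpu = float(fields[cpu_col])
        except ValueError:
            continue  # unexpected formatting (e.g. a locale with comma decimals)
        if cpu > single_process_threshold:
            hogs.append((int(fields[pid_col]), fields[cmd_col], cpu))

    return ("single-process", hogs) if hogs else ("distributed", [])
```

If the result is `"single-process"`, the plan's next step is to kill or restart the listed process; otherwise the connection count from `netstat` is the next thing to check.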

Watch

CPU trend (alert if sustained >85% for 5 min; escalate at >95%).
Process list for memory leaks or stuck threads.
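
The CPU watch rule can be sketched as a small function over timestamped samples. This is an illustration under stated assumptions, not the monitor's actual logic: the name `watch_state`, the `(epoch_seconds, cpu_percent)` sample format, and the choice to escalate on any single reading above 95% are all assumptions, since the plan does not say whether the 95% condition must also be sustained.

```python
def watch_state(samples, alert_pct=85.0, escalate_pct=95.0, sustain_seconds=300):
    """Return "ok", "alert", or "escalate" for a series of CPU samples.

    samples: list of (epoch_seconds, cpu_percent) tuples, oldest first.
    """
    if not samples:
        return "ok"

    latest_ts, latest_cpu = samples[-1]

    # Assumption: a single reading above the escalation level is enough.
    if latest_cpu > escalate_pct:
        return "escalate"

    # Walk backwards to find where the current unbroken run above
    # alert_pct started.
    run_start = None
    for ts, cpu in reversed(samples):
        if cpu > alert_pct:
            run_start = ts
        else:
            break

    # Alert only if that run has lasted at least sustain_seconds.
    if run_start is not None and latest_ts - run_start >= sustain_seconds:
        return "alert"
    return "ok"
```

A dip below 85% anywhere in the window resets the run, so a short spike does not trigger the alert.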

Escalate If

CPU remains >85% after service restart or top process is unidentifiable.

△

STATUS CHANGE

2026-05-27 17:04 UTC

Auto-resolver: CPU at 33.8% (below 70% clear threshold) — clean check 1/2

△

STATUS CHANGE

2026-05-27 17:05 UTC

Auto-resolver: CPU at 33.8% (below 70% clear threshold) — clean check 1/2

△

STATUS CHANGE

2026-05-27 17:09 UTC

Auto-resolver: CPU at 33.8% (below 70% clear threshold) — clean check 2/2

△

STATUS CHANGE

2026-05-27 17:10 UTC

AUTO-RESOLVED: CPU sustained below 70% for 2 consecutive checks. Current value: 33.8%
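
The clear condition reported above (below 70% for 2 consecutive checks) can be modeled as a counter that resets on any reading at or above the threshold. The sketch below is a hypothetical reconstruction; the class name `AutoResolver` and its fields are assumptions, and it does not try to reproduce every entry in this timeline, such as the repeated "clean check 1/2".

```python
class AutoResolver:
    """Count consecutive clean checks and report when the alert can clear."""

    def __init__(self, clear_threshold=70.0, required_clean_checks=2):
        self.clear_threshold = clear_threshold
        self.required_clean_checks = required_clean_checks
        self.clean_checks = 0

    def observe(self, cpu_percent):
        """Record one check; return True once enough consecutive checks are clean."""
        if cpu_percent < self.clear_threshold:
            self.clean_checks += 1
        else:
            # Any reading at or above the threshold breaks the streak.
            self.clean_checks = 0
        return self.clean_checks >= self.required_clean_checks
```

With the values from this incident, two successive `observe(33.8)` calls would return `False` and then `True`.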

·

HANDOFF

2026-05-29 04:57 UTC

Handoff notes generated:

# Shift Handoff Notes: claw-gateway1 CPU Alert

- **Incident Summary:** CPU alert triggered on claw-gateway1 at 16:50 UTC when usage hit 90% (medium severity). Likely causes identified as runaway process, resource leak, or traffic spike.
- **Resolution:** CPU auto-resolved at 17:10 UTC after dropping to 33.8% and remaining below 70% threshold for 2 consecutive checks (~5 min duration). No manual intervention was required.
- **Current State:** claw-gateway1 operating normally with CPU at 33.8%. Incident marked RESOLVED.
- **Watch for:** Monitor for CPU spikes trending back toward threshold. If recurrence occurs, investigate top CPU consumers with `top` command and check active connections with `netstat` to identify resource-intensive processes or traffic anomalies.
- **No further action required** unless alert re-triggers; escalate to platform team if pattern repeats.

·

HANDOFF

2026-05-31 15:00 UTC

Handoff notes generated:

# Shift Handoff Notes: claw-gateway1 CPU Alert

- **Incident Summary:** CPU alert (90%) triggered on claw-gateway1 at 16:50 UTC on 2026-05-27; medium severity. Auto-resolved at 17:10 UTC after CPU dropped to 33.8% and sustained below 70% threshold for 2 consecutive checks.
- **Root Cause:** Not definitively identified. Suspected runaway process, resource leak, traffic spike, or batch job, but CPU normalized before manual investigation was needed.
- **Resolution:** Alert auto-resolved via monitoring thresholds; no manual intervention required. Host returned to healthy state within ~20 minutes.
- **Current State:** claw-gateway1 operating normally at 33.8% CPU as of last check. Incident marked RESOLVED.
- **Next Steps for On-Shift Team:** Monitor claw-gateway1 for CPU spikes over next 24–48 hours. If alert re-triggers, SSH in and run `top -b -n 1 | head -20` to identify top CPU consumers and `netstat -an | grep ESTABLISHED | wc -l` to check connection count. If sustained >80%, consider restart or escalation.

·

HANDOFF

2026-06-01 04:39 UTC

Handoff notes generated:

# Shift Handoff Notes: claw-gateway1 CPU Alert

- **Incident:** CPU alert (90%) triggered on claw-gateway1 at 16:50 UTC on 2026-05-27 (medium severity); auto-resolved at 17:10 UTC after CPU dropped to 33.8% and sustained below 70% for 2 consecutive checks.
- **Root Cause:** Likely a transient spike: either a runaway process, resource leak, or temporary traffic/batch job spike. No specific process was identified before auto-resolution.
- **Current State:** RESOLVED. CPU currently stable at 33.8% with no ongoing alerts.
- **Watch For:** Monitor claw-gateway1 for CPU creep or recurring spikes >80%. If this recurs, SSH in and run `top -b -n 1 | head -20` to identify the culprit (single process vs. distributed load), then escalate or restart as needed.
- **No Further Action Required:** Incident resolved automatically; escalate only if pattern repeats within next shift.

·

HANDOFF

2026-06-06 10:42 UTC

Handoff notes generated:

# Shift Handoff Notes: claw-gateway1 CPU Alert

- **Incident:** CPU alert triggered on claw-gateway1 at 16:50 UTC (2026-05-27) when usage spiked to 90%; medium severity.
- **Resolution:** Alert auto-resolved at 17:10 UTC after CPU dropped to 33.8% and remained below 70% threshold for 2 consecutive checks (~5 min duration).
- **Current State:** Host is healthy with CPU at normal levels; no manual intervention was required.
- **Likely Cause:** Suspected runaway process, resource leak, or temporary traffic/batch job spike; root cause not definitively identified due to auto-resolution.
- **Watch For:** Monitor claw-gateway1 CPU trends over next shift. If alert recurs, SSH in and run `top -b -n 1 | head -20` to identify top CPU consumers; check active connections with `netstat -an | grep ESTABLISHED | wc -l` if load appears distributed.

·

HANDOFF

2026-06-09 09:32 UTC

Handoff notes generated:

# Shift Handoff Notes: claw-gateway1 CPU Alert

- **Incident:** CPU spike to 90% on claw-gateway1 triggered medium-severity alert on 2026-05-27 at 16:50 UTC; auto-resolved at 17:10 UTC after CPU dropped to 33.8% and sustained below 70% for two consecutive checks.
- **Root Cause:** Not definitively identified; likely triggered by temporary traffic spike, batch job, or transient process, as CPU normalized without manual intervention.
- **Current State:** RESOLVED. claw-gateway1 CPU stable at 33.8% with no active alerts. Host is operating normally.
- **Watch For:** Monitor claw-gateway1 CPU trends over next shift. If alert recurs or CPU shows sustained upward trend >80%, investigate running processes with `top` and check for runaway jobs or traffic spikes. Review connection count with `netstat -an | grep ESTABLISHED | wc -l` if needed.
- **Next Steps:** No immediate action required. Consider reviewing logs from 16:45–17:15 UTC if pattern repeats to identify root cause.