Live
#37 high infra_monitor
Infra Monitor: Critical CPU usage spike requires immediate investigation.
Host: claw-gateway1 CPU usage has spiked to 97.1% and is trending sharply upward with a 70.6% increase over the last 5 readings, indicating a potential runaway process or resource exhaustion event. All other metrics remain healthy: memory at 51.9%, disk utilization well below thresholds, and process count nominal at 140. Immediate action needed to identify and remediate the CPU load source. CPU: 97.1% | Memory: 51.9% Anomalies: CPU usage critically high at 97.1% (threshold: >95%), CPU trending upward with +70.6% increase over last 5 readings
Opened 2026-05-27 00:03 UTC · Resolved 2026-05-27 00:39 UTC
Handoff Notes ← Dashboard
Timeline
WEBHOOK
2026-05-27 00:03 UTC
Alert received from AI Infra Monitor. Host: claw-gateway1, Severity: HIGH
CONTEXT AGGREGATED
2026-05-27 00:03 UTC
Sources available: 3/3 — Runbook: ✓ | Past incidents: ✓ | Infra health: ✓
Response Plan
2026-05-27 00:03 UTC

Severity

P1: Single gateway host at 97.1% CPU with upward trend; risk of service degradation or failover cascade if sustained.

Root Cause

  • Runaway process consuming CPU (check top/htop for offender)
  • Recent deployment or config change on claw-gateway1
  • Resource contention from batch job or scheduled task

Actions

  1. SSH to claw-gateway1; run top -b -n1 | head -20 to identify top CPU consumer.
  2. If process identified: kill or restart it; if unknown/critical: trigger graceful drain to peer gateway.
  3. Confirm CPU drops below 85% within 2 minutes; if not, escalate to platform team.
  4. Check deployment logs and cron jobs for recent changes that correlate with spike start time.
  5. Re-enable alerts once stable for 5+ minutes.

Watch

  • CPU usage (target <75% sustained).
  • Request latency and error rates on claw-gateway endpoints.

Escalate If

CPU remains >90% after process kill or no offending process identified.

STATUS CHANGE
2026-05-27 00:05 UTC
OPEN -> INVESTIGATING (auto-remediation approved via Telegram)
WEBHOOK
2026-05-27 00:05 UTC
Auto-remediation approved by on-call engineer via Telegram.
STATUS CHANGE
2026-05-27 00:34 UTC
Auto-resolver: CPU at 29.5% (below 70% clear threshold) — clean check 1/2
STATUS CHANGE
2026-05-27 00:34 UTC
Auto-resolver: CPU at 29.5% (below 70% clear threshold) — clean check 1/2
STATUS CHANGE
2026-05-27 00:39 UTC
Auto-resolver: CPU at 29.5% (below 70% clear threshold) — clean check 2/2
STATUS CHANGE
2026-05-27 00:39 UTC
Auto-resolver: CPU at 29.5% (below 70% clear threshold) — clean check 2/2
STATUS CHANGE
2026-05-27 00:39 UTC
AUTO-RESOLVED: CPU sustained below 70% for 2 consecutive checks. Current value: 29.5%
STATUS CHANGE
2026-05-27 00:39 UTC
AUTO-RESOLVED: CPU sustained below 70% for 2 consecutive checks. Current value: 29.5%
·
HANDOFF
2026-05-29 04:57 UTC
Handoff notes generated: # Shift Handoff Notes - **Incident**: Critical CPU spike on claw-gateway1 (peaked at 97.1%) triggered P1 alert at 00:03 UTC on 2026-05-27 - **Action Taken**: Auto-remediation was approved and executed; CPU dropped to 29.5% and remained stable for 2 consecutive checks; incident auto-resolved at 00:39 UTC - **Current State**: claw-gateway1 operating normally with CPU at 29.5% (well below 70% threshold); no active issues - **Root Cause**: Not definitively identified in automated response; likely runaway process, recent deployment, or batch job contention (see AI plan in ticket) - **Next Steps**: Monitor claw-gateway1 CPU trends over next shift; if spike recurs, manually investigate with `top` to identify root cause; review recent deployments/config changes to prevent repeat
·
HANDOFF
2026-05-31 15:00 UTC
Handoff notes generated: # Shift Handoff Notes - **Incident Summary**: P1 CPU spike on claw-gateway1 peaked at 97.1% on 2026-05-27 at 00:03 UTC; auto-remediation was approved and executed. - **Resolution**: CPU normalized to 29.5% within ~37 minutes; incident auto-resolved after 2 consecutive clean checks below 70% threshold. - **Current State**: claw-gateway1 is healthy and stable. No manual intervention was required beyond approval; root cause of the spike was not explicitly identified in logs. - **Watch For**: Monitor claw-gateway1 CPU trends over the next 24–48 hours for recurrence. If spike returns, investigate runaway processes, recent deployments, or scheduled batch jobs using `top`/`htop` and review application logs. - **Follow-up**: Consider post-incident review to identify what triggered the spike (process, deployment, or resource contention) to prevent future occurrences.
·
HANDOFF
2026-05-31 22:47 UTC
Handoff notes generated: # Shift Handoff Notes - **Incident**: P1 CPU spike on claw-gateway1 peaked at 97.1% on 2026-05-27 at 00:03 UTC; auto-remediation was approved and executed. - **Resolution**: Auto-remediation successfully resolved the issue; CPU dropped to 29.5% and remained stable across two consecutive validation checks before auto-closure at 00:39 UTC. - **Current State**: RESOLVED. Host is operating normally with CPU at safe levels; no manual intervention required. - **Root Cause**: Likely runaway process or resource contention (specific process not identified in logs). Recommend reviewing claw-gateway1 process history and any recent deployments/config changes if spike recurs. - **Watch For**: Monitor claw-gateway1 CPU metrics over next 24-48 hours for any repeat spikes; if CPU creeps above 70% again, escalate for deeper investigation (process identification, deployment review, batch job interference).
·
HANDOFF
2026-06-06 10:43 UTC
Handoff notes generated: # Shift Handoff Notes - **Incident**: P1 CPU spike on claw-gateway1 peaked at 97.1% on 2026-05-27 at 00:03 UTC; auto-remediation was approved and executed via Telegram by on-call engineer. - **Resolution**: CPU stabilized to 29.5% within ~37 minutes; incident auto-resolved after passing 2 consecutive health checks below 70% threshold at 00:39 UTC. - **Current State**: claw-gateway1 operating normally with CPU sustained at safe levels; no lingering alerts or service degradation observed. - **Root Cause**: Not explicitly identified in logs—runaway process suspected but remediation action (likely process kill/restart) resolved issue without manual intervention. - **Watch For**: Monitor claw-gateway1 CPU metrics over next 24-48 hours for recurrence; if spike returns, manually investigate via `top/htop` to identify root process and consider permanent fix (deployment rollback, config adjustment, or scheduled task optimization).
·
HANDOFF
2026-06-09 12:13 UTC
Handoff notes generated: # Shift Handoff Notes - **Incident**: P1 CPU spike on claw-gateway1 peaked at 97.1% on 2026-05-27 00:03 UTC; auto-remediation was approved and executed via Telegram. - **Resolution**: CPU stabilized to 29.5% within ~37 minutes; auto-resolver confirmed 2 consecutive clean checks below 70% threshold and auto-resolved at 00:39 UTC. - **Current State**: Incident RESOLVED; claw-gateway1 operating normally with CPU sustained well below alert threshold. - **Root Cause**: Likely runaway process or resource contention—underlying cause was not explicitly identified in logs. Recommend reviewing host logs/process history if spike recurs. - **Watch For**: Monitor claw-gateway1 CPU metrics closely over next shift; if spike returns, manually SSH and run `top` to identify the offending process before auto-remediation triggers again.
Update Status
Details
ID #37
Severity HIGH
Source infra_monitor
Status RESOLVED
Opened 2026-05-27 00:03