#53 high infra_monitor
Infra Monitor: Critical CPU usage at 95.7% with sharp upward trend.
Host: claw-gateway1
CPU utilization has breached the critical threshold at 95.7% and is trending sharply upward (+67.1% over the last 5 readings). Memory, disk, and process counts remain healthy. The rapid escalation suggests a runaway process or resource leak and needs immediate investigation to prevent system degradation.
CPU: 95.7% | Memory: 42.7%
Anomalies: CPU usage at 95.7% (critical threshold breached); CPU trending upward +67.1% over the last 5 readings (concerning trajectory)
Opened 2026-06-12 00:03 UTC · Resolved 2026-06-12 00:10 UTC
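The alert does not state how the "+67.1% over last 5 readings" figure is computed. A minimal sketch, assuming it is the change in percentage points between the oldest and newest of the last five samples. The function name and the sample readings are hypothetical, chosen only so the arithmetic lands on the reported values:

```python
def cpu_trend(readings, window=5):
    """Return the change, in percentage points, across the last `window` readings."""
    recent = readings[-window:]
    if len(recent) < 2:
        return 0.0  # not enough data to compute a trend
    return round(recent[-1] - recent[0], 1)


# Hypothetical samples (not taken from the incident); only the final
# value, 95.7, matches the alert.
samples = [28.6, 41.3, 58.9, 77.4, 95.7]
trend = cpu_trend(samples)
print(f"CPU {samples[-1]}% | trend {trend:+.1f} points")
```

If the monitor instead reports a relative change (newest divided by oldest, minus one), the same readings would give a different number, so the definition is worth confirming before tuning thresholds against it.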
Timeline
WEBHOOK
2026-06-12 00:03 UTC
Alert received from AI Infra Monitor. Host: claw-gateway1, Severity: HIGH
CONTEXT AGGREGATED
2026-06-12 00:03 UTC
Sources available: 3/3 — Runbook: ✓ | Past incidents: ✓ | Infra health: ✓
Response Plan
2026-06-12 00:03 UTC

Severity

P1: claw-gateway1 CPU at 95.7% with +67.1% upward trend; gateway availability at imminent risk.

Root Cause

  • Runaway process consuming CPU (healthy memory and disk point to an isolated CPU-bound problem)
  • Inefficient query or tight loop in a running service

Actions

  1. SSH to claw-gateway1 (161.35.229.80) and run top -b -n1 | head -20 and ps aux --sort=-%cpu | head -10 to identify the offending process.
  2. Confirm the process is legitimate; if it is rogue or stuck, kill it (kill -9 PID) and document the action.
  3. Restart the affected service if the process is a known worker or daemon.
  4. Verify that CPU drops within 2 minutes; confirm there is no memory spike after the restart.
  5. Check Infra Monitor and incident logs for a pattern (similar past incidents show no recorded resolution, so this may be recurring).
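
Step 1 above can be scripted so that the top CPU consumers come back as structured data instead of raw terminal output. A minimal sketch, assuming a Linux host with procps ps (the --sort option is procps-specific) and the standard 11-column ps aux layout. The function names are hypothetical:

```python
import subprocess


def parse_ps_aux(output, limit=10):
    """Parse `ps aux` text into dicts, keeping the first `limit` processes."""
    lines = output.strip().splitlines()
    processes = []
    for line in lines[1:limit + 1]:  # skip the header row
        fields = line.split(None, 10)  # COMMAND (11th column) may contain spaces
        if len(fields) < 11:
            continue  # ignore malformed rows
        processes.append({
            "user": fields[0],
            "pid": int(fields[1]),
            "cpu": float(fields[2]),
            "mem": float(fields[3]),
            "command": fields[10],
        })
    return processes


def top_cpu_processes(limit=10):
    """Run ps sorted by CPU, as in step 1, and parse the result."""
    result = subprocess.run(
        ["ps", "aux", "--sort=-%cpu"],
        capture_output=True, text=True, check=True,
    )
    return parse_ps_aux(result.stdout, limit)
```

Calling top_cpu_processes(5) on the host returns the five busiest processes. The sketch only reports them; the decision to kill or restart (steps 2 and 3) stays with the responder.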

Watch

  • CPU utilization trend (target: <70% within 5 min, <50% within 10 min).
  • Process restart/recovery metrics; alert if the same process respawns at high CPU.

Escalate If

CPU remains >90% after the process is killed, or the restart triggers cascading failures on dependent services.
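
The Watch targets and the escalation condition above can be expressed as simple checks. A minimal sketch using the thresholds from the plan (70% at 5 minutes, 50% at 10 minutes, 90% for escalation). The function names and the idea of measuring minutes from the fix are assumptions, not part of the On-Call tooling:

```python
def recovery_target(minutes_since_fix):
    """CPU ceiling (%) from the Watch list, or None before the first checkpoint."""
    if minutes_since_fix >= 10:
        return 50.0
    if minutes_since_fix >= 5:
        return 70.0
    return None


def on_track(cpu_percent, minutes_since_fix):
    """True if CPU is below the target that applies at this point in recovery."""
    target = recovery_target(minutes_since_fix)
    return target is None or cpu_percent < target


def should_escalate(cpu_after_kill, cascading_failures):
    """Escalate if CPU stays above 90% after the kill, or dependents start failing."""
    return bool(cpu_after_kill > 90.0 or cascading_failures)
```

With the 11.6% reading recorded at resolution, on_track(11.6, 10) is true and should_escalate(11.6, False) is false.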

STATUS CHANGE
2026-06-12 00:10 UTC
Resolved after verification: live CPU is 11.6%, memory is 50.8%, and no sustained high-CPU process remains. The duplicate medium-severity alert path has been adjusted so that yellow threshold breaches no longer open On-Call incidents.
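
The routing adjustment can be pictured as a severity gate in front of incident creation. A minimal sketch, assuming alerts carry a threshold level such as "yellow" or "red". The "red" level name, the function name, and the action strings are hypothetical and are not taken from the actual On-Call configuration:

```python
# Hypothetical: levels that should open an On-Call incident.
# Yellow is deliberately absent.
PAGING_LEVELS = {"red"}


def route_alert(level):
    """Return the action for an alert at the given threshold level."""
    if level.lower() in PAGING_LEVELS:
        return "open_incident"
    return "log_only"  # e.g. yellow breaches are recorded but open no incident
```

Keeping the paging levels in one set means a later change to the policy touches a single line instead of each alert path.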
Details
ID #53
Severity HIGH
Source infra_monitor
Status RESOLVED
Opened 2026-06-12 00:03