Severity
P1: claw-gateway1 CPU at 95.7% with +67.1% upward trend; gateway availability at imminent risk.
Root Cause
- Runaway process consuming CPU (healthy memory and disk suggest the problem is isolated to CPU)
- Inefficient query or loop in a running service
Actions
- SSH to claw-gateway1 (161.35.229.80) and run the following to identify the offender:
    top -b -n1 | head -20
    ps aux --sort=-%cpu | head -10
- Confirm process legitimacy; if rogue/stuck, kill it (kill -9 PID) and document.
- Restart affected service if process is a known worker/daemon.
- Verify CPU drop within 2 minutes; confirm no memory spike post-restart.
- Check Infra Monitor and incident logs for a pattern (similar incidents show no recorded resolution, so this may be recurring).
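
The identification step above can be sketched in Python for cases where the output needs to be filtered or attached to the incident record. This is a minimal sketch, not an established tool: the names `parse_ps_aux`, `top_offenders`, and `snapshot`, and the 50% cutoff, are assumptions for illustration. It assumes the procps `ps aux` column layout (USER, PID, %CPU, ..., COMMAND).

```python
import subprocess
from typing import List, NamedTuple


class Proc(NamedTuple):
    user: str
    pid: int
    cpu: float
    command: str


def parse_ps_aux(text: str) -> List[Proc]:
    """Parse `ps aux` output into Proc records, skipping the header row."""
    procs = []
    for line in text.strip().splitlines()[1:]:
        # ps aux has 11 columns; the last (COMMAND) may contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue  # ignore truncated or malformed rows
        procs.append(Proc(fields[0], int(fields[1]), float(fields[2]), fields[10]))
    return procs


def top_offenders(text: str, min_cpu: float = 50.0, limit: int = 10) -> List[Proc]:
    """Return up to `limit` processes at or above `min_cpu` percent, highest first.

    The 50.0 default is a hypothetical cutoff, not a value from the runbook.
    """
    procs = [p for p in parse_ps_aux(text) if p.cpu >= min_cpu]
    return sorted(procs, key=lambda p: p.cpu, reverse=True)[:limit]


def snapshot() -> str:
    """Capture the same listing the runbook command produces (run on the host)."""
    result = subprocess.run(
        ["ps", "aux", "--sort=-%cpu"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On the gateway, `top_offenders(snapshot())` would list the candidates to review before any kill. Note that the %CPU column from procps `ps` is averaged over the lifetime of the process, so a long-running process that only recently started spinning may rank lower here than in `top`.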
Watch
- CPU utilization trend (target: <70% within 5 min, <50% within 10 min).
- Process restart/recovery metrics; alert if same process respawns at high CPU.
Escalate If
CPU remains >90% after process kill, or if restart triggers cascading failures on dependent services.
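
The Watch targets and the CPU escalation condition can be expressed as one small check. This is a sketch under stated assumptions: the function name `assess` and the status labels are hypothetical, and applying the 2-minute verification window from Actions as the point at which ">90% after process kill" triggers escalation is an interpretation, not something the runbook states. Cascading failures on dependent services are not covered here.

```python
from typing import List, Tuple

# (deadline in seconds after the kill/restart, CPU percent the reading must be below)
TARGETS: List[Tuple[float, float]] = [(300.0, 70.0), (600.0, 50.0)]

VERIFY_WINDOW_S = 120.0  # "verify CPU drop within 2 minutes"
ESCALATE_ABOVE = 90.0    # "CPU remains >90% after process kill"


def assess(elapsed_s: float, cpu_pct: float) -> str:
    """Classify a single CPU reading taken `elapsed_s` seconds after remediation.

    Returns one of: "escalate", "behind_target", "on_track", "pending".
    """
    # Assumption: once the verification window has passed, >90% means escalate.
    if elapsed_s >= VERIFY_WINDOW_S and cpu_pct > ESCALATE_ABOVE:
        return "escalate"

    # Collect the limits whose deadlines have already passed.
    due = [limit for deadline, limit in TARGETS if elapsed_s >= deadline]
    if not due:
        return "pending"  # no target is due yet

    # The strictest due limit is the one that currently applies.
    return "on_track" if cpu_pct < min(due) else "behind_target"
```

For example, a reading of 75% at the 5-minute mark would come back as "behind_target", while 95% at any point after the 2-minute window would come back as "escalate".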