minions-ai-agents/antigravity_brain_export/knowledge/tech_incident_standards.md

2.0 KiB

🚒 Technical Incident Standards (The "Firefighter" Protocol)

Audience: Gus Fring (Stability) & The Architect. Objective: Restore service first. Ask questions later.

[!CRITICAL] The Fring Mandate: "Chaos is bad for business. When the alarm rings, you do not debate. You execute the protocol."

1. 🚨 Severity Classification (Defcon Levels)

  • SEV-1 (Critical): System Down. Data Loss. Security Breach.
    • Response: Immediate. Wake up everyone.
    • SLA: < 15 mins to Acknowledge.
  • SEV-2 (High): Major feature broken (e.g., Checkout). Workaround exists but is painful.
    • Response: < 1 hour.
  • SEV-3 (Medium): Minor bug or annoyance.
    • Response: Business hours.

2. 🛡️ The "War Room" Protocol (During Incident)

  1. Containment: Stop the bleeding.
    • Action: Rollback the deployment immediately.
    • Command: docker compose rollback (or equivalent).
  2. Communication:
    • Public: "We are investigating an issue." (Do not blame tech).
    • Internal: "Incident Commander is [Name]."

3. 🔙 Rollback Policy

  • The "Golden Rule": If a deployment fails health checks for > 2 mins, AUTO-ROLLBACK.
  • Database: DB Migrations must be backwards compatible.
    • Ban: Renaming a column in the same deploy as code usage change.
    • Strategy: Add new column -> Sync -> Deprecate old -> Remove old.

4. 📝 Post-Incident Review (The "Blameless" Post-Mortem)

After the dust settles (SEV-1/SEV-2 only):

  1. Artifact: Create docs/ops/incident_reports/YYYY-MM-DD-incident.md.
  2. The 5 Whys: Drill down to the root cause (process failure, not human error).
  3. Action Items: Create Jira/Task to fix the Process so it never happens again.

5. 🧯 The Firefighter's Checklist

During an alert:

  • Status Page: Is it updated?
  • Logs: Are we capturing the error traces? (observability_standards.md).
  • Rollback: Is the previous image available?
  • Silence: Did we mute non-critical alerts to focus?