# 🚒 Technical Incident Standards (The "Firefighter" Protocol) **Audience:** Gus Fring (Stability) & The Architect. **Objective:** Restore service first. Ask questions later. > [!CRITICAL] > **The Fring Mandate:** > "Chaos is bad for business. When the alarm rings, you do not debate. You execute the protocol." ## 1. 🚨 Severity Classification (Defcon Levels) * **SEV-1 (Critical):** System Down. Data Loss. Security Breach. * *Response:* Immediate. Wake up everyone. * *SLA:* < 15 mins to Acknowledge. * **SEV-2 (High):** Major feature broken (e.g., Checkout). Workaround exists but is painful. * *Response:* < 1 hour. * **SEV-3 (Medium):** Minor bug or annoyance. * *Response:* Business hours. ## 2. 🛡️ The "War Room" Protocol (During Incident) 1. **Containment:** Stop the bleeding. * *Action:* Rollback the deployment immediately. * *Command:* `docker compose rollback` (or equivalent). 2. **Communication:** * **Public:** "We are investigating an issue." (Do not blame tech). * **Internal:** "Incident Commander is [Name]." ## 3. 🔙 Rollback Policy * **The "Golden Rule":** If a deployment fails health checks for > 2 mins, AUTO-ROLLBACK. * **Database:** DB Migrations must be backwards compatible. * *Ban:* Renaming a column in the same deploy as code usage change. * *Strategy:* Add new column -> Sync -> Deprecate old -> Remove old. ## 4. 📝 Post-Incident Review (The "Blameless" Post-Mortem) After the dust settles (SEV-1/SEV-2 only): 1. **Artifact:** Create `docs/ops/incident_reports/YYYY-MM-DD-incident.md`. 2. **The 5 Whys:** Drill down to the root cause (process failure, not human error). 3. **Action Items:** Create Jira/Task to fix the *Process* so it never happens again. ## 5. 🧯 The Firefighter's Checklist During an alert: - [ ] **Status Page:** Is it updated? - [ ] **Logs:** Are we capturing the error traces? (`observability_standards.md`). - [ ] **Rollback:** Is the previous image available? - [ ] **Silence:** Did we mute non-critical alerts to focus?