🚒 Technical Incident Standards (The "Firefighter" Protocol)
Audience: Gus Fring (Stability) & The Architect. Objective: Restore service first. Ask questions later.
[!CRITICAL] The Fring Mandate: "Chaos is bad for business. When the alarm rings, you do not debate. You execute the protocol."
1. 🚨 Severity Classification (Defcon Levels)
- SEV-1 (Critical): System Down. Data Loss. Security Breach.
- Response: Immediate. Wake up everyone.
- SLA: < 15 mins to Acknowledge.
- SEV-2 (High): Major feature broken (e.g., Checkout). Workaround exists but is painful.
- Response: < 1 hour.
- SEV-3 (Medium): Minor bug or annoyance.
- Response: Business hours.
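The levels above can be captured as a small lookup table so that alert routing and humans agree on the same numbers. This is a minimal sketch, not an existing tool: `SeverityPolicy`, `POLICIES`, `classify`, and the boolean inputs are all hypothetical names, and only the time limits and descriptions come from the list above.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    """One row of the severity table (hypothetical structure)."""
    label: str
    description: str
    # Deadline to acknowledge (SEV-1) or respond (SEV-2).
    # None means "handle during business hours" (SEV-3).
    respond_within: Optional[timedelta]
    wake_everyone: bool


# Time limits are taken from the list above; everything else is illustrative.
POLICIES = {
    "SEV-1": SeverityPolicy(
        "Critical", "System down, data loss, or security breach",
        timedelta(minutes=15), True,
    ),
    "SEV-2": SeverityPolicy(
        "High", "Major feature broken; workaround exists but is painful",
        timedelta(hours=1), False,
    ),
    "SEV-3": SeverityPolicy(
        "Medium", "Minor bug or annoyance", None, False,
    ),
}


def classify(system_down: bool, data_loss: bool,
             security_breach: bool, major_feature_broken: bool) -> str:
    """Map observed symptoms to a severity level, worst case first."""
    if system_down or data_loss or security_breach:
        return "SEV-1"
    if major_feature_broken:
        return "SEV-2"
    return "SEV-3"
```

For example, `POLICIES[classify(False, False, False, True)]` yields the SEV-2 policy with a one-hour response deadline. Checking the SEV-1 conditions first means an incident that matches several rows always gets the stricter treatment.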
2. 🛡️ The "War Room" Protocol (During Incident)
- Containment: Stop the bleeding.
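- Action: Roll back the deployment immediately.
- Command: Redeploy the previous image tag, or use your platform's equivalent. (Note: `docker compose` has no `rollback` subcommand; on Docker Swarm, use `docker service rollback <service>`.)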
- Communication:
- Public: "We are investigating an issue." (Do not blame tech).
- Internal: "Incident Commander is [Name]."
3. 🔙 Rollback Policy
- The "Golden Rule": If a deployment fails health checks for > 2 mins, AUTO-ROLLBACK.
- Database: DB Migrations must be backwards compatible.
- Ban: Renaming a column in the same deploy as code usage change.
- Strategy: Add new column -> Sync -> Deprecate old -> Remove old.
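The "Golden Rule" can be expressed as a small decision function that a deploy watcher might call on each health-check result. This is only a sketch under stated assumptions: it reads "fails health checks for > 2 mins" as continuous failure (a single healthy check resets the clock), and `should_auto_rollback` and the sample format are hypothetical, not part of any deployment tool.

```python
from typing import Iterable, Optional, Tuple

# "> 2 mins" from the Golden Rule above.
FAILURE_THRESHOLD_SECONDS = 120.0


def should_auto_rollback(
    samples: Iterable[Tuple[float, bool]],
    threshold: float = FAILURE_THRESHOLD_SECONDS,
) -> bool:
    """Decide whether a deployment has earned an automatic rollback.

    samples: (timestamp_in_seconds, healthy) pairs in chronological order.
    Returns True once checks have failed continuously for longer than
    `threshold` seconds. Assumption: any healthy check resets the timer.
    """
    failing_since: Optional[float] = None
    for timestamp, healthy in samples:
        if healthy:
            failing_since = None  # recovered; forget the earlier failures
            continue
        if failing_since is None:
            failing_since = timestamp  # first failure of this streak
        if timestamp - failing_since > threshold:
            return True
    return False
```

A streak of failures at 10 s, 70 s, and 131 s after deploy triggers a rollback (121 s of continuous failure), while a streak that is interrupted by one healthy check does not. Whether a single healthy check should reset the timer is a policy choice; a stricter variant could require several consecutive healthy checks.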
4. 📝 Post-Incident Review (The "Blameless" Post-Mortem)
After the dust settles (SEV-1/SEV-2 only):
- Artifact: Create `docs/ops/incident_reports/YYYY-MM-DD-incident.md`.
- The 5 Whys: Drill down to the root cause (process failure, not human error).
- Action Items: Create Jira/Task to fix the Process so it never happens again.
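To keep report names consistent, the artifact path can be generated rather than typed by hand. The sketch below assumes the `YYYY-MM-DD` in the filename is the incident date in ISO format; `report_path` and `report_skeleton` are hypothetical helpers, and the skeleton's section titles are illustrative rather than a mandated template.

```python
from datetime import date
from pathlib import PurePosixPath

# Directory taken from the Artifact bullet above.
REPORT_DIR = PurePosixPath("docs/ops/incident_reports")


def report_path(incident_date: date) -> PurePosixPath:
    """Build docs/ops/incident_reports/YYYY-MM-DD-incident.md for a date."""
    return REPORT_DIR / f"{incident_date.isoformat()}-incident.md"


def report_skeleton(incident_date: date, severity: str) -> str:
    """Return a starter report body (illustrative layout, not a standard)."""
    if severity not in ("SEV-1", "SEV-2"):
        # The review is required for SEV-1/SEV-2 only.
        raise ValueError("Post-incident reviews cover SEV-1/SEV-2 only")
    whys = "\n".join(f"{n}. Why?" for n in range(1, 6))
    return (
        f"Incident report: {incident_date.isoformat()} ({severity})\n\n"
        "The 5 Whys (process failure, not human error)\n"
        f"{whys}\n\n"
        "Action Items\n"
        "- \n"
    )
```

Calling `report_path(date(2024, 1, 5))` gives `docs/ops/incident_reports/2024-01-05-incident.md`. Rejecting SEV-3 in `report_skeleton` mirrors the "SEV-1/SEV-2 only" scope stated above.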
5. 🧯 The Firefighter's Checklist
During an alert:
- Status Page: Is it updated?
- Logs: Are we capturing the error traces? (See `observability_standards.md`.)
- Rollback: Is the previous image available?
- Silence: Did we mute non-critical alerts to focus?