2.0 KiB

Raw Blame History

🚒 Technical Incident Standards (The "Firefighter" Protocol)

Audience: Gus Fring (Stability) & The Architect. Objective: Restore service first. Ask questions later.

[!CRITICAL] The Fring Mandate: "Chaos is bad for business. When the alarm rings, you do not debate. You execute the protocol."

1. 🚨 Severity Classification (Defcon Levels)

SEV-1 (Critical): System Down. Data Loss. Security Breach.
- Response: Immediate. Wake up everyone.
- SLA: < 15 mins to Acknowledge.
SEV-2 (High): Major feature broken (e.g., Checkout). Workaround exists but is painful.
- Response: < 1 hour.
SEV-3 (Medium): Minor bug or annoyance.
- Response: Business hours.

2. 🛡️ The "War Room" Protocol (During Incident)

Containment: Stop the bleeding.
- Action: Rollback the deployment immediately.
- Command: docker compose rollback (or equivalent).
Communication:
- Public: "We are investigating an issue." (Do not blame tech).
- Internal: "Incident Commander is [Name]."

3. 🔙 Rollback Policy

The "Golden Rule": If a deployment fails health checks for > 2 mins, AUTO-ROLLBACK.
Database: DB Migrations must be backwards compatible.
- Ban: Renaming a column in the same deploy as code usage change.
- Strategy: Add new column -> Sync -> Deprecate old -> Remove old.

4. 📝 Post-Incident Review (The "Blameless" Post-Mortem)

After the dust settles (SEV-1/SEV-2 only):

Artifact: Create docs/ops/incident_reports/YYYY-MM-DD-incident.md.
The 5 Whys: Drill down to the root cause (process failure, not human error).
Action Items: Create Jira/Task to fix the Process so it never happens again.

5. 🧯 The Firefighter's Checklist

During an alert:

Status Page: Is it updated?
Logs: Are we capturing the error traces? (observability_standards.md).
Rollback: Is the previous image available?
Silence: Did we mute non-critical alerts to focus?

Powered by TurnKey Linux.