minions-ai-agents/antigravity_brain_export/knowledge/observability_standards.md

71 lines
3.1 KiB
Markdown

# 🕵️ Observability & Logging Standards (The "Black Box" Recorder)
**Audience:** Developers & SRE Agents (Arthur Mendes).
**Objective:** Turn "It crashed" into "It crashed at line 42 because of variable X".
> [!CRITICAL]
> **The Arthur Mandate:**
> "Logs are for machines first, humans second. If I cannot `grep` or `jq` your log, it is useless noise."
## 1. 📝 Logging Protocol (Structured & Standardized)
### The Format: JSON Lines
Text logs are dead. Long live JSON.
**Why?** Because Zabbix, ELK, and simple Python scripts can parse it instantly.
**❌ BAD (Unstructured Text):**
`[2024-01-01 10:00] ERROR: User failed login. ID: 123`
**✅ GOOD (Structured JSON):**
```json
{"timestamp": "2024-01-01T10:00:00Z", "level": "ERROR", "event": "auth_failure", "user_id": 123, "reason": "invalid_password", "correlation_id": "abc-123"}
```
### The 4 Levels of Severity
1. **DEBUG**: Raw payloads, variable states. *Enable only during dev/diagnosis.*
2. **INFO**: Business Events. "User Created", "Job Started", "Payment Processed".
3. **WARN**: Recoverable issues. "Retry 1/3 failed", "Config missing (using default)".
4. **ERROR**: Operator intervention required. Stack traces, DB connection loss.
### The "12-Factor" Output
* **Containers:** Always write to `stdout` / `stderr`. Never write to local files inside a container (unless using a shared volume for specific audit trails).
* **Traceability:** Every request/job MUST generate a `correlation_id` (UUID) at the edge and pass it down the stack.
## 2. 👁️ Monitoring & Zabbix Integration
Our monitoring ecosystem is centered on **Zabbix**. Your application must be "Zabbix-Friendly".
### A. The Health Endpoint (`/health`)
Every HTTP service MUST allow a Zabbix Web Scenario to check it.
* **Path:** `/health`
* **Response:** `200 OK` (JSON body optional but recommended: `{"status": "up", "db": "up"}`).
* **Timeout:** Must respond in < 200ms.
### B. The "Zabbix Trapper" Pattern (Push Metrics)
For batch jobs or async workers (CrewAI Agents), do not wait to be scraped. **PUSH** the metric using `zabbix_sender`.
* **Python:** Use `py-zabbix` or raw socket.
* **Key Standard:** `app.module.metric` (e.g., `crew.infra.task_duration`).
* **When to use:** When a specific task finishes (e.g., "Agent compiled report in 45s").
### C. Log Monitoring Keywords
If you must rely on Log Scrapers (Zabbix Agent Active), use these "Trigger Keywords" to wake up Arthur:
* `[SECURITY_BREACH]`: Immediate high alert.
* `[DATA_LOSS]`: Critical alert.
* `[DEPRECATED]`: Warning alert.
## 3. 🚨 Alerting Philosophy (Actionable Intelligence)
Do not log an error if you already handled it gracefully.
* **Handled Error:** Log as `WARN`. (No pager).
* **Unhandled Code:** Log as `ERROR`. (Wake up Arthur).
## 4. 🤖 The Agent "Self-Diagnosis" (Log Audit)
Before considering a task complete, run this audit:
- [ ] **JSON Check:** Is my output parsable?
- [ ] **Context:** Did I include `user_id`, `file_name`, or `job_id`?
- [ ] **Silence:** Did I remove all `print(var)` statements used for debugging?
- [ ] **Correlation:** Can I trace this error back to the user request?