minions-ai-agents/antigravity_brain_export/knowledge/observability_standards.md

# 🕵️ Observability & Logging Standards (The "Black Box" Recorder)

**Audience:** Developers & SRE Agents (Arthur Mendes).
**Objective:** Turn "It crashed" into "It crashed at line 42 because of variable X".

> [!CRITICAL]
> **The Arthur Mandate:**
> "Logs are for machines first, humans second. If I cannot `grep` or `jq` your log, it is useless noise."

## 1. 📝 Logging Protocol (Structured & Standardized)

### The Format: JSON Lines
Text logs are dead. Long live JSON.
**Why?** Because Zabbix, ELK, and simple Python scripts can parse it instantly.

**❌ BAD (Unstructured Text):**
`[2024-01-01 10:00] ERROR: User failed login. ID: 123`

**✅ GOOD (Structured JSON):**
```json
{"timestamp": "2024-01-01T10:00:00Z", "level": "ERROR", "event": "auth_failure", "user_id": 123, "reason": "invalid_password", "correlation_id": "abc-123"}
```

### The 4 Levels of Severity
1.  **DEBUG**: Raw payloads, variable states. *Enable only during dev/diagnosis.*
2.  **INFO**: Business Events. "User Created", "Job Started", "Payment Processed".
3.  **WARN**: Recoverable issues. "Retry 1/3 failed", "Config missing (using default)".
4.  **ERROR**: Operator intervention required. Stack traces, DB connection loss.

### The "12-Factor" Output
*   **Containers:** Always write to `stdout` / `stderr`. Never write to local files inside a container (unless using a shared volume for specific audit trails).
*   **Traceability:** Every request/job MUST generate a `correlation_id` (UUID) at the edge and pass it down the stack.

## 2. 👁️ Monitoring & Zabbix Integration

Our monitoring ecosystem is centered on **Zabbix**. Your application must be "Zabbix-Friendly".

### A. The Health Endpoint (`/health`)
Every HTTP service MUST allow a Zabbix Web Scenario to check it.
*   **Path:** `/health`
*   **Response:** `200 OK` (JSON body optional but recommended: `{"status": "up", "db": "up"}`).
*   **Timeout:** Must respond in < 200ms.

### B. The "Zabbix Trapper" Pattern (Push Metrics)
For batch jobs or async workers (CrewAI Agents), do not wait to be scraped. **PUSH** the metric using `zabbix_sender`.

*   **Python:** Use `py-zabbix` or raw socket.
*   **Key Standard:** `app.module.metric` (e.g., `crew.infra.task_duration`).
*   **When to use:** When a specific task finishes (e.g., "Agent compiled report in 45s").

### C. Log Monitoring Keywords
If you must rely on Log Scrapers (Zabbix Agent Active), use these "Trigger Keywords" to wake up Arthur:
*   `[SECURITY_BREACH]`: Immediate high alert.
*   `[DATA_LOSS]`: Critical alert.
*   `[DEPRECATED]`: Warning alert.

## 3. 🚨 Alerting Philosophy (Actionable Intelligence)

Do not log an error if you already handled it gracefully.
*   **Handled Error:** Log as `WARN`. (No pager).
*   **Unhandled Code:** Log as `ERROR`. (Wake up Arthur).

## 4. 🤖 The Agent "Self-Diagnosis" (Log Audit)

Before considering a task complete, run this audit:

- [ ] **JSON Check:** Is my output parsable?
- [ ] **Context:** Did I include `user_id`, `file_name`, or `job_id`?
- [ ] **Silence:** Did I remove all `print(var)` statements used for debugging?
- [ ] **Correlation:** Can I trace this error back to the user request?