# 🕵️ Observability & Logging Standards (The "Black Box" Recorder) **Audience:** Developers & SRE Agents (Arthur Mendes). **Objective:** Turn "It crashed" into "It crashed at line 42 because of variable X". > [!CRITICAL] > **The Arthur Mandate:** > "Logs are for machines first, humans second. If I cannot `grep` or `jq` your log, it is useless noise." ## 1. 📝 Logging Protocol (Structured & Standardized) ### The Format: JSON Lines Text logs are dead. Long live JSON. **Why?** Because Zabbix, ELK, and simple Python scripts can parse it instantly. **❌ BAD (Unstructured Text):** `[2024-01-01 10:00] ERROR: User failed login. ID: 123` **✅ GOOD (Structured JSON):** ```json {"timestamp": "2024-01-01T10:00:00Z", "level": "ERROR", "event": "auth_failure", "user_id": 123, "reason": "invalid_password", "correlation_id": "abc-123"} ``` ### The 4 Levels of Severity 1. **DEBUG**: Raw payloads, variable states. *Enable only during dev/diagnosis.* 2. **INFO**: Business Events. "User Created", "Job Started", "Payment Processed". 3. **WARN**: Recoverable issues. "Retry 1/3 failed", "Config missing (using default)". 4. **ERROR**: Operator intervention required. Stack traces, DB connection loss. ### The "12-Factor" Output * **Containers:** Always write to `stdout` / `stderr`. Never write to local files inside a container (unless using a shared volume for specific audit trails). * **Traceability:** Every request/job MUST generate a `correlation_id` (UUID) at the edge and pass it down the stack. ## 2. 👁️ Monitoring & Zabbix Integration Our monitoring ecosystem is centered on **Zabbix**. Your application must be "Zabbix-Friendly". ### A. The Health Endpoint (`/health`) Every HTTP service MUST allow a Zabbix Web Scenario to check it. * **Path:** `/health` * **Response:** `200 OK` (JSON body optional but recommended: `{"status": "up", "db": "up"}`). * **Timeout:** Must respond in < 200ms. ### B. The "Zabbix Trapper" Pattern (Push Metrics) For batch jobs or async workers (CrewAI Agents), do not wait to be scraped. **PUSH** the metric using `zabbix_sender`. * **Python:** Use `py-zabbix` or raw socket. * **Key Standard:** `app.module.metric` (e.g., `crew.infra.task_duration`). * **When to use:** When a specific task finishes (e.g., "Agent compiled report in 45s"). ### C. Log Monitoring Keywords If you must rely on Log Scrapers (Zabbix Agent Active), use these "Trigger Keywords" to wake up Arthur: * `[SECURITY_BREACH]`: Immediate high alert. * `[DATA_LOSS]`: Critical alert. * `[DEPRECATED]`: Warning alert. ## 3. 🚨 Alerting Philosophy (Actionable Intelligence) Do not log an error if you already handled it gracefully. * **Handled Error:** Log as `WARN`. (No pager). * **Unhandled Code:** Log as `ERROR`. (Wake up Arthur). ## 4. 🤖 The Agent "Self-Diagnosis" (Log Audit) Before considering a task complete, run this audit: - [ ] **JSON Check:** Is my output parsable? - [ ] **Context:** Did I include `user_id`, `file_name`, or `job_id`? - [ ] **Silence:** Did I remove all `print(var)` statements used for debugging? - [ ] **Correlation:** Can I trace this error back to the user request?