71 lines
3.1 KiB
Markdown
71 lines
3.1 KiB
Markdown
# 🕵️ Observability & Logging Standards (The "Black Box" Recorder)
|
|
|
|
**Audience:** Developers & SRE Agents (Arthur Mendes).
|
|
**Objective:** Turn "It crashed" into "It crashed at line 42 because of variable X".
|
|
|
|
> [!CRITICAL]
|
|
> **The Arthur Mandate:**
|
|
> "Logs are for machines first, humans second. If I cannot `grep` or `jq` your log, it is useless noise."
|
|
|
|
## 1. 📝 Logging Protocol (Structured & Standardized)
|
|
|
|
### The Format: JSON Lines
|
|
Text logs are dead. Long live JSON.
|
|
**Why?** Because Zabbix, ELK, and simple Python scripts can parse it instantly.
|
|
|
|
**❌ BAD (Unstructured Text):**
|
|
`[2024-01-01 10:00] ERROR: User failed login. ID: 123`
|
|
|
|
**✅ GOOD (Structured JSON):**
|
|
```json
|
|
{"timestamp": "2024-01-01T10:00:00Z", "level": "ERROR", "event": "auth_failure", "user_id": 123, "reason": "invalid_password", "correlation_id": "abc-123"}
|
|
```
|
|
|
|
### The 4 Levels of Severity
|
|
1. **DEBUG**: Raw payloads, variable states. *Enable only during dev/diagnosis.*
|
|
2. **INFO**: Business Events. "User Created", "Job Started", "Payment Processed".
|
|
3. **WARN**: Recoverable issues. "Retry 1/3 failed", "Config missing (using default)".
|
|
4. **ERROR**: Operator intervention required. Stack traces, DB connection loss.
|
|
|
|
### The "12-Factor" Output
|
|
* **Containers:** Always write to `stdout` / `stderr`. Never write to local files inside a container (unless using a shared volume for specific audit trails).
|
|
* **Traceability:** Every request/job MUST generate a `correlation_id` (UUID) at the edge and pass it down the stack.
|
|
|
|
## 2. 👁️ Monitoring & Zabbix Integration
|
|
|
|
Our monitoring ecosystem is centered on **Zabbix**. Your application must be "Zabbix-Friendly".
|
|
|
|
### A. The Health Endpoint (`/health`)
|
|
Every HTTP service MUST allow a Zabbix Web Scenario to check it.
|
|
* **Path:** `/health`
|
|
* **Response:** `200 OK` (JSON body optional but recommended: `{"status": "up", "db": "up"}`).
|
|
* **Timeout:** Must respond in < 200ms.
|
|
|
|
### B. The "Zabbix Trapper" Pattern (Push Metrics)
|
|
For batch jobs or async workers (CrewAI Agents), do not wait to be scraped. **PUSH** the metric using `zabbix_sender`.
|
|
|
|
* **Python:** Use `py-zabbix` or raw socket.
|
|
* **Key Standard:** `app.module.metric` (e.g., `crew.infra.task_duration`).
|
|
* **When to use:** When a specific task finishes (e.g., "Agent compiled report in 45s").
|
|
|
|
### C. Log Monitoring Keywords
|
|
If you must rely on Log Scrapers (Zabbix Agent Active), use these "Trigger Keywords" to wake up Arthur:
|
|
* `[SECURITY_BREACH]`: Immediate high alert.
|
|
* `[DATA_LOSS]`: Critical alert.
|
|
* `[DEPRECATED]`: Warning alert.
|
|
|
|
## 3. 🚨 Alerting Philosophy (Actionable Intelligence)
|
|
|
|
Do not log an error if you already handled it gracefully.
|
|
* **Handled Error:** Log as `WARN`. (No pager).
|
|
* **Unhandled Code:** Log as `ERROR`. (Wake up Arthur).
|
|
|
|
## 4. 🤖 The Agent "Self-Diagnosis" (Log Audit)
|
|
|
|
Before considering a task complete, run this audit:
|
|
|
|
- [ ] **JSON Check:** Is my output parsable?
|
|
- [ ] **Context:** Did I include `user_id`, `file_name`, or `job_id`?
|
|
- [ ] **Silence:** Did I remove all `print(var)` statements used for debugging?
|
|
- [ ] **Correlation:** Can I trace this error back to the user request?
|