3.1 KiB
🕵️ Observability & Logging Standards (The "Black Box" Recorder)
Audience: Developers & SRE Agents (Arthur Mendes). Objective: Turn "It crashed" into "It crashed at line 42 because of variable X".
[!CRITICAL] The Arthur Mandate: "Logs are for machines first, humans second. If I cannot
greporjqyour log, it is useless noise."
1. 📝 Logging Protocol (Structured & Standardized)
The Format: JSON Lines
Text logs are dead. Long live JSON. Why? Because Zabbix, ELK, and simple Python scripts can parse it instantly.
❌ BAD (Unstructured Text):
[2024-01-01 10:00] ERROR: User failed login. ID: 123
✅ GOOD (Structured JSON):
{"timestamp": "2024-01-01T10:00:00Z", "level": "ERROR", "event": "auth_failure", "user_id": 123, "reason": "invalid_password", "correlation_id": "abc-123"}
The 4 Levels of Severity
- DEBUG: Raw payloads, variable states. Enable only during dev/diagnosis.
- INFO: Business Events. "User Created", "Job Started", "Payment Processed".
- WARN: Recoverable issues. "Retry 1/3 failed", "Config missing (using default)".
- ERROR: Operator intervention required. Stack traces, DB connection loss.
The "12-Factor" Output
- Containers: Always write to
stdout/stderr. Never write to local files inside a container (unless using a shared volume for specific audit trails). - Traceability: Every request/job MUST generate a
correlation_id(UUID) at the edge and pass it down the stack.
2. 👁️ Monitoring & Zabbix Integration
Our monitoring ecosystem is centered on Zabbix. Your application must be "Zabbix-Friendly".
A. The Health Endpoint (/health)
Every HTTP service MUST allow a Zabbix Web Scenario to check it.
- Path:
/health - Response:
200 OK(JSON body optional but recommended:{"status": "up", "db": "up"}). - Timeout: Must respond in < 200ms.
B. The "Zabbix Trapper" Pattern (Push Metrics)
For batch jobs or async workers (CrewAI Agents), do not wait to be scraped. PUSH the metric using zabbix_sender.
- Python: Use
py-zabbixor raw socket. - Key Standard:
app.module.metric(e.g.,crew.infra.task_duration). - When to use: When a specific task finishes (e.g., "Agent compiled report in 45s").
C. Log Monitoring Keywords
If you must rely on Log Scrapers (Zabbix Agent Active), use these "Trigger Keywords" to wake up Arthur:
[SECURITY_BREACH]: Immediate high alert.[DATA_LOSS]: Critical alert.[DEPRECATED]: Warning alert.
3. 🚨 Alerting Philosophy (Actionable Intelligence)
Do not log an error if you already handled it gracefully.
- Handled Error: Log as
WARN. (No pager). - Unhandled Code: Log as
ERROR. (Wake up Arthur).
4. 🤖 The Agent "Self-Diagnosis" (Log Audit)
Before considering a task complete, run this audit:
- JSON Check: Is my output parsable?
- Context: Did I include
user_id,file_name, orjob_id? - Silence: Did I remove all
print(var)statements used for debugging? - Correlation: Can I trace this error back to the user request?