minions-ai-agents/antigravity_brain_export/knowledge/observability_standards.md

3.1 KiB

🕵️ Observability & Logging Standards (The "Black Box" Recorder)

Audience: Developers & SRE Agents (Arthur Mendes). Objective: Turn "It crashed" into "It crashed at line 42 because of variable X".

[!CRITICAL] The Arthur Mandate: "Logs are for machines first, humans second. If I cannot grep or jq your log, it is useless noise."

1. 📝 Logging Protocol (Structured & Standardized)

The Format: JSON Lines

Text logs are dead. Long live JSON. Why? Because Zabbix, ELK, and simple Python scripts can parse it instantly.

BAD (Unstructured Text): [2024-01-01 10:00] ERROR: User failed login. ID: 123

GOOD (Structured JSON):

{"timestamp": "2024-01-01T10:00:00Z", "level": "ERROR", "event": "auth_failure", "user_id": 123, "reason": "invalid_password", "correlation_id": "abc-123"}

The 4 Levels of Severity

  1. DEBUG: Raw payloads, variable states. Enable only during dev/diagnosis.
  2. INFO: Business Events. "User Created", "Job Started", "Payment Processed".
  3. WARN: Recoverable issues. "Retry 1/3 failed", "Config missing (using default)".
  4. ERROR: Operator intervention required. Stack traces, DB connection loss.

The "12-Factor" Output

  • Containers: Always write to stdout / stderr. Never write to local files inside a container (unless using a shared volume for specific audit trails).
  • Traceability: Every request/job MUST generate a correlation_id (UUID) at the edge and pass it down the stack.

2. 👁️ Monitoring & Zabbix Integration

Our monitoring ecosystem is centered on Zabbix. Your application must be "Zabbix-Friendly".

A. The Health Endpoint (/health)

Every HTTP service MUST allow a Zabbix Web Scenario to check it.

  • Path: /health
  • Response: 200 OK (JSON body optional but recommended: {"status": "up", "db": "up"}).
  • Timeout: Must respond in < 200ms.

B. The "Zabbix Trapper" Pattern (Push Metrics)

For batch jobs or async workers (CrewAI Agents), do not wait to be scraped. PUSH the metric using zabbix_sender.

  • Python: Use py-zabbix or raw socket.
  • Key Standard: app.module.metric (e.g., crew.infra.task_duration).
  • When to use: When a specific task finishes (e.g., "Agent compiled report in 45s").

C. Log Monitoring Keywords

If you must rely on Log Scrapers (Zabbix Agent Active), use these "Trigger Keywords" to wake up Arthur:

  • [SECURITY_BREACH]: Immediate high alert.
  • [DATA_LOSS]: Critical alert.
  • [DEPRECATED]: Warning alert.

3. 🚨 Alerting Philosophy (Actionable Intelligence)

Do not log an error if you already handled it gracefully.

  • Handled Error: Log as WARN. (No pager).
  • Unhandled Code: Log as ERROR. (Wake up Arthur).

4. 🤖 The Agent "Self-Diagnosis" (Log Audit)

Before considering a task complete, run this audit:

  • JSON Check: Is my output parsable?
  • Context: Did I include user_id, file_name, or job_id?
  • Silence: Did I remove all print(var) statements used for debugging?
  • Correlation: Can I trace this error back to the user request?