Canary detects silent failures in AI agent stacks (loops, timeouts, cron misses, tool errors) and fires an alert in under 10 seconds.
AI agents fail in ways that are nearly impossible to catch. They don't crash; they loop, drift, stall, and miss their schedule. You find out when something important didn't happen.
Agent calls the same tool repeatedly. No error thrown. No output produced. You wait. Nothing comes back.
Your 6am data pipeline didn't run. You find out at noon when the report is missing and clients are asking questions.
API returns 429. Agent retries. Fails again. Retries. 50 leads unprocessed. You had no idea it was struggling.
Agent hits context limits, starts truncating, produces partial output. Looks like it ran. Didn't.
Agent started. Started doing something. Then went quiet. Is it running? Did it finish? You can't tell.
No exceptions. No alerts. No logs worth reading. But the output is wrong, incomplete, or missing entirely.
Lightweight instrumentation. Deterministic detection. No LLM in the monitoring path.
One API call to register. Canary creates an agent identity and tracks all runs against it. Works with any agent framework: OpenClaw, LangChain, custom Python, Node.js scripts.
Emit events as your agent works. Start a run, log tool calls, emit step completions. The SDK is 5 lines to integrate. Zero dependencies.
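# Import the SDK
import sentinel

# Decorator auto-instruments the function
@sentinel.monitor("my-agent-script")
def run():
    result = call_some_api()  # placeholder for your own API call
    # NOTE: run_id is not defined in this snippet; it is assumed to be
    # supplied by the SDK for the current run.
    sentinel.tool_call(run_id, "api_call")
    return result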
Canary applies deterministic rules to every event stream. No machine learning. No LLM calls. Rules like: 3 identical tool calls in 60s = loop. No events after run started = no response. Heartbeat missed by 2 minutes = cron miss.
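The loop rule above (3 identical tool calls in 60s) can be sketched as a rolling-window counter. This is a minimal illustration, not Canary's actual implementation: the `LoopDetector` class and its method names are hypothetical, and only the threshold and window values come from the text.

```python
from collections import deque


class LoopDetector:
    """Flags a loop when one tool is called `threshold` times within `window_s` seconds."""

    def __init__(self, threshold: int = 3, window_s: float = 60.0):
        self.threshold = threshold
        self.window_s = window_s
        self._calls = {}  # tool name -> deque of call timestamps (seconds)

    def record(self, tool: str, ts: float) -> bool:
        """Record a call at timestamp `ts`; return True if the loop rule fires."""
        calls = self._calls.setdefault(tool, deque())
        calls.append(ts)
        # Drop calls that have fallen out of the rolling window.
        while calls and ts - calls[0] > self.window_s:
            calls.popleft()
        return len(calls) >= self.threshold


detector = LoopDetector(threshold=3, window_s=60.0)
# Three calls to the same tool within 20 seconds: the third one trips the rule.
results = [detector.record("search_api", t) for t in (0.0, 10.0, 20.0)]
```

Because the rule is a plain count over timestamps, the same input always yields the same verdict, which is the point of keeping an LLM out of the monitoring path.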
When a failure is detected, Canary creates an incident and routes an alert to your channel: Discord, Slack, SMS, or webhook. Alert includes agent name, failure type, severity, affected runs, and a replay link.
Every incident includes a full event replay: the exact sequence of tool calls, step transitions, and errors that led to the failure. Acknowledge, fix, resolve. MTTR drops from hours to minutes.
Six failure classes, deterministic rules, no false positives from overfitting.
Same tool or step called N times within a rolling time window. Configurable threshold.
Run exceeds max duration. Default 300s, configurable per agent. Alert before the user notices.
5+ tool errors in 60 seconds. API failing, rate limited, or returning garbage.
Register a heartbeat. Ping it when your cron runs. Miss it by more than the grace period and an alert fires.
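As a rough sketch of the heartbeat rule, the check reduces to comparing the current time against the last ping plus the interval and grace period. The function name `heartbeat_missed` is hypothetical, and how Canary evaluates this server-side is an assumption.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_missed(last_ping: datetime, now: datetime,
                     interval_seconds: int, grace_seconds: int) -> bool:
    """Return True when the next expected ping is overdue by more than the grace period."""
    deadline = last_ping + timedelta(seconds=interval_seconds + grace_seconds)
    return now > deadline


# A daily job (86400s interval) with a 5-minute (300s) grace period.
last = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
on_time = heartbeat_missed(last, last + timedelta(hours=24, minutes=4), 86400, 300)
late = heartbeat_missed(last, last + timedelta(hours=24, minutes=6), 86400, 300)
```

Here `on_time` is False (4 minutes late is inside the grace period) and `late` is True (6 minutes late is outside it).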
Run started but no events received within 60 seconds. Something stopped without saying so.
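The timeout and no-response rules above can both be expressed as checks on a run's timestamps. The sketch below assumes times in seconds; the `check_run` function and its return labels are hypothetical, while the 300s and 60s defaults come from the text.

```python
from typing import Optional


def check_run(started_at: float, last_event_at: Optional[float], now: float,
              max_duration_s: float = 300.0, silence_s: float = 60.0) -> Optional[str]:
    """Return "timeout", "no_response", or None for a run that is still in progress."""
    elapsed = now - started_at
    if elapsed > max_duration_s:
        return "timeout"
    # No event at all since the run started, and the silence window has passed.
    if last_event_at is None and elapsed > silence_s:
        return "no_response"
    return None


silent = check_run(started_at=0.0, last_event_at=None, now=61.0)
healthy = check_run(started_at=0.0, last_event_at=5.0, now=61.0)
overdue = check_run(started_at=0.0, last_event_at=5.0, now=301.0)
```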
Agent emits a context_warning event. Canary creates an incident immediately, before output degrades.
From zero to monitored in under 5 minutes.
curl -X POST https://canary.meetkai.xyz/v1/agents \
-H "Content-Type: application/json" \
-d '{"name": "my-agent", "id": "my-agent-prod"}'
→ {"agent_id": "my-agent-prod", "name": "my-agent"}
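For scripts that register agents from Python rather than curl, the same request can be built with the standard library. This is a sketch that mirrors the curl call above; the helper name `build_register_request` is hypothetical, and it only constructs the request so it can be inspected before anything is sent.

```python
import json
import urllib.request

BASE_URL = "https://canary.meetkai.xyz"


def build_register_request(name: str, agent_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST /v1/agents request from the curl example."""
    body = json.dumps({"name": name, "id": agent_id}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/agents",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_register_request("my-agent", "my-agent-prod")
# To actually send it:
#   with urllib.request.urlopen(req, timeout=10) as resp:
#       print(json.load(resp))
```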
curl -X POST https://canary.meetkai.xyz/v1/heartbeats/register \
-H "Content-Type: application/json" \
-d '{
"name": "daily-report",
"agent_id": "my-agent-prod",
"interval_seconds": 86400,
"grace_seconds": 300
}'
→ {"heartbeat_id": "hb_01XYZ", "ping_url": "..."}
# Python
import requests
requests.post("https://canary.meetkai.xyz/v1/heartbeats/hb_01XYZ/ping")

# Shell (add to end of your cron script)
curl -s -X POST https://canary.meetkai.xyz/v1/heartbeats/hb_01XYZ/ping
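One design point worth making explicit: the ping should only go out when the job actually succeeded, otherwise a crashed job still looks healthy. A minimal sketch, assuming a hypothetical `run_with_heartbeat` wrapper with the ping passed in as a callable so it can be swapped for `requests.post` in real use:

```python
from typing import Callable

PING_URL = "https://canary.meetkai.xyz/v1/heartbeats/hb_01XYZ/ping"


def run_with_heartbeat(job: Callable[[], object], ping: Callable[[str], object]) -> object:
    """Run `job`, then ping the heartbeat only if it finished without raising."""
    result = job()  # an exception here skips the ping, so the miss gets detected
    ping(PING_URL)
    return result


# Demonstration with a stand-in ping that just records the URL.
sent = []
run_with_heartbeat(lambda: "report built", sent.append)

# In a real cron script the ping might be:
#   run_with_heartbeat(build_report, lambda url: requests.post(url, timeout=10))
```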
# Download SDK
curl -O https://canary.meetkai.xyz/sdk/sentinel.py

# Use in your script
import sentinel

@sentinel.monitor("my-agent")
def run():
    # Your existing code; Canary wraps it
    result = process_leads()
    return result
Base URL: https://canary.meetkai.xyz
tool_call - agent called a tool
tool.failure - tool call failed (triggers error threshold detection)
run.step - agent completed a step
run.fail - agent run failed explicitly
run.timeout - agent exceeded time limit
loop_detected - manual loop signal
context_warning - context window nearing limit (triggers immediate incident)
info - normal operation, logged only
warn - something unusual, no alert
error - failure detected, alert fires
critical - P0, immediate alert, escalation path
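The severity levels above map to a simple alert/no-alert decision. The sketch below encodes that mapping on the client side; the `Severity` enum and `should_alert` helper are hypothetical names, and only the four level strings and their alerting behavior come from the list.

```python
from enum import Enum


class Severity(Enum):
    INFO = "info"          # normal operation, logged only
    WARN = "warn"          # something unusual, no alert
    ERROR = "error"        # failure detected, alert fires
    CRITICAL = "critical"  # P0, immediate alert, escalation path


ALERTING = {Severity.ERROR, Severity.CRITICAL}


def should_alert(level: str) -> bool:
    """Return True if an event at this severity fires an alert.

    Raises ValueError for a level that is not one of the four listed.
    """
    return Severity(level) in ALERTING


decisions = {level.value: should_alert(level.value) for level in Severity}
```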
Start free. Pay when it's saving you money.