Agent Infrastructure

Your agents fail.
You find out first.

Canary detects silent failures in AI agent stacks (loops, timeouts, cron misses, tool errors) and fires an alert in under 10 seconds.

Start monitoring free · View API docs
๐Ÿฆ
Canary Alert
P0 CRITICAL
kaicalls-lead-outreach detected a loop.
Tool hunter_verify called 5× in 60s.
Run run_01ABC · 8 leads unprocessed.

The silent failure problem

AI agents fail in ways that are nearly impossible to catch. They don't crash; they loop, drift, stall, and miss their schedule. You find out when something important didn't happen.

🔁

Infinite loops

Agent calls the same tool repeatedly. No error thrown. No output produced. You wait. Nothing comes back.

โฐ

Missed cron jobs

Your 6am data pipeline didn't run. You find out at noon when the report is missing and clients are asking questions.

🔧

Tool failures pile up

API returns 429. Agent retries. Fails again. Retries. 50 leads unprocessed. You had no idea it was struggling.

🧠

Context overflow

Agent hits context limits, starts truncating, produces partial output. Looks like it ran. Didn't.

📭

No response

Agent started, did something, then went quiet. Is it running? Did it finish? You can't tell.

🔇

Everything looks fine

No exceptions. No alerts. No logs worth reading. But the output is wrong, incomplete, or missing entirely.


How Canary works

Lightweight instrumentation. Deterministic detection. No LLM in the monitoring path.

1

Register your agent

One API call to register. Canary creates an agent identity and tracks all runs against it. Works with any agent framework: OpenClaw, LangChain, custom Python, Node.js scripts.

2

Instrument your runs

Emit events as your agent works. Start a run, log tool calls, emit step completions. The SDK is 5 lines to integrate. Zero dependencies.

Python · 5-line integration
# Import the SDK
import sentinel

# Decorator auto-instruments the function and passes the active run_id
@sentinel.monitor("my-agent-script")
def run(run_id):
    result = call_some_api()
    sentinel.tool_call(run_id, "api_call")
    return result
3

Detection runs automatically

Canary applies deterministic rules to every event stream. No machine learning. No LLM calls. Rules like: 3 identical tool calls in 60s = loop. No events after run started = no response. Heartbeat missed by 2 minutes = cron miss.
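The loop rule above can be sketched as a rolling-window counter. This is an illustrative sketch, not Canary's internal code; the threshold and window values mirror the example rule ("3 identical tool calls in 60s = loop"):

```python
from collections import defaultdict, deque

def make_loop_detector(threshold=3, window_s=60.0):
    """Return a recorder that flags a loop when the same tool is
    called `threshold` times within a rolling `window_s` window."""
    calls = defaultdict(deque)  # tool name -> timestamps of recent calls

    def record(tool, ts):
        q = calls[tool]
        q.append(ts)
        # Drop timestamps that have fallen out of the rolling window.
        while q and ts - q[0] > window_s:
            q.popleft()
        return len(q) >= threshold  # True -> loop detected

    return record

detect = make_loop_detector()
detect("hunter_verify", 0.0)   # False
detect("hunter_verify", 10.0)  # False
detect("hunter_verify", 20.0)  # True: 3 calls inside 60s
```

Because the rule is a pure function of timestamps, the same event stream always produces the same verdict, which is what makes replay and debugging deterministic.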

4

Alert fires in under 10 seconds

When a failure is detected, Canary creates an incident and routes an alert to your channel: Discord, Slack, SMS, or webhook. The alert includes agent name, failure type, severity, affected runs, and a replay link.

5

Replay what happened

Every incident includes a full event replay: the exact sequence of tool calls, step transitions, and errors that led to the failure. Acknowledge, fix, resolve. MTTR drops from hours to minutes.


What Canary detects

Six failure classes, caught by deterministic rules. No model in the loop means no flaky false positives.

Loop

Infinite loops

Same tool or step called N times within a rolling time window. Configurable threshold.

Timeout

Run timeouts

Run exceeds max duration. Default 300s, configurable per agent. Alert before the user notices.

Tool Error

Tool error threshold

5+ tool errors in 60 seconds. API failing, rate limited, or returning garbage.

Cron Miss

Missed heartbeats

Register a heartbeat. Ping it when your cron runs. Miss it by more than the grace period and the alert fires.

No Response

Silent stalls

Run started but no events received within 60 seconds. Something stopped without saying so.

Context

Context overflow

Agent emits a context_warning event. Canary creates an incident immediately, before output degrades.
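An agent-side guard for this event might look like the sketch below. The 90% ratio is an illustrative choice, not a Canary default, and the helper names are hypothetical; the payload fields follow the /v1/events body documented in the API reference:

```python
def should_warn(used_tokens, context_limit, ratio=0.9):
    """True once the context window is at least `ratio` full --
    the point at which the agent should emit context_warning."""
    return used_tokens >= context_limit * ratio

def context_warning_event(agent_id, run_id, used, limit):
    """Assemble a /v1/events body for a context_warning."""
    return {
        "agent_id": agent_id,
        "run_id": run_id,
        "event_type": "context_warning",
        "severity": "error",  # illustrative; pick per your alerting policy
        "message": f"context at {used}/{limit} tokens",
        "metadata": {"used_tokens": used, "limit": limit},
    }

# A 128k-token window, checked before each model call:
should_warn(100_000, 128_000)  # False
should_warn(120_000, 128_000)  # True: 90% threshold crossed
```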


Quick start

From zero to monitored in under 5 minutes.

Step 1 · Register your agent
curl -X POST https://canary.meetkai.xyz/v1/agents \
  -H "Content-Type: application/json" \
  -d '{"name": "my-agent", "id": "my-agent-prod"}'

→ {"agent_id": "my-agent-prod", "name": "my-agent"}
Step 2 · Register a heartbeat (for cron jobs)
curl -X POST https://canary.meetkai.xyz/v1/heartbeats/register \
  -H "Content-Type: application/json" \
  -d '{
    "name": "daily-report",
    "agent_id": "my-agent-prod",
    "interval_seconds": 86400,
    "grace_seconds": 300
  }'

→ {"heartbeat_id": "hb_01XYZ", "ping_url": "..."}
Step 3 · Ping the heartbeat at end of each run
# Python
import requests
requests.post("https://canary.meetkai.xyz/v1/heartbeats/hb_01XYZ/ping")

# Shell (add to end of your cron script)
curl -s -X POST https://canary.meetkai.xyz/v1/heartbeats/hb_01XYZ/ping
Step 4 · Emit events from agent code (Python SDK)
# Download SDK
curl -O https://canary.meetkai.xyz/sdk/sentinel.py

# Use in your script
import sentinel

@sentinel.monitor("my-agent")
def run():
    # Your existing code; Canary wraps it
    result = process_leads()
    return result

API reference

Base URL: https://canary.meetkai.xyz

Agents

POST
/v1/agents
Register an agent. Returns agent_id used to associate all runs and events.
GET
/v1/agents/:agent_id/runs
List recent runs for an agent. Filter by status: running, success, failed, timeout.

Events

POST
/v1/events
Ingest an event. Triggers failure detection automatically. Body: agent_id, run_id, event_type, severity, message, metadata.
GET
/v1/runs/:run_id/events
Get all events for a run. Used for incident replay.
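As a minimal sketch of the ingest endpoint, the event body can be assembled and POSTed with the standard library alone (payload values here are illustrative, and the helper names are ours, not SDK functions):

```python
import json
import urllib.request

BASE = "https://canary.meetkai.xyz"

def build_event(agent_id, run_id, event_type, severity, message, metadata=None):
    """Assemble the /v1/events body with the fields listed above."""
    return {
        "agent_id": agent_id,
        "run_id": run_id,
        "event_type": event_type,
        "severity": severity,
        "message": message,
        "metadata": metadata or {},
    }

def post_event(**fields):
    """POST one event; Canary runs its detection rules on ingest."""
    req = urllib.request.Request(
        f"{BASE}/v1/events",
        data=json.dumps(build_event(**fields)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)

# Example: report a rate-limited tool call
# post_event(agent_id="my-agent-prod", run_id="run_01ABC",
#            event_type="tool.failure", severity="error",
#            message="hunter_verify returned 429",
#            metadata={"tool": "hunter_verify", "status": 429})
```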

Heartbeats

POST
/v1/heartbeats/register
Register a named heartbeat with interval and grace period. Returns a ping URL.
POST
/v1/heartbeats/:id/ping
Ping a heartbeat. Call this at the end of every successful cron run.

Incidents

GET
/v1/incidents
List incidents. Filter by agent_id, status (open/acknowledged/resolved), severity.
POST
/v1/incidents/:id/acknowledge
Acknowledge an incident. Silences repeat alerts. Attach a note.
POST
/v1/incidents/:id/resolve
Mark incident resolved. Closes the loop.
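A small client-side triage sketch over these endpoints. The field names mirror the filters above; `open_p0s` and `ack_url` are hypothetical helpers, and the sample data is invented for illustration:

```python
BASE = "https://canary.meetkai.xyz"

def open_p0s(incidents):
    """From a GET /v1/incidents response, keep incidents that are
    still open and critical -- the P0s worth paging on."""
    return [i for i in incidents
            if i["status"] == "open" and i["severity"] == "critical"]

def ack_url(incident_id, base=BASE):
    """URL to POST when acknowledging; silences repeat alerts."""
    return f"{base}/v1/incidents/{incident_id}/acknowledge"

sample = [
    {"id": "inc_1", "status": "open", "severity": "critical"},
    {"id": "inc_2", "status": "resolved", "severity": "critical"},
    {"id": "inc_3", "status": "open", "severity": "error"},
]
[i["id"] for i in open_p0s(sample)]  # → ["inc_1"]
```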
Event types
tool_call       → agent called a tool
tool.failure    → tool call failed (triggers error threshold detection)
run.step        → agent completed a step
run.fail        → agent run failed explicitly
run.timeout     → agent exceeded time limit
loop_detected   → manual loop signal
context_warning → context window nearing limit (triggers immediate incident)
Severity levels
info     → normal operation, logged only
warn     → something unusual, no alert
error    → failure detected, alert fires
critical → P0, immediate alert, escalation path

Pricing

Start free. Pay when it's saving you money.

Starter
Free
Forever, no card needed
  • 3 agents
  • 500 runs/month
  • Discord + webhook alerts
  • 7-day event retention
  • Heartbeat monitoring
Scale
$399/mo
For agencies and platforms
  • 100 agents
  • Unlimited runs
  • All alert channels
  • Unlimited retention
  • Courier File Bus included
  • Priority support