Agent Infrastructure

Your agents fail.
You find out first.

Canary detects silent failures in AI agent stacks (loops, timeouts, cron misses, tool errors) and fires an alert in under 10 seconds.

Start monitoring free · View API docs
๐Ÿฆ
Canary Alert
P0 CRITICAL
kaicalls-lead-outreach detected a loop.
Tool hunter_verify called 5× in 60s.
Run run_01ABC · 8 leads unprocessed.

The silent failure problem

AI agents fail in ways that are nearly impossible to catch. They don't crash; they loop, drift, stall, and miss their schedule. You find out when something important didn't happen.

🔁

Infinite loops

Agent calls the same tool repeatedly. No error thrown. No output produced. You wait. Nothing comes back.

โฐ

Missed cron jobs

Your 6am data pipeline didn't run. You find out at noon when the report is missing and clients are asking questions.

🔧

Tool failures pile up

API returns 429. Agent retries. Fails again. Retries. 50 leads unprocessed. You had no idea it was struggling.

🧠

Context overflow

Agent hits context limits, starts truncating, produces partial output. Looks like it ran. Didn't.

📭

No response

Agent started, did something, then went quiet. Is it running? Did it finish? You can't tell.

🔇

Everything looks fine

No exceptions. No alerts. No logs worth reading. But the output is wrong, incomplete, or missing entirely.


How Canary works

Lightweight instrumentation. Deterministic detection. No LLM in the monitoring path.

1

Register your agent

One API call to register. Canary creates an agent identity and tracks all runs against it. Works with any agent framework: OpenClaw, LangChain, custom Python, Node.js scripts.

2

Instrument your runs

Emit events as your agent works. Start a run, log tool calls, emit step completions. The SDK is 5 lines to integrate. Zero dependencies.

Python · 5-line integration
# Import the SDK
import sentinel

# Decorator auto-instruments the function and passes the active run_id
@sentinel.monitor("my-agent-script")
def run(run_id):
    result = call_some_api()
    sentinel.tool_call(run_id, "api_call")
    return result
3

Detection runs automatically

Canary applies deterministic rules to every event stream. No machine learning. No LLM calls. Rules like: 3 identical tool calls in 60s = loop. No events after run started = no response. Heartbeat missed by 2 minutes = cron miss.
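The loop rule above can be sketched as a rolling-window counter. This is an illustrative sketch, not Canary's internal code; the threshold and window values mirror the example rule ("3 identical tool calls in 60s = loop"):

```python
from collections import defaultdict, deque

def make_loop_detector(threshold=3, window_s=60.0):
    """Return a recorder that flags a loop when the same tool is
    called `threshold` times within a rolling `window_s` window."""
    calls = defaultdict(deque)  # tool name -> timestamps of recent calls

    def record(tool, ts):
        q = calls[tool]
        q.append(ts)
        # Drop timestamps that have fallen out of the rolling window.
        while q and ts - q[0] > window_s:
            q.popleft()
        return len(q) >= threshold  # True -> loop detected

    return record

detect = make_loop_detector()
detect("hunter_verify", 0.0)   # False
detect("hunter_verify", 10.0)  # False
detect("hunter_verify", 20.0)  # True: 3 calls inside 60s
```

Because the rule is a pure function of timestamps, the same event stream always produces the same verdict, which is what makes replay and debugging deterministic.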

4

Alert fires in under 10 seconds

When a failure is detected, Canary creates an incident and routes an alert to your channel: Discord, Slack, SMS, or webhook. The alert includes agent name, failure type, severity, affected runs, and a replay link.

5

Replay what happened

Every incident includes a full event replay: the exact sequence of tool calls, step transitions, and errors that led to the failure. Acknowledge, fix, resolve. MTTR drops from hours to minutes.


What Canary detects

Six failure classes, caught by deterministic rules. No model in the loop means no flaky false positives.

Loop

Infinite loops

Same tool or step called N times within a rolling time window. Configurable threshold.

Timeout

Run timeouts

Run exceeds max duration. Default 300s, configurable per agent. Alert before the user notices.

Tool Error

Tool error threshold

5+ tool errors in 60 seconds. API failing, rate limited, or returning garbage.

Cron Miss

Missed heartbeats

Register a heartbeat. Ping it when your cron runs. Miss it by more than the grace period and the alert fires.

No Response

Silent stalls

Run started but no events received within 60 seconds. Something stopped without saying so.

Context

Context overflow

Agent emits a context_warning event. Canary creates an incident immediately, before output degrades.
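An agent-side guard for this event might look like the sketch below. The 90% ratio is an illustrative choice, not a Canary default, and the helper names are hypothetical; the payload fields follow the /v1/events body documented in the API reference:

```python
def should_warn(used_tokens, context_limit, ratio=0.9):
    """True once the context window is at least `ratio` full --
    the point at which the agent should emit context_warning."""
    return used_tokens >= context_limit * ratio

def context_warning_event(agent_id, run_id, used, limit):
    """Assemble a /v1/events body for a context_warning."""
    return {
        "agent_id": agent_id,
        "run_id": run_id,
        "event_type": "context_warning",
        "severity": "error",  # illustrative; pick per your alerting policy
        "message": f"context at {used}/{limit} tokens",
        "metadata": {"used_tokens": used, "limit": limit},
    }

# A 128k-token window, checked before each model call:
should_warn(100_000, 128_000)  # False
should_warn(120_000, 128_000)  # True: 90% threshold crossed
```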


Quick start

From zero to monitored in under 5 minutes.

Step 1 · Register your agent
curl -X POST https://canary.meetkai.xyz/v1/agents \
  -H "Content-Type: application/json" \
  -d '{"name": "my-agent", "id": "my-agent-prod"}'

→ {"agent_id": "my-agent-prod", "name": "my-agent"}
Step 2 · Register a heartbeat (for cron jobs)
curl -X POST https://canary.meetkai.xyz/v1/heartbeats/register \
  -H "Content-Type: application/json" \
  -d '{
    "name": "daily-report",
    "agent_id": "my-agent-prod",
    "interval_seconds": 86400,
    "grace_seconds": 300
  }'

→ {"heartbeat_id": "hb_01XYZ", "ping_url": "..."}
Step 3 · Ping the heartbeat at end of each run
# Python
import requests
requests.post("https://canary.meetkai.xyz/v1/heartbeats/hb_01XYZ/ping")

# Shell (add to end of your cron script)
curl -s -X POST https://canary.meetkai.xyz/v1/heartbeats/hb_01XYZ/ping
Step 4 · Emit events from agent code (Python SDK)
# Download SDK
curl -O https://canary.meetkai.xyz/sdk/sentinel.py

# Use in your script
import sentinel

@sentinel.monitor("my-agent")
def run():
    # Your existing code; Canary wraps it
    result = process_leads()
    return result

API reference

Base URL: https://canary.meetkai.xyz

Agents

POST
/v1/agents
Register an agent. Returns agent_id used to associate all runs and events.
GET
/v1/agents/:agent_id/runs
List recent runs for an agent. Filter by status: running, success, failed, timeout.

Events

POST
/v1/events
Ingest an event. Triggers failure detection automatically. Body: agent_id, run_id, event_type, severity, message, metadata.
GET
/v1/runs/:run_id/events
Get all events for a run. Used for incident replay.
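As a minimal sketch of the ingest endpoint, the event body can be assembled and POSTed with the standard library alone (payload values here are illustrative, and the helper names are ours, not SDK functions):

```python
import json
import urllib.request

BASE = "https://canary.meetkai.xyz"

def build_event(agent_id, run_id, event_type, severity, message, metadata=None):
    """Assemble the /v1/events body with the fields listed above."""
    return {
        "agent_id": agent_id,
        "run_id": run_id,
        "event_type": event_type,
        "severity": severity,
        "message": message,
        "metadata": metadata or {},
    }

def post_event(**fields):
    """POST one event; Canary runs its detection rules on ingest."""
    req = urllib.request.Request(
        f"{BASE}/v1/events",
        data=json.dumps(build_event(**fields)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)

# Example: report a rate-limited tool call
# post_event(agent_id="my-agent-prod", run_id="run_01ABC",
#            event_type="tool.failure", severity="error",
#            message="hunter_verify returned 429",
#            metadata={"tool": "hunter_verify", "status": 429})
```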

Heartbeats

POST
/v1/heartbeats/register
Register a named heartbeat with interval and grace period. Returns a ping URL.
POST
/v1/heartbeats/:id/ping
Ping a heartbeat. Call this at the end of every successful cron run.

Incidents

GET
/v1/incidents
List incidents. Filter by agent_id, status (open/acknowledged/resolved), severity.
POST
/v1/incidents/:id/acknowledge
Acknowledge an incident. Silences repeat alerts. Attach a note.
POST
/v1/incidents/:id/resolve
Mark incident resolved. Closes the loop.
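A small client-side triage sketch over these endpoints. The field names mirror the filters above; `open_p0s` and `ack_url` are hypothetical helpers, and the sample data is invented for illustration:

```python
BASE = "https://canary.meetkai.xyz"

def open_p0s(incidents):
    """From a GET /v1/incidents response, keep incidents that are
    still open and critical -- the P0s worth paging on."""
    return [i for i in incidents
            if i["status"] == "open" and i["severity"] == "critical"]

def ack_url(incident_id, base=BASE):
    """URL to POST when acknowledging; silences repeat alerts."""
    return f"{base}/v1/incidents/{incident_id}/acknowledge"

sample = [
    {"id": "inc_1", "status": "open", "severity": "critical"},
    {"id": "inc_2", "status": "resolved", "severity": "critical"},
    {"id": "inc_3", "status": "open", "severity": "error"},
]
[i["id"] for i in open_p0s(sample)]  # → ["inc_1"]
```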
Event types
tool_call       → agent called a tool
tool.failure    → tool call failed (triggers error threshold detection)
run.step        → agent completed a step
run.fail        → agent run failed explicitly
run.timeout     → agent exceeded time limit
loop_detected   → manual loop signal
context_warning → context window nearing limit (triggers immediate incident)
Severity levels
info     → normal operation, logged only
warn     → something unusual, no alert
error    → failure detected, alert fires
critical → P0, immediate alert, escalation path

Pricing

Start free. Pay when it's saving you money.

Starter
Free
Forever, no card needed
  • 3 agents
  • 500 runs/month
  • Discord + webhook alerts
  • 7-day event retention
  • Heartbeat monitoring
Scale
$399/mo
For agencies and platforms
  • 100 agents
  • Unlimited runs
  • All alert channels
  • Unlimited retention
  • Courier File Bus included
  • Priority support