Most teams shipped an AI agent in 2024 or 2025. By mid-2026, 80% of enterprise applications now embed at least one AI agent. The same Gartner data shows that only 31% of enterprises have an agent in actual production. Meaning the gap between "we shipped one" and "it runs reliably" is enormous.
This is the 30-criteria checklist DPL runs on every recovery engagement. It is the same checklist that drives our free 10-question diagnostic. The public version tests the ten highest-leverage criteria. The full checklist below is what an agent has to clear to be considered production-ready in our model.
If your agent is missing more than eight of these, you are in recovery territory. We diagnose in 2 weeks for $5,000, fix in 4 weeks for $10,000, and operate from there at the standard $2,500/month retainer.
Observability. Six criteria
- Per-decision logs. Every agent decision, every tool call, every retry is logged with timestamp, input hash, output hash, and outcome. No exceptions. If you cannot answer "what did the agent do for user X at 3:47 a.m.?" in under 30 seconds, you fail.
- PII redaction in logs. Logs must redact identifiers (email, phone, account numbers) before they hit your observability platform. If your Datadog or Sentry has raw user emails, you have a compliance liability waiting to surface.
- Cost per task tracked. Every invocation logs token spend or platform-call cost against a per-task ceiling. Without per-task cost tracking, a single runaway loop can drain a month of budget in hours.
- Latency distribution. P50, P95, P99 latency per agent action, persisted for at least 30 days. Agents that work at P50 and fail at P99 are the worst kind of broken. Invisible until someone with a real workload notices.
- Failure mode categorization. Failures are tagged at the source. "tool-call returned 503", "LLM refused to comply", "output failed schema validation", "operator overrode the suggestion". Untagged failures stack up as a generic "error" bucket and you cannot fix what you cannot categorize.
- Live client visibility. A real-time channel (Slack Connect or equivalent) where the client sees every agent decision and operator intervention as it happens. Platforms hide what their agents do. Production-grade engagements show everything.
Retry + failure handling. Five criteria
- Exponential backoff on tool calls. Naive retries amplify outages. Backoff with jitter is non-negotiable.
- Dead-letter queue. Failed invocations after N retries land in a DLQ that a human can inspect and replay. Failures must not vanish.
- Per-workflow kill-switch. Operator can pause a workflow in production without redeploying code. Important for runaway-cost scenarios.
- Rollback to last-known-good in under 5 minutes. Prompts and tool definitions are versioned. Reverting a bad change is a one-command operation, not a code-archaeology session.
- Idempotency on writes. If the agent writes to your CRM and the retry runs, you do not get duplicate records. This is where most "the agent broke our customer data" incidents come from.
Drift detection. Three criteria
- Baseline eval suite. A fixed set of 20-50 representative inputs with expected outputs. Run before every prompt change. If the suite regresses, the change does not ship.
- Production sampling. A small percentage of real production outputs sampled into a human-review queue weekly. Drift surfaces here long before users complain.
- Model-version pinning. Your agent does not call "claude-3-5-sonnet". It calls "claude-3-5-sonnet-20250619". Vendor model updates do not silently change your behavior.
Security. Five criteria
- Prompt injection resistance. User-supplied inputs go through a sanitization layer. Outputs are guarded against unintended tool calls. Tested with a corpus of known injection patterns at deploy time.
- Least-privilege credentials. The agent runs with read-only or read-write credentials scoped to exactly the resources it needs. No "service account with admin." This is how a one-shot prompt-injection bug becomes a data incident.
- Output validation. Structured outputs validated against a schema (Zod, Pydantic, JSON Schema) before they reach a downstream tool. Hallucinated tool calls do not silently execute.
- Audit log retention. 90-day minimum retention on all decision logs, with PII redacted. Some industries need longer; this is the floor.
- Secret rotation drill. You have rotated the underlying API keys in the last 90 days and the agent did not break. If you have not run this drill, you do not know how brittle the deployment is.
Operations. Six criteria
- Documented runbook. A new operator can answer "what does this agent do, how does it fail, how do I pause it, how do I roll back" by reading a single document. Not a Notion wiki of seven documents.
- Escalation matrix. Every failure category routes to a named human or pager rotation. Silent failure to the end user is the worst outcome. Slow human reply is fine.
- SLA on escalation. "Operator picks up the case within X hours during business hours, within Y hours outside" is written somewhere and tracked.
- Operator UI. When an escalation lands, the operator has a console showing the full context. Input, agent reasoning, attempted actions, downstream state. Not "go check the logs."
- On-call coverage. Production agents have a defined on-call rotation. "We will look at it Monday morning" is not on-call.
- Weekly narrative report. A short written analysis of what ran, what broke, what improved. Numbers without narrative are insufficient.
Cost + economics. Three criteria
- Token budget per workflow. Each workflow has a documented monthly token budget. Exceeding it pages someone.
- Cost-per-outcome attribution. You can answer "what did this lead cost to qualify" or "what did this report cost to generate". Not "we spent $X on LLM tokens this month".
- Build-vs-buy math refreshed. Every quarter, the in-house alternative cost is recalculated. If the agent stops being the cheaper path, you should know within the quarter, not the year.
Compliance + contractual. Two criteria
- DPA signed. A Data Processing Agreement is in place with the underlying model vendor (Anthropic, OpenAI, Groq, etc.). For regulated industries, this is non-negotiable.
- Recovery plan documented. If the agent goes offline, the manual fallback is documented and the team has practiced it. Single-point-of-failure agents are an operational liability dressed as productivity.
What to do with this checklist
Score your agent. If you clear 27 or more, you are in the top tier. Congratulations, the gaps left are worth tracking but not crisis territory. If you clear 18-26, you have a single failure mode away from a 3 a.m. incident, and a recovery diagnosis pinpoints which one. If you clear fewer than 18, your agent is the profile that becomes the incident in the next 90 days, and the time to fix it is now.
Run the public 10-question version in under 2 minutes. It covers the highest-leverage criteria from the list above and gives you a triage score before you decide whether to engage with us or anyone else.
If the score comes back red, book a recovery diagnosis. $5,000 for the 30-criteria audit. Two weeks. Written report. No retainer commitment until you decide to engage on the fix.
