Operator escalation: when humans need to take over from the agent

The hardest decision in agent design is when the agent stops and a human takes over. Too early and you have not automated anything. Too late and the agent makes a decision a human should have made. This post documents the escalation matrix every DPL engagement ships with at pilot stage.

Four escalation triggers

Escalation is not a single threshold. It is four independent triggers, any of which routes the case to a human queue.

Confidence threshold: every agent decision carries a self-reported confidence score (or a derived one based on log-probs, retrieval relevance, or eval-suite proximity). Below a workflow-specific threshold (typically 0.65 for high-stakes, 0.45 for low-stakes), the case routes to operator review. The threshold is tunable post-launch as you accumulate operator decisions on the borderline cases.
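One way to derive a score from log-probs is sketched below. This is an illustration, not the method any particular engagement uses: it assumes the derived confidence is the geometric-mean token probability, and the names confidence_from_logprobs, THRESHOLDS, and below_threshold are hypothetical. The two threshold values are the typical ones quoted above.

```python
import math

# Typical thresholds quoted above; real values are set per workflow.
THRESHOLDS = {"high_stakes": 0.65, "low_stakes": 0.45}


def confidence_from_logprobs(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability (assumed derivation, range 0..1)."""
    if not token_logprobs:
        # No evidence at all: return the lowest score so the case escalates.
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def below_threshold(confidence: float, stakes: str) -> bool:
    """True when the case should route to operator review."""
    return confidence < THRESHOLDS[stakes]
```

The same score can route differently depending on stakes: a confidence of 0.5 escalates in a high-stakes workflow and passes in a low-stakes one.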

Sensitivity tags: certain input attributes auto-route regardless of confidence. Examples: a customer ticket from a Tier-1 account, an invoice over a dollar threshold, a contract with a non-standard clause, or any payment-related action. The tag list is workflow-specific and is written into the pilot scope document.

Cost ceiling: if a single agent decision is about to consume more than $X in tokens or trigger more than Y tool calls, escalate. This catches runaway loops and unusually complex requests that should get a human eye before the agent commits to anything.

Sentiment triggers: in any workflow that touches direct customer communication, a sentiment classifier runs on the conversation. Negative sentiment crossing a threshold routes the case to a human reply, not an agent-drafted one. The thresholds are tuned per workflow.
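The four triggers can be sketched as independent checks that each contribute a reason to escalate. This is a minimal sketch under stated assumptions: the Case and Policy shapes, the tag names, and every numeric limit except the 0.65 confidence threshold are hypothetical placeholders, not values from a real engagement.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Case:
    confidence: float                      # self-reported or derived, 0..1
    tags: set[str] = field(default_factory=set)
    projected_cost_usd: float = 0.0
    projected_tool_calls: int = 0
    negative_sentiment: float = 0.0        # 0..1, from a sentiment classifier


@dataclass
class Policy:
    min_confidence: float = 0.65           # typical high-stakes threshold
    sensitive_tags: frozenset[str] = frozenset({"tier1_account", "payment_action"})
    max_cost_usd: float = 2.00             # hypothetical stand-in for $X
    max_tool_calls: int = 20               # hypothetical stand-in for Y
    # None disables the check for workflows with no customer communication.
    max_negative_sentiment: Optional[float] = 0.7


def fired_triggers(case: Case, policy: Policy) -> list[str]:
    """Return the name of every trigger that fires; any one is enough to escalate."""
    fired = []
    if case.confidence < policy.min_confidence:
        fired.append("confidence")
    if case.tags & policy.sensitive_tags:
        fired.append("sensitivity")
    if (case.projected_cost_usd > policy.max_cost_usd
            or case.projected_tool_calls > policy.max_tool_calls):
        fired.append("cost")
    if (policy.max_negative_sentiment is not None
            and case.negative_sentiment > policy.max_negative_sentiment):
        fired.append("sentiment")
    return fired


def should_escalate(case: Case, policy: Policy) -> bool:
    return bool(fired_triggers(case, policy))
```

Returning the list of fired triggers rather than a bare boolean keeps the reason attached to the case when it reaches the queue.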

The escalation queue

When any trigger fires, the case lands in an operator queue. The queue is not a Slack channel (operators miss things) or an email (no SLA). It is a structured workflow with three properties.

Acknowledged-within-N-minutes. The SLA is documented per workflow. Typical: 30-60 minutes during business hours, 4-12 hours outside business hours. Cases that exceed the acknowledgment SLA auto-escalate to a named on-call person.
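A sketch of the acknowledgment check, assuming a Monday-to-Friday, 09:00-17:00 definition of business hours and SLA values picked from inside the typical ranges above. The names ACK_SLA, ack_deadline, and needs_on_call_page are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical per-workflow values inside the typical ranges quoted above.
ACK_SLA = {
    "business_hours": timedelta(minutes=30),
    "after_hours": timedelta(hours=4),
}


def is_business_hours(ts: datetime) -> bool:
    # Assumption: Monday-Friday, 09:00-17:00 in the timestamp's own timezone.
    return ts.weekday() < 5 and 9 <= ts.hour < 17


def ack_deadline(enqueued_at: datetime) -> datetime:
    """Deadline is fixed by the window the case was enqueued in."""
    window = "business_hours" if is_business_hours(enqueued_at) else "after_hours"
    return enqueued_at + ACK_SLA[window]


def needs_on_call_page(enqueued_at: datetime,
                       acknowledged_at: Optional[datetime],
                       now: datetime) -> bool:
    """True when the case is still unacknowledged past its deadline."""
    if acknowledged_at is not None:
        return False
    return now > ack_deadline(enqueued_at)
```

A scheduler would run needs_on_call_page over every open case and page the named on-call person for each one that returns True.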

Operator-decides-and-records. The operator does not just respond to the user. They also record the decision back into the agent's training data, so that future similar cases benefit from the human label.

Closed-with-attribution. Every escalation gets closed with a one-line reason. Categories include: false-positive escalation (agent could have handled), agent-mistake-caught (agent would have erred), edge-case-confirmed (operator confirmed a new pattern), policy-override (operator chose differently than policy suggested).
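The closure record can be sketched as a small typed structure that refuses to close without a one-line reason. The four categories are the ones named above; since the text says "categories include", a real workflow may define more. The Closure class and reason_counts helper are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class CloseReason(Enum):
    FALSE_POSITIVE = "false-positive escalation"
    AGENT_MISTAKE_CAUGHT = "agent-mistake-caught"
    EDGE_CASE_CONFIRMED = "edge-case-confirmed"
    POLICY_OVERRIDE = "policy-override"


@dataclass(frozen=True)
class Closure:
    case_id: str
    operator: str
    reason: CloseReason
    note: str  # the one-line reason written by the operator

    def __post_init__(self) -> None:
        text = self.note.strip()
        if not text:
            raise ValueError("an escalation cannot be closed without a reason")
        if "\n" in text:
            raise ValueError("the closing reason must fit on one line")


def reason_counts(closures: list[Closure]) -> Counter:
    """Tally closures by category, e.g. for a weekly review."""
    return Counter(c.reason for c in closures)
```

A rising share of FALSE_POSITIVE closures suggests thresholds that are too aggressive; a rising share of AGENT_MISTAKE_CAUGHT suggests the escalation is doing its job.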

The escalation budget

A well-tuned agent escalates 3-12% of cases depending on workflow stakes. Below 3% you are probably under-escalating and missing the edge cases the agent should be flagging. Above 12% the agent is probably under-trained or your thresholds are too aggressive.

The escalation rate is reviewed weekly. Trending up means drift; trending down means either the agent is improving or the thresholds have become too lax. Neither direction is automatically good or bad; the operator notes provide the context.
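The weekly review reduces to a little arithmetic, sketched here. The 3% and 12% bounds come from the budget above; the function names and the one-percentage-point tolerance used to call a trend flat are assumptions made for illustration.

```python
def escalation_rate(escalated: int, total: int) -> float:
    """Fraction of cases that were routed to an operator."""
    if total <= 0:
        raise ValueError("total must be positive")
    return escalated / total


def budget_status(rate: float, low: float = 0.03, high: float = 0.12) -> str:
    """Place a rate against the 3-12% budget band."""
    if rate < low:
        return "under-escalating"
    if rate > high:
        return "over-escalating"
    return "within budget"


def weekly_trend(rates: list[float], tolerance: float = 0.01) -> str:
    """Compare the latest week with the first week in the window.

    tolerance is a hypothetical dead band so noise is not read as a trend.
    """
    if len(rates) < 2:
        return "flat"
    delta = rates[-1] - rates[0]
    if delta > tolerance:
        return "up"
    if delta < -tolerance:
        return "down"
    return "flat"
```

For example, 45 escalations out of 1,000 cases is a rate of 4.5%, which sits inside the band. The trend label only says which direction the rate moved; the operator notes still have to explain why.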

What operators actually do during escalation

Three patterns account for 95% of escalation work.

Direct decision: the operator looks at the case, makes the decision the agent could not, and executes it through the same tools the agent would have used. Time per case: 2-8 minutes for typical workflows.

Disambiguation: the operator asks the user one clarifying question, then either takes the decision or hands back to the agent with the clarification appended. Time per case: 4-10 minutes including the back-and-forth.

Policy update: the operator decides the current escalation pattern indicates a policy gap. They write a short update to the workflow's policy doc and the next eval-suite update incorporates it. Time: variable, but the policy update applies forward to all similar future cases.

The SLA you commit to in writing

Every DPL retainer documents three numbers per workflow.

Acknowledgment SLA: when an escalation enters the queue, how long before an operator picks it up.

Resolution SLA: how long from acknowledgment to the user-visible outcome.

Escalation-of-escalation SLA: if the assigned operator does not acknowledge in N minutes, who gets paged and how.

These are not nice-to-have; they are the contract. A workflow without written SLAs on these three numbers is not a production deployment.
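One way to enforce that rule is to make the three numbers a typed record and refuse to mark a workflow production-ready while any is missing. This is a sketch only; the WorkflowSLA class, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class WorkflowSLA:
    acknowledgment: Optional[timedelta] = None   # queue entry -> operator pickup
    resolution: Optional[timedelta] = None       # pickup -> user-visible outcome
    page_after: Optional[timedelta] = None       # unacknowledged for this long -> page
    page_target: Optional[str] = None            # who gets paged

    def missing(self) -> list[str]:
        """Names of the fields that have not been written down yet."""
        fields = {
            "acknowledgment": self.acknowledgment,
            "resolution": self.resolution,
            "page_after": self.page_after,
            "page_target": self.page_target,
        }
        return [name for name, value in fields.items() if value is None]

    def production_ready(self) -> bool:
        return not self.missing()


# Example values are placeholders, not recommendations.
draft = WorkflowSLA(acknowledgment=timedelta(minutes=30))
complete = WorkflowSLA(
    acknowledgment=timedelta(minutes=30),
    resolution=timedelta(hours=4),
    page_after=timedelta(minutes=45),
    page_target="on-call-lead",
)
```

A deployment script could call production_ready() and stop if it returns False, listing missing() in the error.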

The takeaway

The operator backstop is not a fallback. It is a designed feature that catches what the agent should not handle alone. Every DPL engagement gets the escalation matrix written before code ships. If you have a production agent without one, a recovery diagnosis scopes it in week 1 of the audit.