Why Your AI Agent Needs a Human Safety Net (And How to Build One)

The Problem with AI Agents Nobody’s Talking About

McDonald’s launched AI-powered drive-thru ordering across multiple locations. The technology was impressive — it could understand speech, process orders, and integrate with point-of-sale systems. Within months, the trial was quietly shut down.

The problem wasn’t technical capability. The AI confidently processed orders for bacon-topped ice cream. It added hundreds of McNuggets to orders without question. It served wrong items with complete certainty.

An experienced human would naturally flag these situations: “Did you really want 260 McNuggets?” The AI simply processed them, confident in every decision it made.

This is the gap causing AI agent deployments to fail across industries. Traditional machine learning could tell you “I’m 85% confident this is a phishing email.” You could set thresholds: above 85%, automate; below 85%, human reviews. This simple mechanism enabled our system at a global market research company to save several million pounds annually—we knew exactly when to trust the model and when not to.

AI agents powered by LLMs don’t work that way. Ask an LLM “how confident are you?” and it’ll give you an answer. But that answer is generated text, not a calibrated probability. It’s a plausible-sounding statement that could be completely wrong.

Without confidence measurement, you’re either over-investing in unnecessary human review (reducing ROI) or under-investing and discovering problems in production (destroying customer trust).

The issue isn’t that LLMs aren’t capable. It’s that they lack the built-in uncertainty signals that traditional ML provides, and most teams are deploying them without building those signals in.

This post shows you how to fix that—with practical strategies you can implement this week.

How machine learning handles uncertainty

Traditional ML systems provide something beautifully simple: a confidence score for every decision.

When I built a system to classify grocery products at a global market research company, we needed to assign thousands of products to their correct brands. Getting Coca-Cola classified as Pepsi could misattribute millions in sales to rival brands. Accuracy mattered.

But we knew the system wouldn’t always be certain. So we designed it with a straightforward pattern:

  1. The model made a prediction with a confidence score (e.g., 92% confident this is Coca-Cola)
  2. We measured the confidence level at which the model’s accuracy exceeded human performance
  3. Above that threshold (85%), we auto-classified
  4. Below it, a human expert reviewed the prediction
  5. Human decisions fed back into the model to improve it over time

This gave us the best of both worlds: automation where the model was confident, human expertise where it wasn’t. It led to several million pounds in cost savings precisely because we knew when to trust the model and when not to.

The business logic was simple: if you can’t measure confidence, you can’t automate safely. If you can measure it, you can build guardrails that deliver both speed and accuracy.

This is good engineering: understanding your system’s limitations and designing around them.
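
To make the pattern concrete, here’s a minimal sketch of that thresholded routing logic in Python. The Classification type, the 0.85 threshold, and the function names are illustrative stand-ins, not the production system we built:

```python
from dataclasses import dataclass

# Illustrative threshold: in our case it was set where model accuracy
# exceeded human performance; your calibration data will give a different number.
CONFIDENCE_THRESHOLD = 0.85

@dataclass
class Classification:
    label: str
    confidence: float  # calibrated probability from the classifier

def route(prediction: Classification) -> str:
    """Auto-accept confident predictions; send the rest to a human reviewer."""
    if prediction.confidence >= CONFIDENCE_THRESHOLD:
        return f"auto-classified as {prediction.label}"
    return f"queued for human review (model suggested {prediction.label})"

# One confident prediction, one borderline prediction
print(route(Classification("Coca-Cola", 0.92)))  # auto-classified
print(route(Classification("Pepsi", 0.61)))      # queued for human review
```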

The LLM uncertainty problem

LLMs generate text token by token, predicting the next most likely word. They do have internal confidence scores for each token, but there’s a fundamental problem:

Token-level confidence doesn’t translate to decision-level confidence.

Knowing the model is 95% confident about the next word tells you nothing about whether the entire decision is correct. Here’s why this breaks traditional approaches:

  1. Many LLM APIs don’t expose token probabilities by default. You typically get text back, not probability distributions, so there’s often no practical way to check the internal confidence.
  2. Token confidence isn’t calibrated for your specific business decisions. The model’s confidence about generating grammatically correct text isn’t the same as confidence about “should I process this refund?”
  3. If you ask the LLM “how confident are you?”, it just generates more text. That answer isn’t a probability—it’s a plausible-sounding statement that could be completely uncalibrated.
  4. Multiple valid outputs eliminate simple thresholds. For “schedule a meeting tomorrow afternoon,” valid times might be 2pm, 3pm, or 4pm. The LLM might assign 40% to 2pm, 35% to 3pm, 25% to 4pm. None individually hit a “high confidence” threshold, yet all are correct.
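
To make the last point concrete, here’s a toy calculation using the hypothetical probabilities from the meeting example. No individual answer clears a naive “high confidence” bar, yet the probability mass on the decision itself is essentially 1.0:

```python
# Hypothetical distribution over answers to "schedule a meeting tomorrow afternoon"
answer_probabilities = {"2pm": 0.40, "3pm": 0.35, "4pm": 0.25}

# Any afternoon slot satisfies the user's request
valid_afternoon_slots = {"2pm", "3pm", "4pm"}

# No single answer clears a naive 0.8 "high confidence" threshold...
top_answer, top_prob = max(answer_probabilities.items(), key=lambda kv: kv[1])
print(top_answer, top_prob)  # 2pm 0.4

# ...but the probability mass on the decision "offer an afternoon slot"
# sums to 1.0 (up to floating point), so the decision is actually safe.
decision_confidence = sum(
    p for answer, p in answer_probabilities.items()
    if answer in valid_afternoon_slots
)
print(decision_confidence)
```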

Why This Matters for Agents

An agent’s fundamental purpose is to carry out tasks autonomously. When you give an agent the ability to take actions—creating support tickets, processing refunds, sending emails—you’re asking it to make decisions without knowing how certain it is.

This is fine for low-stakes decisions. If the user wants “a good recipe for chicken,” there’s no absolute right or wrong answer. If the agent’s suggestion doesn’t align perfectly, no one is harmed.

But high-stakes decisions are different entirely.

Real Example: Education Agent Uncertainty

I recently worked on an AI system for an education platform where students could ask questions about books they were reading. The agent needed to:

  • Generate appropriate questions based on the book content
  • Evaluate whether student answers were correct
  • Provide encouraging, accurate feedback

Uncertainty propagates through every step. We needed to ensure we graded students’ answers correctly and that the feedback aligned with their responses, the content, and the curriculum.

To counter this, we built a multi-agent system where different specialists handled different aspects. We created an evaluation pipeline where one agent played the student (providing various types of responses) while another provided feedback. This let us measure accuracy and identify where the agent was uncertain—by investigating test cases where feedback was deemed inaccurate.

We only deployed when the agent met our success criteria. This is the engineering discipline that’s missing from most AI agent deployments.


🍔 What Went Wrong at McDonald’s

The Core Issue: The AI couldn’t distinguish unusual orders from normal ones.

What Happened:

  • Bacon added to ice cream (wrong item detection)
  • 260 McNuggets processed without question (quantity validation)
  • Incorrect orders served confidently (no escalation mechanism)

The Root Cause: Whether from speech recognition errors or genuine requests, the AI lacked the context to question unusual orders. It had no framework to recognize “this seems odd—I should check.”

The Business Impact: Trial shut down, investment written off, brand embarrassment in national press.

The Engineering Lesson: Humans naturally understand what’s normal. They question outliers. AI agents need this explicitly built in through confidence measurement and escalation paths—it doesn’t emerge automatically from better models.


The pattern I keep seeing

AI has made it easier than ever to develop and deploy AI-enabled functionality. I’ve personally worked on projects with immense pressure to deliver quickly, where AI-enabled coding agents let software be built rapidly (so-called “vibe coding”).

However, across multiple projects and conversations with other teams, I see the same pattern:

  1. Team builds an impressive agent with great capabilities
  2. They test it on known scenarios and it works brilliantly
  3. Business gets excited and they deploy it
  4. Then they discover the edge cases where the agent makes confidently wrong decisions
  5. By then it’s caused business problems—wrong decisions made, customer trust damaged, cleanup required

I can now predict which companies will have production issues based on one question: “How do you know when your agent is uncertain?”

If they can’t answer clearly, they’re not ready to deploy.

The issue isn’t that LLMs aren’t capable. It’s that they lack the broad experiential context humans naturally have, so they need explicit protocols in place to handle the situations where they would otherwise make confident but incorrect decisions.

Without confidence scoring and a framework to manage uncertainty, it’s impossible to know what the boundary conditions are until you hit them in production.

Strategies for Managing Agent Uncertainty

Without confidence scores, how do you know when to let your agent act autonomously versus when to pause for human review?

From my experience building agentic AI systems that use LLMs, the answer isn’t a single threshold or blanket rule. Instead, successful production systems use different autonomy strategies depending on risk, context, and deployment maturity.

These aren’t mutually exclusive. Real production systems typically combine multiple approaches—for example, using bounded autonomy to categorise risk levels, confidence-led autonomy for intelligent decision-making within safe boundaries, and progressive autonomy when deploying new capabilities.

Quick Assessment: If you’re early in your AI agent journey, start with Constrained or Progressive Autonomy. If you’re scaling production systems, combine Bounded Autonomy with Confidence-Led approaches.

Here’s how to choose the right approach for your situation:

| Strategy | Best Used When | Pros | Cons | Reliability Characteristics |
| --- | --- | --- | --- | --- |
| Constrained Autonomy | High-stakes decisions, new deployments, regulated environments | Simple to implement; clear escalation paths; minimises risk | Can create bottlenecks; may frustrate users with delays; requires available human expertise | High reliability through human oversight, but limited by human availability and response time |
| Bounded Autonomy | Agents operating across varied risk levels | Balances automation with safety; allows low-risk automation; clear risk tiers | Requires careful risk classification upfront; can be inflexible; edge cases may fall between categories | Good reliability if risk classification is accurate; depends on correctly identifying high-risk actions |
| Contextual Autonomy | Risk varies significantly by context (user history, time, domain) | Provides flexibility whilst maintaining safety; improves user experience for trusted contexts; adapts to situational risk | Complex to design and maintain; requires robust context tracking; can appear inconsistent to users; risk of exploitation if context signals are gamed | Variable reliability depending on accuracy of context assessment; requires careful monitoring to detect when context rules are insufficient |
| Progressive Autonomy | Deploying new capabilities; building trust gradually | Builds confidence through proven track record; collects real-world performance data; allows course correction | Slow path to full autonomy; requires sustained monitoring effort; resource-intensive early phases | Excellent reliability that improves over time; reliability is measured and proven at each phase |
| Confidence-Led Autonomy | High-volume operations needing both automation and accuracy | Enables intelligent automated decision-making; scales well; handles edge cases without blanket rules | Technical complexity; additional computational cost (especially multiple sampling/committees); requires calibration phase; confidence scores can be misleading | Variable reliability depending on quality of confidence estimation; requires ongoing calibration; works best combined with other strategies |

Constrained Autonomy

When to use: High-stakes decisions, new deployments, unclear situations

Define clear criteria for when an agent should escalate to a human: first X interactions with any new user, any decision with significant business impact, situations outside training domain, or when multiple decisions seem equally valid.

Example: In the education agent project, we needed to detect whether a student raised sensitive topics that might indicate they needed professional support. A dedicated model detected these cases and flagged them to the student’s teacher for follow-up.
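
As a rough sketch of what those escalation criteria can look like in code, where every threshold and signal is an assumption rather than a prescription:

```python
def should_escalate(interaction_count: int,
                    business_impact: str,
                    in_domain: bool,
                    equally_valid_options: int) -> bool:
    """Illustrative escalation rules; all thresholds here are assumptions."""
    if interaction_count <= 5:          # first few interactions with a new user
        return True
    if business_impact == "high":       # significant business impact
        return True
    if not in_domain:                   # situation outside the training domain
        return True
    if equally_valid_options > 1:       # no single decision is clearly best
        return True
    return False

print(should_escalate(50, "low", True, 1))   # False: routine, in-domain, low impact
print(should_escalate(50, "high", True, 1))  # True: business impact forces review
```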

Bounded Autonomy

When to use: Agents taking actions in constrained domains

Give agents different permission levels based on risk:

  • Can do automatically (low risk): search databases, summarise information
  • Can recommend but not execute (medium risk): create support tickets
  • Must escalate (high risk): anything involving money, data deletion

Example: Customer service agent can answer FAQs automatically, suggest account changes for approval, but must escalate any refund requests over £100.
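
Here’s a minimal sketch of that kind of permission tiering; the action names and the £100 limit mirror the example above rather than any real system:

```python
# Illustrative action tiers for a customer service agent
AUTO_ALLOWED = {"search_knowledge_base", "answer_faq", "summarise_information"}
RECOMMEND_ONLY = {"create_support_ticket", "suggest_account_change"}

def authorise(action: str, amount_gbp: float = 0.0) -> str:
    if action == "issue_refund":
        # Anything involving money never runs unattended; larger amounts escalate
        return "recommend_for_approval" if amount_gbp <= 100 else "escalate_to_human"
    if action in AUTO_ALLOWED:
        return "execute"
    if action in RECOMMEND_ONLY:
        return "recommend_for_approval"
    return "escalate_to_human"          # default-deny for unclassified actions

print(authorise("answer_faq"))           # execute
print(authorise("issue_refund", 250.0))  # escalate_to_human
```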

Contextual Autonomy

When to use: When context significantly affects risk

Adjust autonomy levels based on user history, time sensitivity, or domain. A refund request from a 5-year customer might auto-approve up to £100, whilst a new customer’s request always escalates.

Example: Financial transactions processed automatically during business hours with additional fraud checks, but flagged for manual review outside normal hours when fraud risk increases.
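
A small sketch of context-adjusted routing follows, with illustrative limits and business hours rather than recommended values:

```python
from datetime import datetime

def route_refund(amount_gbp: float, customer_tenure_years: float,
                 now: datetime) -> str:
    """Context-adjusted routing; the limits and hours are illustrative."""
    in_business_hours = now.weekday() < 5 and 9 <= now.hour < 17
    if not in_business_hours:
        return "manual_review"      # fraud risk rises outside normal hours
    if customer_tenure_years >= 5 and amount_gbp <= 100:
        return "auto_approve"       # long-standing customer, small amount
    if customer_tenure_years < 1:
        return "escalate"           # new customers always get a human check
    return "manual_review"

print(route_refund(80, 6, datetime(2025, 3, 4, 11, 0)))  # auto_approve
print(route_refund(80, 6, datetime(2025, 3, 4, 23, 0)))  # manual_review
```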

Progressive Autonomy

When to use: When deploying new agent capabilities

Start with high human oversight, gradually reduce as confidence grows:

  • Phase 1 - Agent suggests, human reviews every decision
  • Phase 2 - Agent acts, human reviews sample + edge cases
  • Phase 3 - Agent acts autonomously, human reviews exceptions
  • Phase 4 - Full autonomy with monitoring

Collect data and measure performance at each phase. Only progress when you’ve demonstrated reliability. This is how I approach deployment in high-stakes situations. You earn autonomy through demonstrated reliability, not by assuming it will work.
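
One way to encode a phased rollout is as explicit configuration plus a review rule; the review rates below are assumptions chosen to show the shape of the approach, not fixed numbers:

```python
import random

# Illustrative rollout phases; tune review rates and promotion criteria
# from your own measured performance data.
PHASES = {
    1: {"agent_executes": False, "review_rate": 1.00},  # human reviews everything
    2: {"agent_executes": True,  "review_rate": 0.20},  # spot-check a sample
    3: {"agent_executes": True,  "review_rate": 0.00},  # exceptions only
    4: {"agent_executes": True,  "review_rate": 0.00},  # full autonomy + monitoring
}

def needs_human_review(phase: int, is_exception: bool) -> bool:
    """Sample-based review in early phases; exceptions are always reviewed."""
    if is_exception:
        return True
    return random.random() < PHASES[phase]["review_rate"]

print(needs_human_review(1, is_exception=False))  # always True in Phase 1
print(needs_human_review(3, is_exception=True))   # exceptions still reviewed
```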

Confidence-Led Autonomy

When to use: High-volume operations needing automation with accuracy

While LLMs don’t give true confidence scores, there are various approaches to estimating confidence:

  • Multiple sampling: Run the same query multiple times. High agreement indicates confidence, disagreement signals uncertainty.
  • Committee approach: Combine decisions from multiple models. High agreement indicates confidence.
  • Token-level confidence: Examine probabilities at critical positions (function names, parameters). Low probabilities trigger review.
  • LLM-as-judge: Use an LLM to assess whether a generated function call is appropriate given user intent.

The computational cost matters. Multiple sampling and committee approaches can increase API costs 3-10x. For high-volume, low-stakes decisions, simpler token-level confidence might suffice. For rare but critical decisions, the extra cost of ensemble methods is justified.
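
For illustration, here’s a minimal sketch of the multiple-sampling approach; call_agent, the sample count, and the 0.8 agreement threshold are placeholders you’d tune for your own system:

```python
from collections import Counter

def agreement_confidence(samples: list[str]) -> tuple[str, float]:
    """Proxy confidence: the share of samples that agree on the majority answer."""
    answer, votes = Counter(samples).most_common(1)[0]
    return answer, votes / len(samples)

def decide(call_agent, prompt: str, n: int = 5, threshold: float = 0.8) -> str:
    """Sample the agent n times; act only if agreement clears the threshold."""
    samples = [call_agent(prompt) for _ in range(n)]
    answer, confidence = agreement_confidence(samples)
    return answer if confidence >= threshold else "ESCALATE_TO_HUMAN"

# Quick check with a stub standing in for a real (temperature > 0) model call
stub = lambda prompt: "refund_approved"
print(decide(stub, "Should we refund order 123?"))  # refund_approved
```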

In Practice: Combining Strategies

Real production systems typically layer multiple approaches. Start with bounded autonomy to define what’s possible. Add confidence-led autonomy to make intelligent decisions within bounds. Use constrained autonomy rules for specific high-risk scenarios. Deploy using progressive autonomy to prove reliability before scaling.

Questions to Ask About Your AI Agent

Before deploying an AI agent, work through these questions with your team. If you can’t answer them clearly, you’re not ready for production.

1. What’s the cost of a wrong decision?

Consider all dimensions:

  • Money lost directly?
  • Customer trust damaged?
  • Regulatory risk?
  • Potential harm to people?
  • Damage to your reputation?

Be specific. “It would be bad” isn’t sufficient. “We could lose a £50k account” or “We could violate GDPR” helps you calibrate appropriate oversight.

2. How will you know when it makes a wrong decision?

  • Will users complain?
  • Will monitoring alerts fire?
  • Will you find out in periodic audits?
  • Will you never know? (Big red flag if this is your answer)

If you don’t have a clear answer, you’re not ready to deploy.

3. Can you detect uncertain situations before they become wrong decisions?

Think about:

  • Edge cases in your data (rare scenarios, unusual requests)
  • Unusual user behaviour patterns
  • Conflicting information in your knowledge base
  • Questions outside your domain

Can your system recognise these and escalate them? Or will it confidently guess and hope for the best?

4. What’s your escalation path?

Practical details matter:

  • Who reviews escalated cases?
  • How quickly do they respond?
  • What happens if they’re not available?
  • How do you feed their decisions back into the system?

An escalation path that routes to a queue nobody checks is worse than useless—it gives you false confidence.

The Risk vs Automation Trade-off

There’s no universal right answer. The appropriate level of automation depends on:

  1. Your risk tolerance: Regulated industries or high-stakes domains need more oversight
  2. Your scale: 10 decisions per day vs 10,000 changes the calculation
  3. Your capability: Do you have people available for review?
  4. Your learning: As the system proves reliable, you can increase autonomy

For the education agent, we’re comfortable with high autonomy for factual curriculum questions (thousands per day, low risk) but maintain human review for anything involving student emotional wellbeing (lower volume, higher stakes). The goal isn’t to eliminate human involvement—it’s to ensure human involvement happens where it matters most.

Decision flow: Choosing the right level of AI Agent Autonomy

Start here: What happens if the AI makes a wrong decision?

Before deploying any AI agent, work through this decision tree with your team. Your answers will tell you exactly what level of oversight you need. Allow 15-20 minutes to discuss each path honestly—the cost of getting this wrong far exceeds the time investment.


Path 1: Low Stakes

Wrong decision = minor inconvenience, easily reversible

Example: Suggested product recommendations, content summaries, search results

→ Recommended Approach: Full Autonomy with Monitoring

  • Let the AI operate independently
  • Track performance metrics
  • Review patterns weekly/monthly

Questions to ask your team:

  • “What metrics will we monitor?”
  • “How will we know if performance degrades?”

Path 2: Medium Stakes

Wrong decision = customer frustration, rework required, but no lasting damage

Example: Customer service responses, routine task automation, internal process decisions

Ask: Can we detect wrong decisions quickly?

If YES (immediate feedback available):

→ Recommended Approach: Start with Human Review, Earn Autonomy

  • Phase 1: AI suggests, human reviews every decision (2-4 weeks)
  • Phase 2: AI acts, human spot-checks 20% (4-8 weeks)
  • Phase 3: AI acts autonomously, human reviews exceptions only

Questions to ask your team:

  • “What’s our success rate target to move to Phase 2?”
  • “How long before we have enough data to be confident?”
  • “What triggers an exception review in Phase 3?”

If NO (delayed or unclear feedback):

→ Recommended Approach: Set Clear Boundaries

  • Define which actions AI can do automatically (low-risk subset)
  • Define which actions require human approval (everything else)

Example: AI can send standard acknowledgement emails automatically, but any email containing pricing, commitments, or policy changes requires approval

Questions to ask your team:

  • “What’s the complete list of actions this AI can take?”
  • “Which ones could cause problems if done incorrectly?”
  • “Can we categorise these by risk level?”

Path 3: High Stakes

Wrong decision = business risk, potential harm, regulatory issues, significant financial impact

Example: Financial transactions, medical advice, legal commitments, data deletion

Ask: Does the risk level change based on context?

If YES (context matters significantly):

→ Recommended Approach: Risk-Adjusted Oversight

  • Define high-risk vs. low-risk contexts upfront
  • Example: Refund requests under £50 from customers with 2+ year history = automatic. New customers or amounts over £50 = human review
  • Monitor for context signals that suggest higher scrutiny needed

Questions to ask your team:

  • “What contexts make this decision lower risk?”
  • “What warning signs should trigger extra scrutiny?”
  • “How will we detect if someone tries to game the context rules?”

If NO (consistently high-risk):

→ Recommended Approach: Always Require Human Decision

  • AI gathers information and presents options with reasoning
  • Human makes the final call
  • AI might rank options by confidence/suitability

Questions to ask your team:

  • “What information does the AI provide to help humans decide?”
  • “How do we ensure the human reviewer has proper context?”
  • “What’s the escalation path if the reviewer is uncertain?”

Red Flags That Should Make You Pause

Before approving any AI agent deployment, watch for these warning signs:

“The AI is really accurate in testing” - but no plan for production monitoring

“We’ll add human oversight later” - if needed, it should be there from day one

“It’s low stakes” - but the team can’t articulate what happens when it goes wrong

“We can’t measure confidence yet” - but they want to deploy to high-stakes decisions

“Trust us, it’ll be fine” - without showing you the escalation paths and monitoring plan

The Bottom Line: More autonomy isn’t inherently better. The right answer depends on your specific situation. When in doubt, start with more oversight and earn autonomy through demonstrated reliability.


The Justified Perspective

The Industry’s Blind Spot

The rush to deploy autonomous agents is ignoring this fundamental challenge. Companies are building impressive demos but aren’t thinking through what happens when the agent makes a confidently wrong decision in production. I’ve seen the pattern too many times:

  1. Impressive capabilities demonstrated in controlled environments
  2. Deployment without adequate oversight mechanisms
  3. Edge cases discovered in production
  4. Business impact from confidently wrong decisions
  5. Retroactive addition of safety measures

The cost? Failed pilots, damaged customer relationships, wasted investment. McDonald’s likely spent millions on their drive-thru AI before shutting it down. How many other companies are quietly writing off similar investments?

It doesn’t have to be this way. As I said earlier, I can predict which companies will have production issues based on one question: “How do you know when your agent is uncertain?”

If they can’t answer clearly, they’re not ready to deploy.

What Good Looks Like

From my experience building production agent systems:

  1. Start with human oversight - Don’t apologise for it. It’s good engineering.
  2. Earn autonomy through demonstrated reliability - Measure performance, identify patterns, reduce oversight where it’s proven safe.
  3. Never remove oversight entirely for high-stakes decisions - Even at full autonomy, maintain escalation paths and monitoring.
  4. Build escalation paths from day one - Don’t treat them as failure modes—treat them as essential architecture.
  5. Log everything so you can learn from mistakes - You will make mistakes. Make sure you learn from them.

Before You Deploy

Ask yourself one critical question:

“When this system makes a wrong decision, will I know about it before it causes problems?”

If the answer is no, you’re not ready to deploy.

If you can’t answer the question at all, you haven’t thought about risk management enough yet.

The goal isn’t to prevent all errors—that’s impossible. The goal is to design systems that fail gracefully, escalate appropriately, and improve over time.

That’s what “Justified” means: every capability is justified by appropriate risk management, every autonomy level is earned through demonstrated reliability, and every deployment decision is made with full awareness of limitations.

AI agents are the future. But the future needs better engineering around uncertainty, not blind faith in autonomy.


Ready to Build Your AI Agent the Right Way?

Before you deploy, let’s talk about the questions this article raised: How will you measure uncertainty? What’s your escalation path? How do you know when your agent shouldn’t act autonomously?

Talk to us and we’ll work through:

  • Your specific use case and risk profile
  • Which autonomy strategy fits your situation
  • How to build confidence measurement from day one
  • What “good” looks like for your deployment

We’ll help you think through risk management before you deploy, not after you’ve discovered problems in production.
