

McDonald’s launched AI-powered drive-thru ordering across multiple locations. The technology was impressive — it could understand speech, process orders, and integrate with point-of-sale systems. Within months, the trial was quietly shut down.
The problem wasn’t technical capability. The AI confidently processed orders for bacon-topped ice cream. It added hundreds of McNuggets to orders without question. It served wrong items with complete certainty.
An experienced human would naturally flag these situations: “Did you really want 260 McNuggets?” The AI simply processed them, confident in every decision it made.
This is the gap causing AI agent deployments to fail across industries. Traditional machine learning could tell you “I’m 85% confident this is a phishing email.” You could set thresholds: above 85%, automate; below 85%, human reviews. This simple mechanism enabled our system at a global market research company to save several million pounds annually—we knew exactly when to trust the model and when not to.
AI agents powered by LLMs don’t work that way. Ask an LLM “how confident are you?” and it’ll give you an answer. But that answer is generated text, not a calibrated probability. It’s a plausible-sounding statement that could be completely wrong.
Without confidence measurement, you’re either over-investing in unnecessary human review (reducing ROI) or under-investing and discovering problems in production (destroying customer trust).
The issue isn’t that LLMs aren’t capable. It’s that they lack the built-in uncertainty signals that traditional ML provides, and most teams are deploying them without building those signals in.
This post shows you how to fix that—with practical strategies you can implement this week.
Traditional ML systems provide something beautifully simple: a confidence score for every decision.
When I built a system to classify grocery products at a global market research company, we needed to assign thousands of products to their correct brands. Getting Coca-Cola classified as Pepsi could misattribute millions in sales to rival brands. Accuracy mattered.
But we knew the system wouldn't always be certain. So we designed it with a straightforward pattern: when the model's confidence score cleared a set threshold, the classification was applied automatically; when it didn't, the product was routed to a human expert for review.
This gave us the best of both worlds: automation where we’re confident, human expertise where we’re not. It led to several million pounds in cost savings precisely because we knew when to trust the model and when not to.
The business logic was simple: if you can’t measure confidence, you can’t automate safely. If you can measure it, you can build guardrails that deliver both speed and accuracy.
This is good engineering: understanding your system’s limitations and designing around them.
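As a rough sketch of that pattern (the 0.85 threshold and the product examples are illustrative, not the values from our production system), the routing logic is only a few lines:

```python
# Minimal sketch of confidence-threshold routing; threshold and examples are illustrative.
AUTO_THRESHOLD = 0.85  # above this, trust the model; below it, ask a human

def route(confidence: float) -> str:
    """Return 'automate' when the classifier is confident enough, else 'human_review'."""
    return "automate" if confidence >= AUTO_THRESHOLD else "human_review"

# Example classifier outputs with calibrated confidence scores.
for brand, confidence in [("Coca-Cola", 0.97), ("Pepsi", 0.91), ("Own-brand cola", 0.62)]:
    print(f"{brand:<14} confidence={confidence:.2f} -> {route(confidence)}")
```

The whole approach rests on one property: the confidence score is calibrated, so a threshold means something.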
LLMs generate text token by token, predicting the next most likely word. They do have internal confidence scores for each token, but there’s a fundamental problem:
Knowing the model is 95% confident about the next word tells you nothing about whether the entire decision is correct. Token-level probabilities measure how likely the wording is; they don't add up to a calibrated, decision-level confidence score you can set a threshold against, which is exactly what the traditional approach depends on.
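To make that concrete, here's a toy illustration with invented log-probabilities: every token in a problematic reply can look individually likely, and aggregating those probabilities measures fluency, not whether the decision should have been questioned.

```python
import math

# Invented token log-probabilities for the reply "Adding 260 McNuggets to your order."
# Each token is highly likely given the previous ones; that says nothing about
# whether the order itself should have been questioned.
token_logprobs = {
    "Adding": -0.05, "260": -0.21, "McNuggets": -0.08,
    "to": -0.02, "your": -0.03, "order": -0.04,
}

for token, logprob in token_logprobs.items():
    print(f"{token:<10} p(next token) = {math.exp(logprob):.2f}")  # all between 0.81 and 0.98

sequence_prob = math.exp(sum(token_logprobs.values()))
print(f"sequence probability = {sequence_prob:.2f}")
# High token-level confidence throughout, yet nothing here measures whether
# 260 McNuggets is a sensible order.
```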
An agent’s fundamental purpose is to carry out tasks autonomously. When you give an agent the ability to take actions—creating support tickets, processing refunds, sending emails—you’re asking it to make decisions without knowing how certain it is.
This is fine for low-stakes decisions. If the user wants “a good recipe for chicken,” there’s no absolute right or wrong answer. If the agent’s suggestion doesn’t align perfectly, no one is harmed.
But high-stakes decisions are different entirely.
I recently worked on an AI system for an education platform where students could ask questions about books they were reading. The agent needed to grade students' answers correctly and give feedback that aligned with their response, the book's content, and the curriculum.
Uncertainty propagates through every one of those steps.
To counter this, we built a multi-agent system where different specialists handled different aspects. We created an evaluation pipeline where one agent played the student (providing various types of responses) while another provided feedback. This let us measure accuracy and identify where the agent was uncertain—by investigating test cases where feedback was deemed inaccurate.
We only deployed when the agent met our success criteria. This is the engineering discipline that’s missing from most AI agent deployments.
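The real pipeline was more involved, but its shape looked roughly like the sketch below. The prompts are placeholders and `call_llm` stands in for whichever model client you use; nothing here is the actual implementation.

```python
# Rough shape of the evaluation pipeline. The prompts are placeholders and
# `call_llm` is whatever client you inject; this is not the real implementation.
from typing import Callable

def evaluate_feedback_agent(
    call_llm: Callable[[str], str],
    questions: list[str],
    response_styles: tuple[str, ...] = ("correct", "partially correct", "off-topic", "confused"),
) -> float:
    """One agent plays the student, another gives feedback, a judge scores the feedback."""
    results = []
    for question in questions:
        for style in response_styles:
            student_answer = call_llm(
                f"You are a student. Answer this question in a {style} way:\n{question}"
            )
            feedback = call_llm(
                "Give curriculum-aligned feedback on this answer.\n"
                f"Question: {question}\nAnswer: {student_answer}"
            )
            verdict = call_llm(
                "Does this feedback accurately reflect the question and the answer? "
                "Reply ACCURATE or INACCURATE.\n"
                f"Question: {question}\nAnswer: {student_answer}\nFeedback: {feedback}"
            )
            results.append("INACCURATE" not in verdict.upper())
    return sum(results) / len(results)  # share of simulated cases judged accurate
```

The test cases that come back inaccurate are where you dig in: they show you where the agent is uncertain before a real student ever sees it.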
The Core Issue: The AI couldn’t distinguish unusual orders from normal ones.
What Happened: The system confidently processed orders for bacon-topped ice cream, added hundreds of McNuggets without question, and served the wrong items with complete certainty.
The Root Cause: Whether from speech recognition errors or genuine requests, the AI lacked the context to question unusual orders. It had no framework to recognize “this seems odd—I should check.”
The Business Impact: Trial shut down, investment written off, brand embarrassment in national press.
The Engineering Lesson: Humans naturally understand what’s normal. They question outliers. AI agents need this explicitly built in through confidence measurement and escalation paths—it doesn’t emerge automatically from better models.
AI tooling has made it easier than ever to develop and deploy AI-enabled functionality. I've personally worked on projects with immense pressure to deliver quickly, where AI coding agents let software be built at remarkable speed (so-called "vibe coding").
However, across multiple projects and conversations with other teams, I see the same pattern: impressive functionality shipped quickly, with no way of knowing when the agent is uncertain.
I can now predict which companies will have production issues based on one question: “How do you know when your agent is uncertain?”
If they can’t answer clearly, they’re not ready to deploy.
The issue isn't that LLMs aren't capable. It's that they lack the broad experiential context humans naturally have, so they need explicit protocols for handling the situations where they make confident but incorrect decisions.
Without confidence scoring and a framework to manage uncertainty, it’s impossible to know what the boundary conditions are until you hit them in production.
Without confidence scores, how do you know when to let your agent act autonomously versus when to pause for human review?
From my experience building agentic AI systems that use LLMs, the answer isn’t a single threshold or blanket rule. Instead, successful production systems use different autonomy strategies depending on risk, context, and deployment maturity.
These aren’t mutually exclusive. Real production systems typically combine multiple approaches—for example, using bounded autonomy to categorise risk levels, confidence-led autonomy for intelligent decision-making within safe boundaries, and progressive autonomy when deploying new capabilities.
Quick Assessment: If you’re early in your AI agent journey, start with Constrained or Progressive Autonomy. If you’re scaling production systems, combine Bounded Autonomy with Confidence-Led approaches.
Here’s how to choose the right approach for your situation:
| Strategy | Best Used When | Pros | Cons | Reliability Characteristics |
|---|---|---|---|---|
| Constrained Autonomy | High-stakes decisions, new deployments, regulated environments | Simple to implement; clear escalation paths; minimises risk | Can create bottlenecks; may frustrate users with delays; requires available human expertise | High reliability through human oversight, but limited by human availability and response time |
| Bounded Autonomy | Agents operating across varied risk levels | Balances automation with safety; allows low-risk automation; clear risk tiers | Requires careful risk classification upfront; can be inflexible; edge cases may fall between categories | Good reliability if risk classification is accurate; depends on correctly identifying high-risk actions |
| Contextual Autonomy | Risk varies significantly by context (user history, time, domain) | Provides flexibility whilst maintaining safety; improves user experience for trusted contexts; adapts to situational risk | Complex to design and maintain; requires robust context tracking; can appear inconsistent to users; risk of exploitation if context signals are gamed | Variable reliability depending on accuracy of context assessment; requires careful monitoring to detect when context rules are insufficient |
| Progressive Autonomy | Deploying new capabilities; building trust gradually | Builds confidence through proven track record; collects real-world performance data; allows course correction | Slow path to full autonomy; requires sustained monitoring effort; resource-intensive early phases | Excellent reliability that improves over time; reliability is measured and proven at each phase |
| Confidence-Led Autonomy | High-volume operations needing both automation and accuracy | Enables intelligent automated decision-making; scales well; handles edge cases without blanket rules | Technical complexity; additional computational cost (especially multiple sampling/committees); requires calibration phase; confidence scores can be misleading | Variable reliability depending on quality of confidence estimation; requires ongoing calibration; works best combined with other strategies |
When to use: High-stakes decisions, new deployments, unclear situations
Define clear criteria for when an agent should escalate to a human: first X interactions with any new user, any decision with significant business impact, situations outside training domain, or when multiple decisions seem equally valid.
Example: In the education agent project, we needed to detect if a student discussed sensitive topics that may indicate professional support was needed. These were detected by a specific model and raised with the student’s teacher to follow up.
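In code, constrained autonomy can be as simple as a checklist evaluated before every action. The sketch below mirrors the criteria above; the specific limits (five interactions, £1,000 of impact) are illustrative.

```python
# Illustrative escalation checklist for constrained autonomy; every limit here is an example.
from dataclasses import dataclass

@dataclass
class Interaction:
    user_interaction_count: int   # how many times we have seen this user before
    business_impact_gbp: float    # estimated financial impact of the decision
    in_training_domain: bool      # does the request look like something we designed for?
    equally_valid_options: int    # how many candidate decisions scored roughly the same

def must_escalate(i: Interaction) -> bool:
    """Escalate to a human when any constrained-autonomy rule trips."""
    return (
        i.user_interaction_count < 5       # first X interactions with a new user (X = 5 here)
        or i.business_impact_gbp >= 1_000  # significant business impact
        or not i.in_training_domain        # outside the training domain
        or i.equally_valid_options > 1     # multiple decisions seem equally valid
    )

print(must_escalate(Interaction(2, 50.0, True, 1)))    # True: still a new user
print(must_escalate(Interaction(40, 20.0, True, 1)))   # False: safe to act automatically
```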
When to use: Agents taking actions in constrained domains
Give agents different permission levels based on risk: Can do automatically (low risk like search databases, summarise information), Can recommend but not execute (medium risk like create support tickets), Must escalate (high risk like anything involving money, data deletion).
Example: Customer service agent can answer FAQs automatically, suggest account changes for approval, but must escalate any refund requests over £100.
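A sketch of those permission tiers might look like this; the action names are examples, and the £100 refund limit comes from the scenario above.

```python
# Illustrative permission tiers for bounded autonomy.
AUTO = "can_do_automatically"   # low risk
RECOMMEND = "recommend_only"    # medium risk: a human approves before execution
ESCALATE = "must_escalate"      # high risk: a human makes the decision

PERMISSIONS = {
    "search_knowledge_base": AUTO,
    "summarise_account_history": AUTO,
    "create_support_ticket": RECOMMEND,
    "suggest_account_change": RECOMMEND,
    "issue_refund": ESCALATE,
    "delete_customer_data": ESCALATE,
}

def permission_for(action: str, refund_amount_gbp: float = 0.0) -> str:
    """Look up the tier; unknown actions default to escalation."""
    if action == "issue_refund" and refund_amount_gbp <= 100:
        return RECOMMEND  # at or under £100: suggest for one-click approval
    return PERMISSIONS.get(action, ESCALATE)

print(permission_for("search_knowledge_base"))                 # can_do_automatically
print(permission_for("issue_refund", refund_amount_gbp=250))   # must_escalate
```

Defaulting unknown actions to escalation is the important design choice: edge cases that fall between categories fail safe rather than silently automating.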
When to use: When context significantly affects risk
Adjust autonomy levels based on user history, time sensitivity, or domain. A refund request from a 5-year customer might auto-approve up to £100, whilst a new customer’s request always escalates.
Example: Financial transactions processed automatically during business hours with additional fraud checks, but flagged for manual review outside normal hours when fraud risk increases.
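A contextual rule like that can be expressed directly; in the sketch below the tenure, amount, and business-hours values are illustrative.

```python
# Illustrative contextual-autonomy rule for refunds; tenure, amounts, and hours are examples.
from datetime import datetime

def refund_decision(amount_gbp: float, customer_tenure_years: float, now: datetime) -> str:
    """Adjust autonomy based on who is asking, how much, and when."""
    in_business_hours = now.weekday() < 5 and 9 <= now.hour < 17

    if customer_tenure_years < 1:
        return "escalate"                 # new customers always get a human
    if not in_business_hours:
        return "escalate"                 # fraud risk rises outside normal hours
    if amount_gbp <= 100 and customer_tenure_years >= 5:
        return "auto_approve"             # trusted customer, small amount
    return "recommend_for_approval"       # everything else gets human sign-off

print(refund_decision(80, customer_tenure_years=6, now=datetime(2025, 3, 4, 11, 0)))  # auto_approve
print(refund_decision(80, customer_tenure_years=6, now=datetime(2025, 3, 4, 23, 0)))  # escalate
```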
When to use: When deploying new agent capabilities
Start with high human oversight, gradually reduce as confidence grows: Phase 1 - Agent suggests, human reviews every decision. Phase 2 - Agent acts, human reviews sample + edge cases. Phase 3 - Agent acts autonomously, human reviews exceptions. Phase 4 - Full autonomy with monitoring.
Collect data and measure performance at each phase. Only progress when you’ve demonstrated reliability. This is how I approach deployment in high-stakes situations. You earn autonomy through demonstrated reliability, not by assuming it will work.
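One lightweight way to encode those phases is as configuration the deployment reads, with explicit, measured promotion criteria. The review rates and accuracy bars below are illustrative.

```python
# Illustrative phase configuration for progressive autonomy; rates and bars are examples.
PHASES = [
    {"name": "suggest_only",                  "review_rate": 1.00, "promote_at_accuracy": 0.95},
    {"name": "act_with_sampled_review",       "review_rate": 0.20, "promote_at_accuracy": 0.97},
    {"name": "act_review_exceptions",         "review_rate": 0.05, "promote_at_accuracy": 0.99},
    {"name": "full_autonomy_with_monitoring", "review_rate": 0.01, "promote_at_accuracy": None},
]

def next_phase(current: int, measured_accuracy: float) -> int:
    """Only move forward when measured accuracy clears the bar for the current phase."""
    bar = PHASES[current]["promote_at_accuracy"]
    if bar is not None and measured_accuracy >= bar and current + 1 < len(PHASES):
        return current + 1
    return current  # otherwise stay put (a demotion rule is a sensible extension)

print(PHASES[next_phase(0, measured_accuracy=0.96)]["name"])  # promoted to sampled review
print(PHASES[next_phase(1, measured_accuracy=0.90)]["name"])  # stays at sampled review
```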
When to use: High-volume operations needing automation with accuracy
While LLMs don't give true confidence scores, there are various approaches to estimating confidence: token-level log probabilities, sampling the same prompt multiple times and measuring agreement, committees of models cross-checking each other's answers, or a separate judge model scoring the output.
The computational cost matters. Multiple sampling and committee approaches can increase API costs 3-10x. For high-volume, low-stakes decisions, simpler token-level confidence might suffice. For rare but critical decisions, the extra cost of ensemble methods is justified.
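As one example of the sampling approach, here's a sketch of self-consistency voting: ask the same question several times at a non-zero temperature and treat agreement as a rough confidence proxy. The `ask_llm` callable is a placeholder for your own model client.

```python
# Sketch of self-consistency voting as a rough confidence proxy.
# `ask_llm` is a placeholder for your own model client, called at non-zero temperature.
import random
from collections import Counter
from typing import Callable

def confidence_by_agreement(ask_llm: Callable[[str], str], prompt: str,
                            n_samples: int = 5) -> tuple[str, float]:
    """Sample the same prompt n times; the agreement rate acts as a confidence estimate."""
    answers = [ask_llm(prompt).strip().lower() for _ in range(n_samples)]
    top_answer, votes = Counter(answers).most_common(1)[0]
    return top_answer, votes / n_samples

# Usage with a stub model that wavers between two answers.
stub = lambda _prompt: random.choice(["refund approved", "refund approved", "escalate"])
answer, confidence = confidence_by_agreement(stub, "Should this refund be approved?")
print(answer, confidence)  # e.g. ('refund approved', 0.8); note this costs 5 calls per decision
```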
Real production systems typically layer multiple approaches. Start with bounded autonomy to define what’s possible. Add confidence-led autonomy to make intelligent decisions within bounds. Use constrained autonomy rules for specific high-risk scenarios. Deploy using progressive autonomy to prove reliability before scaling.
Before deploying an AI agent, work through these questions with your team. If you can’t answer them clearly, you’re not ready for production.
Consider all dimensions: financial loss, regulatory exposure, damage to customer trust, and the operational cost of putting things right.
Be specific. “It would be bad” isn’t sufficient. “We could lose a £50k account” or “We could violate GDPR” helps you calibrate appropriate oversight.
If you don’t have a clear answer, you’re not ready to deploy.
Think about: ambiguous requests, inputs outside the domain the agent was built for, and cases where several responses seem equally valid.
Can your system recognise these and escalate them? Or will it confidently guess and hope for the best?
Practical details matter: who gets notified, how quickly they can respond, and what the agent does while the decision waits.
An escalation path that routes to a queue nobody checks is worse than useless—it gives you false confidence.
There's no universal right answer. The appropriate level of automation depends on: the stakes of each decision, how reversible a mistake is, the volume of decisions, and how quickly you can detect when something has gone wrong.
For the education agent, we’re comfortable with high autonomy for factual curriculum questions (thousands per day, low risk) but maintain human review for anything involving student emotional wellbeing (lower volume, higher stakes). The goal isn’t to eliminate human involvement—it’s to ensure human involvement happens where it matters most.
Before deploying any AI agent, work through this decision tree with your team. Your answers will tell you exactly what level of oversight you need. Allow 15-20 minutes to discuss each path honestly—the cost of getting this wrong far exceeds the time investment.
Wrong decision = minor inconvenience, easily reversible
Example: Suggested product recommendations, content summaries, search results
→ Recommended Approach: Full Autonomy with Monitoring
Questions to ask your team:
Wrong decision = customer frustration, rework required, but no lasting damage
Example: Customer service responses, routine task automation, internal process decisions
Ask: Can we detect wrong decisions quickly?
→ If yes, Recommended Approach: Start with Human Review, Earn Autonomy
Questions to ask your team:
→ If no, Recommended Approach: Set Clear Boundaries
Example: AI can send standard acknowledgement emails automatically, but any email containing pricing, commitments, or policy changes requires approval
Questions to ask your team:
Wrong decision = business risk, potential harm, regulatory issues, significant financial impact
Example: Financial transactions, medical advice, legal commitments, data deletion
Ask: Does the risk level change based on context?
→ If yes, Recommended Approach: Risk-Adjusted Oversight
Questions to ask your team:
→ If no, Recommended Approach: Always Require Human Decision
Questions to ask your team:
Before approving any AI agent deployment, watch for these warning signs:
❌ “The AI is really accurate in testing” - but no plan for production monitoring
❌ “We’ll add human oversight later” - if needed, it should be there from day one
❌ “It’s low stakes” - but the team can’t articulate what happens when it goes wrong
❌ “We can’t measure confidence yet” - but they want to deploy to high-stakes decisions
❌ “Trust us, it’ll be fine” - without showing you the escalation paths and monitoring plan
The Bottom Line: More autonomy isn’t inherently better. The right answer depends on your specific situation. When in doubt, start with more oversight and earn autonomy through demonstrated reliability.
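If it helps to make the tree concrete, here's a rough triage helper that encodes the paths above. It's a discussion aid rather than a formal risk tool, and the medium-stakes branching (quick error detection lets you earn autonomy; otherwise set boundaries) is my reading of the tree.

```python
# Rough encoding of the decision tree above; a discussion aid, not a risk framework.
def recommended_oversight(stakes: str, detect_errors_quickly: bool = False,
                          risk_varies_by_context: bool = False) -> str:
    """Map stakes ('low' | 'medium' | 'high') to a recommended oversight level."""
    if stakes == "low":
        return "Full autonomy with monitoring"
    if stakes == "medium":
        # One reading of the medium-stakes branch: quick error detection lets you earn autonomy.
        return ("Start with human review, earn autonomy"
                if detect_errors_quickly else "Set clear boundaries")
    if stakes == "high":
        return ("Risk-adjusted oversight"
                if risk_varies_by_context else "Always require a human decision")
    raise ValueError("stakes must be 'low', 'medium', or 'high'")

print(recommended_oversight("medium", detect_errors_quickly=True))
print(recommended_oversight("high", risk_varies_by_context=False))
```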
The rush to deploy autonomous agents is ignoring this fundamental challenge. Companies are building impressive demos but aren't thinking through what happens when the agent makes a confidently wrong decision in production. I've seen the pattern too many times: a polished demo, a rushed rollout, a confidently wrong decision nobody catches, and a quiet write-off.
The cost? Failed pilots, damaged customer relationships, wasted investment. McDonald’s likely spent millions on their drive-thru AI before shutting it down. How many other companies are quietly writing off similar investments?
It doesn’t have to be this way. I can now predict which companies will have production issues based on one question: “How do you know when your agent is uncertain?”
If they can’t answer clearly, they’re not ready to deploy.
From my experience building production agent systems:
Ask yourself one critical question:
“When this system makes a wrong decision, will I know about it before it causes problems?”
If the answer is no, you’re not ready to deploy.
If you can’t answer the question at all, you haven’t thought about risk management enough yet.
The goal isn’t to prevent all errors—that’s impossible. The goal is to design systems that fail gracefully, escalate appropriately, and improve over time.
That’s what “Justified” means: every capability is justified by appropriate risk management, every autonomy level is earned through demonstrated reliability, and every deployment decision is made with full awareness of limitations.
AI agents are the future. But the future needs better engineering around uncertainty, not blind faith in autonomy.
Before you deploy, let’s talk about the questions this article raised: How will you measure uncertainty? What’s your escalation path? How do you know when your agent shouldn’t act autonomously?
Talk to us and we’ll work through:
We’ll help you think through risk management before you deploy, not after you’ve discovered problems in production.
