Who Determines Done? Why Agentic AI Needs Escalation, Not More Loops
Last week I did something unusual. I published a requirements document for an experiment I had not run yet. The question was simple: where should judgment live in a local-first agentic AI system?
I ran the experiment. Ten milestones, two workloads, six local model configurations including Nemotron, Gemma, Qwen, and LLaMA variants, frontier escalation to o3 and gpt-5.5, all on a DGX Spark in my lab. Total API cost for the entire project: $4.60 across 156 calls, from the actual OpenAI billing export. Well, and the $4,000 I paid for the Spark :)
The answer was not what I expected. The result was not a better loop. It was a capability cascade with deterministic exit gates.
I tested several approaches for deciding when to trust the local model and when to escalate to a stronger model. The final design was a three-tier escalation chain that tries once, validates the result, and moves up only when needed. The first workload was RSS triage scored against my own editorial judgment. That exposed the failure modes but hit a ceiling — the answer key was subjective. The second workload was coding, where executable tests serve as the authority signal.
The coding workload had its own progression. I started with synthetic isolated tasks. Small, clean, well-specified. The local model passed everything on the first attempt. The loop never fired. The experiment was useless because the tasks were not hard enough to test loop control.
The real signal came from git archaeology. I extracted actual bug-fix commits from open source repos — httpx, more-itertools, click, requests, arrow, dateutil, prettytable, marshmallow, humanize. Real bugs that real developers filed PRs for. I rebuilt them as self-contained tasks with the original regression tests and feature tests. Real messy code, real regression risk, real complexity. That is where the loop architecture was actually tested.
BTW, how did I do all of this in four days? Claude Code. So, there is my disclaimer. This is directional. I did not go through every revision, check the code, and test to ensure accuracy.
Here is what I found.
The Industry Is Fixated on Loops
The agentic AI conversation right now is about loops. Tool-calling loops, ReAct loops, multi-agent loops, self-reflection loops. The assumption is that if you give a model tools and a loop, it will converge on the right answer. The loop is the innovation.
That assumption drove my experiment. I went looking for who should control the loop. Local model? Frontier model? Deterministic code? Some hybrid?
What I found is that the loop is plumbing. The interesting question is downstream of the loop.
Who Determines Done?
The question that actually matters is not who controls the loop. It is who determines done.
Deterministic code is surprisingly good at this. Tests pass or fail, schema validates or it does not, format checks, regression checks, diff-scope checks — these are machine-enforceable exit conditions. On my coding workload, deterministic validation accepted roughly half the tasks on the first local attempt across the broader extracted task set. In the harder subset I used for the final escalation comparison, the local pass rate was lower — 25% — because the easy tasks had already been filtered out during calibration. The model wrote code, the validator accepted it, done.
One important caveat: “done” only means as good as the authority signal. In coding, tests can define done with useful precision. In RSS triage, schema validity only proves the output is well-formed. It does not prove the routing decision is correct. That is why the architecture worked better once the workload moved from subjective editorial judgment to executable validation.
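A minimal sketch of a deterministic exit gate, and of why schema validity alone is a weak authority signal. The names here (`exit_gate`, `schema_check`, `Verdict`, the `route` and `reason` fields) are hypothetical illustrations, not the experiment's code.

```python
from dataclasses import dataclass
from typing import Callable

# A check takes the candidate output and returns (passed, detail).
Check = Callable[[dict], tuple[bool, str]]


@dataclass
class Verdict:
    done: bool
    failures: list[str]


def exit_gate(output: dict, checks: list[Check]) -> Verdict:
    """Run every deterministic check; 'done' means all of them passed."""
    failures = []
    for check in checks:
        passed, detail = check(output)
        if not passed:
            failures.append(detail)
    return Verdict(done=not failures, failures=failures)


# Hypothetical schema for a triage decision (field names are assumptions).
ALLOWED_ROUTES = {"save", "ignore"}


def schema_check(output: dict) -> tuple[bool, str]:
    if output.get("route") not in ALLOWED_ROUTES:
        return False, f"route must be one of {sorted(ALLOWED_ROUTES)}"
    if not isinstance(output.get("reason"), str):
        return False, "reason must be a string"
    return True, "schema ok"


# Well-formed output passes whether or not "ignore" was the right call.
verdict = exit_gate({"route": "ignore", "reason": "off topic"}, [schema_check])
print(verdict.done)  # True

# Malformed output is rejected, with a machine-readable reason.
malformed = exit_gate({"route": "maybe"}, [schema_check])
print(malformed.done, malformed.failures)
```

The gate only reports what its checks can see. Swapping `schema_check` for a check that runs the regression tests is what turns it into a useful authority for the coding workload.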
These are not synthetic difficulty labels. The tasks that pass trivially are real Python bugs from real PRs: a bytes “in” versus “==” check in h2, an iterator exhaustion bug in more-itertools, a type-check guard in PyJWT. Actual developers fixed these. The local model fixes them too, first try, every time. The tasks that fail at zero percent are also real: a parser state machine in httpx, a combinatorics algorithm in more-itertools, arithmetic reasoning in dateutil. Real developers struggled with these too.
The difficulty distribution turned out to be bimodal. Across 28 tasks extracted from 9 repos:
| Difficulty Band | Tasks | Share |
|---|---|---|
| Trivially easy (100% first-attempt pass) | ~13 | ~45% |
| Too hard for local model (0% first-attempt pass) | ~14 | ~50% |
| Medium — repair loop adds value | ~1–3 | ~5% |
That thin medium band matters. It means the repair loop’s useful range is vanishingly small. Most tasks either resolve immediately or require a fundamentally more capable model. More on that in a moment.
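One way to read the three bands is as a function of per-attempt results. The sketch below is an interpretation, not the experiment's scoring code: it assumes each task has a list of pass/fail outcomes, one per attempt, and every task ID and result shown is made up.

```python
from collections import Counter


def difficulty_band(attempt_passed: list[bool]) -> str:
    """Band a task from its per-attempt results (first entry is the first attempt)."""
    if attempt_passed[0]:
        return "trivial"    # resolved on the first local attempt
    if any(attempt_passed[1:]):
        return "medium"     # failed first, recovered with test feedback
    return "too_hard"       # never passed at this tier


# Hypothetical results for illustration only, not the experiment's data.
results = {
    "task-a": [True],
    "task-b": [False, True],
    "task-c": [False, False, False, False],
    "task-d": [False, False, False, False],
}

bands = Counter(difficulty_band(r) for r in results.values())
print(dict(bands))  # {'trivial': 1, 'medium': 1, 'too_hard': 2}
```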
Deterministic validation has one real blind spot: quiet failures. These are confident wrong answers that pass all structural checks. The model guesses, the validator sees nothing wrong, and the output looks clean. In the RSS triage workload, two items were routed incorrectly but passed every deterministic check. There was no schema failure, no proxy trigger, and no signal for the control loop to act on. That limitation is genuine. But it is the minority case, not the common one.
The validator is the unsung hero of this architecture. Not the loop.
Looping Harder Does Not Help. It Can’t.
The industry assumption is straightforward: if the model gets it wrong, feed the error back and let it try again. The loop will converge.
My data says it does not. And on further testing, I found out it cannot — at least not at temperature zero.
Near-miss tasks proved it cleanly. One task passes 6 of 7 feature tests on the first attempt. So close. Feed the failing test output back, try again. Same result. Try again. Same result. Four attempts, identical output every time. Another task passes 4 of 5 tests on the first attempt. Four attempts, identical output every time. At temperature zero, the model behaved deterministically even after error feedback was added. In practice, the repair loop converged to the same wrong answer.
Earlier milestones hinted at this more gently. The few tasks that did benefit from a repair loop converged on attempt two. Never attempt three or four. Those were cases where the model understood the fix conceptually but made an implementation error — a wrong character set in an RFC formula, a missing regression guard in a combinatorics function. One round of test feedback was enough. But that sweet spot turned out to be roughly 5% of tasks across 28 extractions from 9 repos. For everything else, the model either gets it right on the first attempt or it is stuck at a capability boundary that no amount of self-repair will cross.
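If a repair loop is kept at all, it can at least notice when it is stuck. The sketch below hashes each candidate and stops as soon as the worker repeats an answer that already failed. The function name, the statuses, and the stub worker and validator are all hypothetical.

```python
import hashlib
from typing import Callable

Worker = Callable[[str], str]           # prompt -> candidate code
Validator = Callable[[str], list[str]]  # candidate -> failing test messages


def repair_with_stall_check(task: str, worker: Worker, validate: Validator,
                            max_attempts: int = 4) -> tuple[str, int]:
    """Return (status, attempts_used); status is 'pass', 'stalled', or 'exhausted'."""
    seen: set[str] = set()
    prompt = task
    for attempt in range(1, max_attempts + 1):
        candidate = worker(prompt)
        failures = validate(candidate)
        if not failures:
            return "pass", attempt
        digest = hashlib.sha256(candidate.encode()).hexdigest()
        if digest in seen:
            # Same failing answer as an earlier attempt: stop and escalate.
            return "stalled", attempt
        seen.add(digest)
        # Feed the failing test output back, as in the repair loop described above.
        prompt = task + "\n\nFailing tests:\n" + "\n".join(failures)
    return "exhausted", max_attempts


# Hypothetical stand-ins: a worker that ignores feedback, a validator that rejects it.
stuck_worker = lambda prompt: "def fix(): return None"
always_fail = lambda candidate: ["test_empty_input failed"]

print(repair_with_stall_check("fix the bug", stuck_worker, always_fail))
# ('stalled', 2)
```

With a worker like the near-miss cases above, the loop exits on attempt two instead of burning all four.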
Universal review made it worse. On the RSS triage workload, I tested a frontier model as an always-on governor reviewing every local output. Accuracy dropped from 8/10 to 6/10. I tested a local critique pass on every output. Also dropped to 6/10. Both exhibited the same failure modes. Review Drift: the reviewer challenges correct decisions, introducing new errors into previously clean outputs. Over-saving Bias: reviewers aggressively challenge “ignore” decisions but rarely challenge “save” decisions, promoting low-value items to avoid the risk of missing something.
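Both failure modes can be counted directly when a gold label exists. This is a sketch under that assumption: `review_effect` is a hypothetical helper, and the labels below are invented for illustration rather than taken from the triage run.

```python
from collections import Counter


def review_effect(gold: list[str], before: list[str], after: list[str]) -> Counter:
    """Count what a review pass changed, and whether each change helped or hurt."""
    counts: Counter = Counter()
    for g, b, a in zip(gold, before, after):
        if b == a:
            continue                  # reviewer left this decision alone
        counts[f"{b}->{a}"] += 1      # direction of the flip
        if b == g:
            counts["broken"] += 1     # review drift: a correct decision was overturned
        elif a == g:
            counts["fixed"] += 1      # the review repaired a wrong decision
    return counts


# Hypothetical labels for illustration only.
gold = ["save", "ignore", "ignore", "save", "ignore"]
before = ["save", "ignore", "ignore", "ignore", "ignore"]
after = ["save", "save", "save", "ignore", "ignore"]

effect = review_effect(gold, before, after)
print(effect["broken"], effect["fixed"])                # 2 0
print(effect["ignore->save"], effect["save->ignore"])   # 2 0
```

A high `broken` count is review drift. A lopsided `ignore->save` count against `save->ignore` is the over-saving bias.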
I also tested whether the critique source matters. Self-feedback and external frontier critique produced identical outcomes on every task. The separation-of-duties thesis — that a model cannot judge its own work — was neither confirmed nor refuted. The critique source is not the variable. What matters is whether the model can act on the correction signal. At temperature zero, if it could not act on it the first time, it will not act on it the fourth time either.
Escalate, Do Not Loop
Once I understood that same-tier repair loops had a vanishingly narrow useful range at temperature zero, the architecture simplified dramatically. Do not loop the same model. Escalate to a more capable one.
I built what I called modeE: a three-tier escalation chain. Tier 1 is the local model (Gemma 4 MoE, self-hosted on the DGX Spark, free), and it gets one attempt because retrying it at temperature zero almost never changed the outcome in my runs. Tier 2 is a frontier reasoning model (o3 via the OpenAI API), and it gets up to two attempts when test feedback is available, because o3 is non-deterministic and benefits from error steering. Tier 3 is the most capable model available, gpt-5.5, used as a final attempt only if needed. The deterministic test harness gates promotion between tiers. Each tier starts from a clean workspace with no context from prior tiers.
| Architecture | Pass Rate | Avg Cost per Task |
|---|---|---|
| Local only | 25% | $0.00 |
| Local + repair loop | 25% | $0.00 (same outcome) |
| Three-tier escalation | 75% | $0.13 |
The two-tier jump — local to o3 — provided all the value. gpt-5.5 did not solve any task that o3 failed on in this test set. The architecture is simple. Attempt, test, pass or escalate. No same-tier local repair loops. No always-on governors. No universal critique passes. A capability ladder with a deterministic exit gate at every rung.
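The attempt, test, pass-or-escalate flow can be sketched in a few lines. This is an illustration under stated assumptions, not the modeE implementation: `Tier`, `run_cascade`, and the stub workers and harness are hypothetical, and the stubs stand in for real model calls and a real test runner.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional

# A worker writes its attempt into the workspace; feedback is None on the first try.
Worker = Callable[[Path, Optional[str]], None]
# The harness returns failing-test output, or an empty string when everything passes.
Harness = Callable[[Path], str]


@dataclass
class Tier:
    name: str
    worker: Worker
    max_attempts: int


def run_cascade(tiers: list[Tier], harness: Harness) -> Optional[tuple[str, int]]:
    """Return (tier_name, attempt) for the first passing attempt, else None."""
    for tier in tiers:
        # Clean workspace per tier: nothing carries over from the tier below.
        with tempfile.TemporaryDirectory() as tmp:
            workspace = Path(tmp)
            feedback = None
            for attempt in range(1, tier.max_attempts + 1):
                tier.worker(workspace, feedback)
                feedback = harness(workspace)
                if not feedback:
                    return tier.name, attempt
    return None  # every tier failed: the architecture's ceiling


# Hypothetical stubs standing in for model calls and the test harness.
def local_stub(ws: Path, feedback: Optional[str]) -> None:
    (ws / "fix.py").write_text("answer = 41\n")


def frontier_stub(ws: Path, feedback: Optional[str]) -> None:
    value = 42 if feedback else 41  # only succeeds once it has seen test output
    (ws / "fix.py").write_text(f"answer = {value}\n")


def harness_stub(ws: Path) -> str:
    return "" if "42" in (ws / "fix.py").read_text() else "expected answer == 42"


tiers = [Tier("local", local_stub, 1), Tier("o3", frontier_stub, 2),
         Tier("gpt-5.5", frontier_stub, 1)]
print(run_cascade(tiers, harness_stub))  # ('o3', 2)
```

The attempt budgets (1, 2, 1) mirror the chain described above. The harness is the only component that decides whether a tier is done.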
Two tasks defeated all three tiers. One had a backwards feature test that makes it unsolvable regardless of model capability, which is a test quality issue, not a model issue. The other is a specific edge case blind spot shared across model families: all models handle the non-empty case correctly but miss the empty guard. Tasks that no tier in the chain can solve exist. The architecture has a ceiling.
That first failure points to a broader lesson. The validator becomes the authority, but that also means validator quality becomes part of the system’s correctness boundary. Deterministic code does not magically remove judgment. It moves judgment into tests, schemas, validators, and policy code. If the test is wrong, the architecture faithfully enforces the wrong answer.
Every enterprise architect already knows this pattern. It is L1/L2/L3 support. The help desk agent works from scripts. If the script does not resolve the ticket, the agent does not try the script harder. They escalate to L2. L2 has more capability and context. If L2 cannot close it, L3 gets it. Nobody calls that a loop. It is a tiered resolution architecture.
The validator plays the same role as the runbook in L1 support. It determines whether the issue is resolved at this tier. The escalation trigger replaces the “I cannot solve this” decision. The Decision Authority Placement Model is the formal version of what IT operations teams already do with human labor.
But What About Claude Code?
An obvious objection: Claude Code loops and it works extremely well. I used it to build this entire experiment in four days. If repair loops are dead, why does the most visible agentic coding tool on the market run one?
Because Claude Code’s loop is not a mechanical repair loop at temperature zero. It is a conversation. The developer reframes the problem, adds context, changes the prompt between attempts. The important distinction is that the state of the problem changes between attempts. Each iteration has genuinely different input, which means genuinely different output. That is not looping. That is a human providing escalation. The developer is L2.
I tested this directly. I gave o3 a second attempt on a task it failed, feeding back the test errors from the first attempt. o3 is non-deterministic — unlike the local model at temperature zero, it produces different output on the same prompt.
| Run | Attempt 1 | Attempt 2 (with feedback) | Result |
|---|---|---|---|
| 1 | fail (4/5 pass) | pass (5/5) | feedback fixed it |
| 2 | empty files | empty files | o3 refused both times |
| 3 | fail (4/5 pass) | pass (5/5) | feedback fixed it |
When o3 produced code but got one test wrong, the error feedback fixed it on the second attempt in both runs where that happened. Two of three runs recovered. The only failure was the run where o3 returned empty files, a different failure mode entirely in which the model would not engage with the task.
The design rule is now empirically grounded on both sides. Deterministic workers, meaning local models at temperature zero, get one attempt. Retrying almost never changed the outcome in my runs, so escalate immediately on failure. Non-deterministic workers, whether frontier models or a human reframing the problem in Claude Code, benefit from a second attempt because each iteration is genuinely different. The error feedback steers the model somewhere new.
The repair loop did not fail because looping is wrong. It failed because looping a deterministic process is wrong. Match your retry policy to the worker’s determinism. The validator does not care either way. It just checks whether the output passes.
Tool Calling Should Be Treated Like L3 Work
Agentic tool calling is often the most expensive type of work in this architecture, and not just because of compute. Every tool call adds coordination overhead: parsing, execution, error handling, state management, permissions, observability, and another place where the system can fail. Complexity is a cost, even when the token bill looks cheap.
From the architecture’s perspective, agentic tool calling is irrelevant as a category. It is just one level of capability in the abstraction. The worker at any tier might use tools. A local model writing code is a worker. A complex agentic system with tool calling and RAG is a worker. The architecture does not care what happens inside the tier. It cares about the exit gate and the escalation policy.
But from a cost perspective, tool calling matters a lot. It is the most expensive tier. Running agentic tool-calling loops on tasks that a single-pass classifier or code generator would handle is paying L3 prices for L1 work.
In the harder subset used for the escalation comparison, a quarter of my coding tasks, all real bugs from real repos, are solved by the local model on the first attempt. No tools, no loop, no agent, no cost. Running those tasks through an agentic architecture would add cost and latency for zero improvement in outcome. Another half are solved by adding a single frontier call at $0.04 to $0.24 per task. Only the genuinely hardest tasks need the most expensive tier, and even then, some are unsolvable by any model.
This connects to the cost-per-finished-job argument. The industry prices AI by the token. Genius per token. The capability cascade prices it by task resolution. The cheapest resolution is not merely the one with the lowest token cost. It is the simplest path that satisfies the exit condition: a single-pass model checked by deterministic code, when that is enough. Cost-per-finished-job includes compute, latency, operational complexity, failure surface, and governance burden. Tool-calling agentic systems are the expensive tier you invoke only when cheaper tiers cannot satisfy the exit condition. This is not anti-agent. It is anti-agent-by-default.
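The API-spend part of that metric is simple arithmetic: average cost per task divided by the fraction of tasks that finish. The sketch below applies it to the pass rates and costs from the architecture table above. The function name is hypothetical, and it deliberately ignores latency, operational complexity, and governance burden, which the full metric includes.

```python
def cost_per_finished_job(avg_cost_per_task: float, pass_rate: float) -> float:
    """API spend per task that actually passed the exit gate."""
    if pass_rate <= 0:
        raise ValueError("no finished jobs: cost per finished job is undefined")
    return avg_cost_per_task / pass_rate


# Three-tier escalation: $0.13 average per task at a 75% pass rate.
print(round(cost_per_finished_job(0.13, 0.75), 3))  # 0.173

# Local only: free per task, but only 25% of tasks finish.
print(cost_per_finished_job(0.00, 0.25))  # 0.0
```

The local-only row shows the limit of the metric: a finished job costs nothing, but three quarters of the jobs never finish.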
The purchasing decision is familiar. Nobody sends every support ticket to the engineering team. Nobody flies a consultant in for a password reset. Defaulting to the most expensive resolution tier for every AI task is the same mistake enterprises learned to stop making with human labor decades ago.
From the Outside, This Is How Thinking Models Already Work
Chain-of-thought. Internal verification. Escalation from fast System 1 to slow System 2. Spending more compute when the cheap pass does not satisfy the constraint.
From the outside, thinking models appear to follow a similar economic pattern: less like same-tier retry loops and more like systems that spend additional compute when the cheap pass is not enough.
The temperature-zero repair loop failure proves this from the negative side. A model that loops at the same capability level without escalating is doing exactly what my local repair loop did — producing the same wrong answer repeatedly. The observable value of thinking models is that they spend more compute when the cheap pass is not enough. My experiment rediscovered that principle externally, the hard way. And when I tested o3 with a second attempt, the non-deterministic frontier model behaved the way you would expect — it produced different output, and the error feedback steered it to the right answer.
The architecture I built is the same pattern, just distributed across hardware tiers with the validator as an explicit governable component instead of a hidden internal mechanism.
The difference that matters for enterprises: when the cascade is internal to the model, you cannot see the exit gate, you cannot audit the escalation decision, you cannot measure the cost at each tier, and you cannot place authority deliberately. Thinking models solved the capability cascade for themselves. Enterprises need to solve it for their agent architectures — and they need it to be visible, auditable, and governable.
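An external cascade can make each gate decision a record. The sketch below shows one possible shape for that audit trail. `GateEvent` and its fields are hypothetical, and the events and costs are invented for illustration.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GateEvent:
    task_id: str
    tier: str
    attempt: int
    passed: bool      # the validator's verdict at this gate
    cost_usd: float   # spend attributed to this attempt


def cost_by_tier(events: list[GateEvent]) -> dict[str, float]:
    """Total spend at each tier."""
    totals: dict[str, float] = defaultdict(float)
    for e in events:
        totals[e.tier] += e.cost_usd
    return dict(totals)


def resolved_at(events: list[GateEvent]) -> dict[str, str]:
    """Map each task to the tier whose attempt first passed the gate."""
    out: dict[str, str] = {}
    for e in events:
        if e.passed and e.task_id not in out:
            out[e.task_id] = e.tier
    return out


# Hypothetical events for illustration only.
events = [
    GateEvent("task-a", "local", 1, True, 0.0),
    GateEvent("task-b", "local", 1, False, 0.0),
    GateEvent("task-b", "o3", 1, True, 0.04),
]

for e in events:
    print(json.dumps(asdict(e)))  # one auditable line per gate decision

print(resolved_at(events))   # {'task-a': 'local', 'task-b': 'o3'}
print(cost_by_tier(events))  # {'local': 0.0, 'o3': 0.04}
```

Every escalation then has a visible cause (a failed gate) and a measurable price, which is exactly what an internal cascade hides.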
What This Means for Layer 2C
Layer 2C in the 4+1 model — the reasoning plane — is not one model reasoning harder. It is the orchestration of a capability cascade with deterministic exit conditions at every tier.
The components are straightforward. A validator that determines doneness. An escalation policy that decides when to swap capability tiers. Workers at each tier that are interchangeable labor. Whether a worker is a local classifier, a cloud code generator, or a full agentic system with tool calling is an implementation detail scoped to that tier.
The pattern is fractal. Local model to cloud model to frontier reasoning model. Or 4o to o3 to 5.5. Or a local classifier to a local agentic system to a cloud agentic system. Same validator, same escalation policy, different capability at each level.
Authority placement follows the Decision Authority Placement Model. Authority over “is this done” lives in the validator. Authority over “escalate to a more capable tier” lives in the escalation policy. The model at each tier has authority over bounded work — nothing more.
The enterprises that understand this will build governable agent systems that resolve most tasks cheaply and escalate deliberately. The ones chasing loop demos will keep paying L3 prices for L1 work and wondering why their agents confidently do the wrong thing.
Appendix – Data
Keith Townsend is a seasoned technology leader and Founder of The Advisor Bench, specializing in IT infrastructure, cloud technologies, and AI. With expertise spanning cloud, virtualization, networking, and storage, Keith has been a trusted partner in transforming IT operations across industries, including pharmaceuticals, manufacturing, government, software, and financial services.
Keith’s career highlights include leading global initiatives to consolidate multiple data centers, unify disparate IT operations, and modernize mission-critical platforms for “three-letter” federal agencies. His ability to align complex technology solutions with business objectives has made him a sought-after advisor for organizations navigating digital transformation.
A recognized voice in the industry, Keith combines his deep infrastructure knowledge with AI expertise to help enterprises integrate machine learning and AI-driven solutions into their IT strategies. His leadership has extended to designing scalable architectures that support advanced analytics and automation, empowering businesses to unlock new efficiencies and capabilities.
Whether guiding data center modernization, deploying AI solutions, or advising on cloud strategies, Keith brings a unique blend of technical depth and strategic insight to every project.




