Hybrid Is Not a Place: The AI Architecture Google and Dell Both Pointed Toward

Published On: May 20, 2026

Google positioned Gemini 3.5 Flash as frontier-class agentic performance at half to a third the cost of comparable OpenAI and Anthropic models. Dell Technologies positioned Deskside Agentic AI as a way to save up to $1M over a two-year period if you build your agentic code pipeline on their Dell Pro Max GB300. If you listened to both companies coming out of their premier user events, you’d walk away thinking the sweet spot is running Gemini-class models on Dell hardware. I’d forgive you for that assumption.

The truth lies somewhere between these two marketing messages.

The problem is that AI spend for agentic coding is out of control. As of late May 2026, the time of this writing, organizations are reporting that they have exhausted all of their budget for agentic AI coding tokens. The simplistic solution would be to buy cheaper, more capable models from Google or to run your AI locally on Dell. My testing reveals something more nuanced.

We are back in the world of hybrid cloud, this time for AI. The difference is maturity. We are 16 years into the dance between cloud providers and on-prem data center vendors, and the industry has figured out the workload patterns and built the abstractions and operating models that draw the line between workloads. With AI we are still early. We haven’t figured out the shape and size of the workloads, and the capabilities are advancing faster than we can operationalize them.

From the two events and my research, we can see the shape of the operating model that keeps you ahead of the curve. It comes down to determining who decides done, so that you have a proper Layer 2C control plane, the orchestration layer that decides where your agentic workloads are placed.

What Google Is Actually Selling

Gemini 3.5 Flash is not a cheap model. At $1.50 per million input tokens and $9.00 per million output, it’s 3x more expensive than the previous Flash. But it beats Gemini 3.1 Pro on coding and agentic benchmarks while running roughly 4x faster — around 284 tokens per second. Google is calling this frontier-class performance at half to a third the cost of OpenAI and Anthropic.
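To make the per-token pricing concrete, here is a minimal sketch of the arithmetic for a single call at the prices quoted above. The function name and the example token counts are illustrative assumptions, not measurements from any real workload.

```python
# Prices quoted in the article for Gemini 3.5 Flash, in USD per million tokens.
INPUT_PRICE_PER_M = 1.50
OUTPUT_PRICE_PER_M = 9.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one call at the quoted per-million prices."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical agentic step: 40k tokens of context in, 2k tokens of patch out.
example_cost = request_cost(40_000, 2_000)
print(f"${example_cost:.3f} per call")
```

Agentic calls tend to be input-heavy because every step resends context, so the split between input and output tokens matters as much as the headline prices.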

How can they do that? They built and control the whole stack. Their silicon, their data centers, their models. We’ve seen this play with AWS Lambda and Graviton. Except this looks much closer to what Apple does with their Apple Silicon to OS to Services integration.

Google serves billions of users on the underlying Gemini model on TPUs across all of its services. While not exactly a rounding error, it’s an AI infrastructure cost structure advantage that doesn’t go away. It gets wider as inference demand scales. When your competitor is renting GPUs and you’re designing your own TPUs, you’re playing a different game.

Google is already building the cascade, too. At I/O they showed Gemma 4 on Cloud Run handling lightweight agentic work, escalating to Gemini 3.5 Flash for harder reasoning. Both in the cloud. Google’s golden path is a capability cascade within their own infrastructure from Android, to Cloud Run, to Gemini. They don’t need you to buy hardware. They need you to stay on their platform while they give you the cheap tier and the expensive tier under one roof.

What Dell Is Actually Selling

Dell’s economics case comes from a Signal65 study. Three agent personas — knowledge worker, sales agent, software development — running persistently 24 hours a day, 260 days a year. For software development on the GB300 Ultra, Signal65 found $1.06M in cloud cost versus $138K on-prem over two years. That’s up to 87% lower cost with payback in 3 to 11 months. The Dell Pro Max with GB10 — same chip as my DGX Spark — starts at $3,699.

I have to be clear here: this isn’t a realistic scenario. Anything with this type of demand would get a much more robust design that goes well beyond the single deskside failure domain. But the numbers are directionally right. Agentic token consumption compounds. A single developer at Dell burned through 1 billion tokens in 24 hours, a $3,400 cloud bill. Multiply that across a team and cloud-only is not a strategy. It’s a budget surprise.

Dell’s golden path is also a cascade — start on a deskside workstation, scale to PowerEdge servers in the data center. Own the hardware, stop paying per token. Same thesis as Google, opposite side of the transaction.

What Both Miss

Both companies’ models fall into the same homogeneous trap that their marketing teams love. Google’s cascade stays within Google. Dell’s cascade stays within Dell. Neither accounts for the pattern enterprise workloads have followed over the past 16 years of hybrid cloud: there will be boundaries based on workloads, and judging where those boundaries fall is the work.

My experiment shows why that matters. Last week I published “Who Determines Done?”, the results from running real bug-fix commits from open source repos as coding tasks on my DGX Spark, with frontier escalation to o3 and gpt-5.5. Ten milestones, two workloads, six local model configurations. Total API cost for the entire project: $4.60 across 156 calls.

The difficulty distribution was bimodal. About 45% of tasks were trivially easy — local model, first attempt, done. About 50% were too hard for the local model no matter how many times you retry. The middle band where a repair loop actually helps? About 5%.

That thin middle band matters. The industry is building agentic architectures optimized for the 5% case and paying for it on 100% of tasks.

The three-tier escalation chain — local model gets one shot, o3 gets up to two, gpt-5.5 as the final fallback — took pass rate from 25% to 75%. The local tier cost nothing beyond hardware I already owned. The frontier tier was cheap because most tasks never got there.
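The chain described above can be sketched in a few lines. This is a minimal illustration, not the code from the experiment: the `Tier`, `Result`, and `escalate` names are hypothetical, the model functions are stubs standing in for real local and API calls, and a single attempt for the final fallback is an assumption, since the attempt count for gpt-5.5 is not stated.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A model call takes a task description and returns a candidate patch.
ModelFn = Callable[[str], str]
# A validator returns True only when the candidate meets the exit gate.
Validator = Callable[[str, str], bool]


@dataclass
class Tier:
    name: str
    call: ModelFn
    max_attempts: int


@dataclass
class Result:
    passed: bool
    tier: Optional[str]
    attempts: int
    patch: Optional[str]


def escalate(task: str, tiers: list[Tier], validate: Validator) -> Result:
    """Try each tier in order; stop at the first candidate the validator accepts."""
    attempts = 0
    for tier in tiers:
        for _ in range(tier.max_attempts):
            attempts += 1
            patch = tier.call(task)
            if validate(task, patch):
                return Result(True, tier.name, attempts, patch)
    return Result(False, None, attempts, None)


# Stub models so the sketch runs without any API; real calls would go here.
def local_stub(task: str) -> str:
    return "local-patch"


def o3_stub(task: str) -> str:
    return "o3-patch"


def gpt55_stub(task: str) -> str:
    return "gpt55-patch"


CHAIN = [
    Tier("local", local_stub, max_attempts=1),
    Tier("o3", o3_stub, max_attempts=2),
    Tier("gpt-5.5", gpt55_stub, max_attempts=1),  # assumed single attempt
]

# Toy validator: pretend only the o3 stub's output passes the test suite.
result = escalate("fix bug", CHAIN, lambda task, patch: patch == "o3-patch")
print(result.tier, result.attempts)
```

The loop itself carries no judgment. Every decision about success lives in the validator, which is why the same function works unchanged when a tier's model is swapped.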

Here’s the math that begins to show the patterns we’ve seen for the past 16 years, just adapted for hybrid AI. If ~45% of real coding tasks resolve locally at near-zero marginal cost and another ~25% resolve with a single frontier call at $0.04 to $0.24, the blended cost per finished job is dramatically lower than either the all-cloud or all-local number. You’re not comparing “$138K on-prem versus $1M in cloud APIs.” You’re comparing “$138K on-prem plus a few hundred dollars in targeted frontier calls versus $1M in cloud APIs where most of that spend is frontier-tier pricing on tasks a 30B model handles fine.”
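As a rough worked example of the blended number, the sketch below weights each band of tasks by its share. The 45% and 25% shares come from the paragraph above; the $0.14 figure is simply the midpoint of the quoted $0.04 to $0.24 range, and the $0.50 spend on the remaining 30% of tasks is a placeholder assumption, not a measured value.

```python
def blended_cost_per_task(bands: list[tuple[float, float]]) -> float:
    """Weight each band's API cost by its share of tasks.

    bands: (share_of_tasks, api_cost_usd_per_task) pairs; shares must sum to 1.
    """
    total_share = sum(share for share, _ in bands)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError(f"shares sum to {total_share}, expected 1.0")
    return sum(share * cost for share, cost in bands)


bands = [
    (0.45, 0.00),  # resolved locally, near-zero marginal cost
    (0.25, 0.14),  # one frontier call, midpoint of the quoted range (assumed)
    (0.30, 0.50),  # remaining tasks, placeholder spend (assumed)
]

blended = blended_cost_per_task(bands)
print(f"${blended:.3f} average API spend per task")
```

Even with a pessimistic placeholder for the hard tail, the average is pulled down by the large share of tasks that never reach a paid tier.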

Where Flash 3.5 Fits

Flash 3.5 changes the Tier 2 calculus. My experiment used o3 as Tier 2 and it worked — o3 is not the newest frontier model, it’s much cheaper than the latest ones, and it has a high success rate on coding tasks relative to cost. That’s the “Cost per Finished Job” argument. You don’t need the most capable model. You need the cheapest model that passes the exit gate.

Flash 3.5 at frontier-class capability and half to a third the cost? That could replace o3 as Tier 2 for a lot of tasks. I haven’t tested it yet. But the architecture is designed for exactly this — swap the model, keep the validator, measure the pass rate, compare the cost. The tier is a slot. The model is replaceable.
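One way to run that comparison is to score each Tier 2 candidate on cost per finished job, meaning total spend divided by the number of tasks that passed the validator. The sketch below is an illustration under assumptions: `TierRun` is a hypothetical record type, and the candidate names and figures are invented placeholders, not results for o3 or Flash 3.5.

```python
from dataclasses import dataclass


@dataclass
class TierRun:
    model: str
    total_cost_usd: float
    tasks_attempted: int
    tasks_passed: int

    @property
    def pass_rate(self) -> float:
        return self.tasks_passed / self.tasks_attempted

    @property
    def cost_per_finished_job(self) -> float:
        # A model that never passes the gate has no finite cost per finished job.
        if self.tasks_passed == 0:
            return float("inf")
        return self.total_cost_usd / self.tasks_passed


# Placeholder numbers for two hypothetical candidates on the same task set.
runs = [
    TierRun("candidate-a", total_cost_usd=14.0, tasks_attempted=100, tasks_passed=60),
    TierRun("candidate-b", total_cost_usd=7.0, tasks_attempted=100, tasks_passed=50),
]

best = min(runs, key=lambda r: r.cost_per_finished_job)
print(best.model, round(best.cost_per_finished_job, 3), best.pass_rate)
```

Cost per finished job alone does not settle the choice. A candidate with a lower pass rate pushes more tasks to the next tier, so the pass rate should be read alongside the cost.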

The Flywheel

The sweet spot right now is what I’m actually doing: running local models on a Spark for initial coding passes, then escalating to frontier models for the hard stuff. The local model runs at near-zero marginal cost after the hardware investment. The frontier calls are cheap because most work never gets there. The hardware pays for itself not because of some abstract TCO model, but because you stop paying per token for work a local model handles.

My math: the Spark pays for itself after roughly 28,000 requests that would have otherwise gone to a frontier API. That number gets better as local models improve. Six months ago I couldn’t run Gemma 4 MoE on this hardware. Now it handles 45% of real bug-fix tasks on the first attempt.
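The breakeven logic is a single division. The inputs behind the 28,000 figure are not given here, so the values below are assumptions chosen only to show the shape of the calculation: $3,699 is the Dell Pro Max GB10 starting price quoted earlier, used as a stand-in for the hardware cost, and $0.13 per avoided frontier request is an assumed value inside the quoted $0.04 to $0.24 range.

```python
import math


def breakeven_requests(
    hardware_cost_usd: float,
    frontier_cost_per_request_usd: float,
    local_cost_per_request_usd: float = 0.0,
) -> int:
    """Number of requests served locally before the hardware cost is recovered."""
    saving = frontier_cost_per_request_usd - local_cost_per_request_usd
    if saving <= 0:
        raise ValueError("local requests must cost less than frontier requests")
    return math.ceil(hardware_cost_usd / saving)


# Assumed inputs, see the paragraph above.
n = breakeven_requests(hardware_cost_usd=3699.0, frontier_cost_per_request_usd=0.13)
print(n)
```

The local cost parameter defaults to zero to match the near-zero marginal cost framing. Setting it to a small per-request figure for power would push the breakeven out.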

As local models get better, the percentage that resolves at Tier 1 goes up. The frontier spend goes down. The breakeven accelerates. That’s the flywheel neither keynote articulated.

Hybrid Is Not a Place

Google’s own cascade doesn’t run entirely in the cloud. I sat in a session where Google showed the cascade from an Android device to a laptop to Cloud Run using Gemma 4 to Gemini 3.5 Flash. Same architecture, same escalation pattern, different infrastructure. The decision about where to run is a matter of economics and governance, not architecture.

Hybrid is not a place. It’s an operating model. The same escalation architecture works whether your Tier 1 is a Spark in your lab or a Gemma 4 container on Cloud Run. It all returns to Layer 2C — the orchestration layer that decides which model, at which tier, gets the next attempt. That layer doesn’t care where the model runs. It cares about capability, cost, and the exit condition.

Who Determines Done

Stop defaulting to the most expensive model. Build a validator that determines “done.” Route to the cheapest tier that passes the validator. Escalate on failure.

The local tier handles more than you think — 45% of real-world coding tasks in my testing. The frontier tier is cheaper than you think — o3 at $0.04-$0.24 per task, not the $3,400/day horror story. The middle is where Flash 3.5 competes — frontier capability at structurally lower cost because Google owns the silicon.

The question across all of this is the same one I keep coming back to: who determines done? The validator determines done. The escalation policy determines what happens when done is not reached. The infrastructure underneath shifts the economics of both decisions. What counts as “done” changes as model capabilities change. But infrastructure matters because it determines what you can afford to attempt.

If you’ve been in enterprise IT long enough, you’ve seen this movie. Early cloud had the same fights. Build versus buy. On-prem versus public. Hybrid as a compromise versus hybrid as a real architecture. Took a decade to figure out the answer was not a location — it was a governance model. We’re watching the same thing play out with AI inference right now. The patterns aren’t set. Build the governance layer first. Let the infrastructure decisions follow.

Dell gave us the hardware economics. Google gave us the cheap cloud tier. My lab work points to the missing piece between them: a capability cascade with deterministic exit gates.

The full experiment, methodology, and raw data are here. The escalation pattern is L1/L2/L3 support — every enterprise architect already knows it. The validator is the authority. The loop is plumbing. The question is still who determines done.

That’s not either company’s story to tell. That’s what enterprise architects have to build.

 
