I Just Wanted Endpoints
I have an NVIDIA DGX Spark on my desk. It's a serious piece of hardware, purpose-built for running AI workloads locally, and it saves me money when all I need is a small classification model to tag 200K+ Tech Field Day video segments. I should be able to point my applications at it and get inference endpoints, just as I'd use OpenAI, Anthropic, or Gemini. That's what I want. Endpoints.
What I got instead is container orchestration.
I’m running vLLM containers for some models and Ollama for others. Each runtime has different resource characteristics, different memory footprints, different serving behaviors. The Spark has finite accelerator memory, which means I’m making constant decisions about what gets loaded, what gets swapped, and which runtime serves which model. I’m managing container lifecycles, allocating resources, and mentally tracking what’s running where so I don’t starve one workload to feed another.
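The mental bookkeeping above can be made concrete. Here is a minimal sketch of the placement decision I'm making by hand; the model names, memory footprints, and eviction policy are illustrative assumptions, not a real orchestrator.

```python
# Sketch of manual model placement on a single node with finite
# accelerator memory. All sizes and names are hypothetical.
GPU_BUDGET_GB = 128

MODELS = {
    "llama-70b-vllm":   {"runtime": "vllm",   "mem_gb": 80},
    "classifier-small": {"runtime": "ollama", "mem_gb": 6},
    "qwen-32b-vllm":    {"runtime": "vllm",   "mem_gb": 48},
    "embed-model":      {"runtime": "ollama", "mem_gb": 10},
}

def place(requested, loaded):
    """Decide what to evict so `requested` fits in the memory budget.

    Greedy policy: evict the oldest-loaded models until there is room.
    This is exactly the decision I make mentally so one workload
    doesn't starve another."""
    need = MODELS[requested]["mem_gb"]
    used = sum(MODELS[m]["mem_gb"] for m in loaded)
    evicted = []
    while used + need > GPU_BUDGET_GB and loaded:
        victim = loaded.pop(0)            # oldest first
        used -= MODELS[victim]["mem_gb"]
        evicted.append(victim)
    loaded.append(requested)
    return loaded, evicted

# Loading a 48 GB model forces the 80 GB model out of memory.
loaded, evicted = place("qwen-32b-vllm", ["llama-70b-vllm", "embed-model"])
```

A real Layer 2C would make this decision with telemetry, priorities, and policy instead of a fixed greedy rule; the point is that someone, or something, has to make it.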
I am the orchestration layer on my own hardware. And it’s a terrible job.
This isn’t a complaint about the DGX Spark. It’s a recognition that there’s a layer missing between the hardware and the endpoints I actually want — a layer responsible for deciding what runs where, how resources get allocated, and how multiple serving runtimes coexist on shared infrastructure. In 4+1 terms, that’s Layer 2C. The Reasoning Plane.
The Same Problem at Production Scale
I sat in a briefing with an AI video startup at Google Cloud that had this exact problem but at scale. They’re running stacks of models for a single workflow: video fusion, object detection, lighting, shadows, occlusion. Each model is potentially 10 to 100 gigabytes. Different models require different accelerator types and different amounts of memory.
Their CTO said something that hit home: he doesn’t want his AI engineers understanding Kubernetes. He wants managed infrastructure that works. He wants to focus on product and customers, not container runtimes.
But the nature of the workload forces partial retention of exactly the decisions he’d prefer to hand off. Model placement across heterogeneous accelerators, memory tiering for large model weights, cold-start optimization for interactive serving — these are decisions that require domain knowledge that the platform can’t fully abstract away yet.
Google is building toward solving this. Their Inference Gateway routes AI requests based on KV cache disposition and workload characteristics — yielding significant improvements in latency and cost. The Dynamic Workload Scheduler matches workloads to available resources and productizes accelerator availability through mechanisms such as Flex Start. They’re building managed KV cache tiering that traverses HBM, local storage, and managed storage. These are all Reasoning Plane capabilities — orchestration intelligence that sits between hardware and application.
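To make the routing idea tangible, here is a simplified sketch of cache-aware request routing. This is not Google's Inference Gateway logic; the replica fields and scoring rule are assumptions meant only to show why KV cache disposition matters for latency.

```python
# Toy cache-aware router: prefer a replica that already holds the
# prompt prefix in its KV cache (skipping prefill work), then break
# ties on queue depth. Replica state is illustrative.
def score(replica, prompt_prefix):
    cache_hit = 1 if prompt_prefix in replica["warm_prefixes"] else 0
    return (cache_hit, -replica["queue_depth"])

def route(replicas, prompt_prefix):
    return max(replicas, key=lambda r: score(r, prompt_prefix))

replicas = [
    {"name": "a", "warm_prefixes": {"sys-v1"}, "queue_depth": 4},
    {"name": "b", "warm_prefixes": set(),      "queue_depth": 1},
]

# The warm cache wins even though replica "b" has a shorter queue,
# because avoiding prefill usually dominates the latency budget.
best = route(replicas, "sys-v1")
```

A production gateway folds in far more signal (load, SLO class, accelerator type), but the shape of the decision is the same: routing is an optimization over serving state, not round-robin.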
What was more impressive is that I didn't hear a single reference to tokens or the cost of tokens. This CTO was taking full advantage of GPUs in Cloud Run, with GKE handling orchestration, the way I'd like to take advantage of the 128GB of GPU memory on my Spark.
But here’s what’s important: this customer’s experience and mine are the same problem at different scales. I’m manually orchestrating models across runtimes on a single node. He’s doing it across GPU clusters in the cloud. The gap is identical. Neither of us has a complete Layer 2C.
Naming the Gap
Layer 2C — the Reasoning Plane — is the orchestration intelligence responsible for workload placement, resource allocation, runtime coordination, and serving optimization across AI infrastructure. It’s not networking. It’s not the model itself. It’s the decision-making layer between compute and application that determines how infrastructure gets used.
The reason it matters as a distinct architectural concept is that it shows up at every scale. On my Spark, it’s the decisions I make manually about which models to load into which runtimes. At the Google Cloud customer’s scale, it’s the Inference Gateway and Dynamic Workload Scheduler. At hyperscaler platform scale, it’s the full suite of managed services that abstract accelerator topology, memory hierarchy, and serving optimization away from the practitioner.
The complexity isn’t proportional to scale. It shows up the moment you have more than one model and more than one runtime. My single-node DGX Spark proves that.
To be fair, the community is responding. Stacks like LiteLLM paired with llama-swap are emerging on DGX Spark specifically to provide a single OpenAI-compatible endpoint that dynamically loads and unloads models across runtimes. It’s real progress — endpoint abstraction over heterogeneous serving backends on a single node. But it’s routing and model swapping, not reasoning. It doesn’t factor in business context, SLAs, workload priority, or governance policy. It’s plumbing toward Layer 2C, not the layer itself.
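The distinction between plumbing and reasoning is easy to see in code. Below is a minimal sketch of what a LiteLLM-plus-llama-swap style layer does, with hypothetical backend URLs and model names; it is not either project's actual implementation.

```python
# One endpoint, per-request dispatch, load-on-demand.
# Backend URLs and model names are illustrative assumptions.
BACKENDS = {
    "llama3-70b": {"runtime": "vllm",   "url": "http://localhost:8000/v1"},
    "phi-3-mini": {"runtime": "ollama", "url": "http://localhost:11434/v1"},
}

loaded = set()

def dispatch(model):
    """Route a request to the backend serving `model`, 'loading' it
    on first use. Note what is absent: no SLA, no workload priority,
    no governance policy -- the decision is keyed on model name alone.
    That is routing and swapping, not reasoning."""
    if model not in BACKENDS:
        raise KeyError(f"unknown model: {model}")
    if model not in loaded:
        loaded.add(model)   # stand-in for a real load/unload cycle
    return BACKENDS[model]["url"]
```

Everything a Reasoning Plane would add, placement informed by business context, priority-aware eviction, policy enforcement, lives in the inputs this function doesn't take.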
Who Governs the Reasoning Plane?
This is where DAPM — the Decision Authority Placement Model — applies directly.
On my Spark, I retain full governance over Layer 2C because no platform exists to delegate to. I decide what runs, where, and when. That gives me complete control and zero leverage.
The Google Cloud customer wants to cede that authority. And Google’s platform lets him delegate a significant portion of it — workload scheduling, cold start, resource matching. But he still retains model placement decisions and runtime configuration because the platform can’t yet fully abstract his domain-specific requirements. He’s in a hybrid governance posture — partially delegated, partially retained — and that’s the source of the operational friction he described.
Google, AWS, Azure — each is building their own Reasoning Plane, and adopting one means accepting their orchestration logic for your workloads. That’s a legitimate choice. The Dynamic Workload Scheduler decides where your models land. The Inference Gateway decides how your requests get routed. For many organizations, that’s the right tradeoff — you get to production faster by borrowing the cloud provider’s judgment about placement, scheduling, and optimization.
But it is borrowed judgment.
The Stepping Stone
This is where platforms like Kamiwaza enter the picture. Kamiwaza is packaging the Reasoning Plane for on-premises and edge AI infrastructure. The value proposition maps directly to the gap I’ve been describing: if you’re running your own accelerators and you don’t want to manually orchestrate container runtimes, model placement, and resource allocation — but you also don’t want to cede those decisions to a hyperscaler — you need a platform that provides Layer 2C as a managed capability while letting you retain governance authority.
That’s a stepping stone, and a meaningful one. You’re not building Inference Gateway from scratch. You’re not manually playing the role of the orchestration layer the way I am on my Spark. You’re adopting a platform that handles the Reasoning Plane decisions while keeping that authority within your infrastructure boundary.
The North Star
The Fourth Cloud framework describes the maturity model for this trajectory. At full Fourth Cloud maturity, an organization operates its own AI infrastructure with its own Reasoning Plane — making placement, scheduling, and optimization decisions based on its own business logic, its own data governance requirements, and its own workload priorities. It’s not borrowing judgment from a public cloud provider’s orchestration layer. It owns the decisions.
That’s the north star, not the starting point. Most organizations aren’t there, and shouldn’t pretend to be. But understanding the destination matters because it changes how you evaluate every intermediate step. When you adopt a hyperscaler’s Reasoning Plane, you should know what you’re delegating. When you evaluate an on-prem platform like Kamiwaza, you should understand it as a step toward retained governance over Layer 2C, not just an infrastructure purchase.
And when you find yourself on a Tuesday afternoon manually deciding which vLLM container gets priority on a DGX Spark, you should recognize that you’re doing the work of a layer that doesn’t exist yet on your hardware — and that every organization running AI models on their own infrastructure is making the same invisible decisions.
The question isn’t whether you need a Reasoning Plane. You already have one. The question is whether it’s you.
If you’re building the Reasoning Plane, I want to put you in front of the people who need it.
The Advisor Bench runs Buyer Rooms — structured sessions where enterprise CxOs and senior practitioners pressure-test vendor capabilities against real infrastructure decisions. If you’re building platforms that address the Layer 2C gap described in this piece — orchestration intelligence for AI workloads across on-prem, edge, or hybrid infrastructure — a Buyer Room puts your solution in front of the practitioners making those decisions today. No keynotes. No demos to a general audience. Direct, technical engagement with the buyers who are currently managing this layer manually and know exactly what’s missing. Contact me to learn more.
Keith Townsend is a seasoned technology leader and Founder of The Advisor Bench, specializing in IT infrastructure, cloud technologies, and AI. With expertise spanning cloud, virtualization, networking, and storage, Keith has been a trusted partner in transforming IT operations across industries, including pharmaceuticals, manufacturing, government, software, and financial services.
Keith’s career highlights include leading global initiatives to consolidate multiple data centers, unify disparate IT operations, and modernize mission-critical platforms for “three-letter” federal agencies. His ability to align complex technology solutions with business objectives has made him a sought-after advisor for organizations navigating digital transformation.
A recognized voice in the industry, Keith combines his deep infrastructure knowledge with AI expertise to help enterprises integrate machine learning and AI-driven solutions into their IT strategies. His leadership has extended to designing scalable architectures that support advanced analytics and automation, empowering businesses to unlock new efficiencies and capabilities.
Whether guiding data center modernization, deploying AI solutions, or advising on cloud strategies, Keith brings a unique blend of technical depth and strategic insight to every project.




