Does a 10M Token Context Window Kill the Need for RAG? Not Even Close
There’s a line of thinking floating around that goes something like this:
“With LLaMA 4 offering a 10 million token context window, we don’t need RAG anymore.”
Let’s pump the brakes on that.
Yes, 10 million tokens is a massive leap. At roughly four characters per token, that's about 40MB of raw text. For reference, that’s about 80 full-length novels. You could paste the entirety of The Return of the King into a prompt and ask the model, “When do Frodo and Sam reach Mordor?”—and it would give you a solid answer, no external database required.
That’s a serious technical achievement. But let’s not confuse a bigger bucket with better plumbing.
Context Windows and RAG Are Solving Different Problems
A context window is temporary memory. It’s what the model sees right now to answer your prompt. If you cram 10M tokens into it, you’re handing the model everything at once and hoping it figures out what’s important.
But here’s the problem: you get charged for every single token you include in that context window. Every time you send a prompt.
So if you load 125,000 tokens to answer a simple question, you’re paying for all of them—even if the model only needed 2%.
That’s not just expensive. It’s inefficient.
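To make the math concrete, here's a back-of-the-envelope sketch. The per-token price below is a made-up placeholder, not a quote for any particular model; the point is the ratio, not the dollar amounts.

```python
# Hypothetical pricing for illustration only -- swap in your model's real rates.
PRICE_PER_1K_INPUT_TOKENS = 0.01  # assumed $/1K input tokens

def prompt_cost(context_tokens: int) -> float:
    """Cost of the context you attach to a single prompt."""
    return context_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

everything = prompt_cost(125_000)  # load the whole document set
just_enough = prompt_cost(2_500)   # the ~2% the model actually needed

print(f"Full context:   ${everything:.2f} per prompt")
print(f"Retrieved only: ${just_enough:.3f} per prompt")
# With these made-up prices that's roughly $1.25 vs $0.025 -- and you pay it
# again on every follow-up question.
```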
Now contrast that with RAG—Retrieval-Augmented Generation. With RAG, you’re indexing your data (often with embeddings) and retrieving only the most relevant chunks when a user asks a question. Instead of feeding the model 40MB of raw context, you’re giving it the exact 1-2 paragraphs it actually needs.
It’s like going from scanning an entire book to jumping straight to the right page.
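Here's a minimal sketch of that retrieval step. The `embed()` function is a placeholder standing in for whatever embedding model you use; the helper names are mine for illustration, not any specific library's API.

```python
import math

def embed(text: str) -> list[float]:
    """Placeholder: call your embedding model of choice here."""
    raise NotImplementedError  # assumption: any embedding API will do

def cosine(a: list[float], b: list[float]) -> float:
    """Similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def retrieve(question: str, chunks: list[str], chunk_vecs: list[list[float]],
             top_k: int = 3) -> list[str]:
    """Return only the chunks most similar to the question."""
    q_vec = embed(question)
    ranked = sorted(zip(chunks, chunk_vecs),
                    key=lambda pair: cosine(q_vec, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# Index once, up front: chunk_vecs = [embed(c) for c in book_chunks]
# At question time, the prompt carries a couple of paragraphs, not 40MB:
# context = "\n\n".join(retrieve("When do Frodo and Sam reach Mordor?",
#                                book_chunks, chunk_vecs))
```

The important part is the shape of the pipeline: the corpus is embedded once at index time, and only the top few chunks travel with each prompt.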
My Example: Using My Tweet History with ChatGPT
Let me make this real.
I loaded my entire Twitter archive into ChatGPT. That’s over 100MB of text—a full history of my thoughts, takes, and commentary over the years.
Now, if I want the model to help me write a blog post or generate a thread that sounds like me, I don’t shove the full 100MB into the prompt. That would be absurd—not to mention costly and slow.
Instead, I use retrieval. I search my tweet corpus for relevant content, pull back a few key posts, and drop those into the prompt. That gives the model enough signal to emulate my voice and stay consistent with past content—without overwhelming it or wasting tokens.
That’s RAG in action. Not a hack. Not a workaround. An intentional optimization.
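For the curious, the workflow looks roughly like this. The `search_tweets()` helper is a hypothetical stand-in that scores by keyword overlap; in practice it would be the same kind of embedding search sketched above.

```python
def search_tweets(topic: str, archive: list[str], top_k: int = 5) -> list[str]:
    """Stand-in ranking: score tweets by words shared with the topic.
    A real setup would use embedding similarity instead."""
    words = set(topic.lower().split())
    ranked = sorted(archive,
                    key=lambda t: len(words & set(t.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def build_voice_prompt(topic: str, archive: list[str]) -> str:
    """Pull a handful of relevant posts, not the whole 100MB archive."""
    examples = "\n".join(f"- {t}" for t in search_tweets(topic, archive))
    return (
        "Here are a few of my past tweets on this topic:\n"
        f"{examples}\n\n"
        f"Draft a blog post intro about {topic} in the same voice."
    )
```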
RAG Is About Relevance and Scale
People treat RAG like it was invented to work around small context windows. But that’s a narrow view.
RAG is a retrieval architecture, not just a fallback. It’s what allows you to keep your model’s working memory lean and focused. And it becomes absolutely necessary once you start working with real-world, enterprise-scale data.
Think about a 300GB SharePoint library. You’re not loading that into a context window—10M tokens or not.
You need a way to query, retrieve, and embed only the relevant bits, dynamically, in real time. That’s RAG.
And if you try to load everything? You’re just introducing noise. Even a highly capable LLM can get lost in a haystack.
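At that scale the indexing side matters as much as the retrieval side: documents get split into chunks, embedded once offline, and queried on demand. A rough sketch, with the chunk size and overlap as assumptions:

```python
def chunk_document(text: str, chunk_chars: int = 2_000, overlap: int = 200) -> list[str]:
    """Split a document into overlapping chunks ready for embedding."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_chars])
        start += chunk_chars - overlap
    return chunks

# Index time (once, offline): chunk and embed the whole library.
# Query time (per question): embed the question, pull back the top few chunks,
# and send only those -- never the 300GB.
```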
Use the Right Tool for the Job
Massive context windows are great for continuity—long conversations, multi-document reasoning, complex codebases.
But they’re not a silver bullet.
The real magic happens when you combine long context windows with smart retrieval. You use RAG to narrow the scope. You use the context window to reason about what matters now.
That’s how you build enterprise-ready, AI-native applications that are actually usable and cost-effective.
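In code, that combination might look like the sketch below: retrieval narrows the candidates, and a token budget (an assumed number, well under 10M) decides how much of the long context window to actually spend. It reuses the `retrieve()` sketch from earlier.

```python
CONTEXT_BUDGET_TOKENS = 100_000  # assumed budget: generous, but far below 10M

def build_context(question: str, chunks: list[str],
                  chunk_vecs: list[list[float]]) -> str:
    """Let retrieval narrow the scope, then fill the window up to a budget."""
    selected, used = [], 0
    for chunk in retrieve(question, chunks, chunk_vecs, top_k=50):
        estimate = len(chunk) // 4  # rough chars-to-tokens conversion
        if used + estimate > CONTEXT_BUDGET_TOKENS:
            break
        selected.append(chunk)
        used += estimate
    return "\n\n".join(selected)
```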
Final Thought
Let’s not throw out good architecture just because the context window got bigger.
Context windows are growing, and that’s a good thing. But RAG isn’t going anywhere. It’s how we bridge the gap between everything we could use and just what we need.
If you want to see this broken down in a more conversational format, check out the video I recorded. I walk through examples like The Return of the King, my tweet history, and how this applies to real-world enterprise data scenarios.
Thanks for reading.
Keith Townsend is a seasoned technology leader and Chief Technology Advisor at Futurum Group, specializing in IT infrastructure, cloud technologies, and AI. With expertise spanning cloud, virtualization, networking, and storage, Keith has been a trusted partner in transforming IT operations across industries, including pharmaceuticals, manufacturing, government, software, and financial services.
Keith’s career highlights include leading global initiatives to consolidate multiple data centers, unify disparate IT operations, and modernize mission-critical platforms for “three-letter” federal agencies. His ability to align complex technology solutions with business objectives has made him a sought-after advisor for organizations navigating digital transformation.
A recognized voice in the industry, Keith combines his deep infrastructure knowledge with AI expertise to help enterprises integrate machine learning and AI-driven solutions into their IT strategies. His leadership has extended to designing scalable architectures that support advanced analytics and automation, empowering businesses to unlock new efficiencies and capabilities.
Whether guiding data center modernization, deploying AI solutions, or advising on cloud strategies, Keith brings a unique blend of technical depth and strategic insight to every project.