What I Learned from Building a RAG-Based AI on My Own Work — And the Architectural Crossroads It Revealed

Published On: August 20, 2025

Over the past few weeks, I’ve been experimenting with a Retrieval-Augmented Generation (RAG) system built on OpenAI’s platform. The idea was simple: create an AI assistant that could answer enterprise IT questions using only my published content—blogs, podcasts, and interviews. A way to scale The CTO Advisor without losing fidelity.

I went in with strict rules. I uploaded my corpus as structured files and told the model, in no uncertain terms:

👉 “Only use the information provided in these documents.”

That’s where things got interesting.

The results were useful, sometimes even insightful—but they also exposed two fatal flaws:

  • Instructional Amnesia – The model frequently ignored my boundary. Instead of staying inside my uploaded sources, it pulled from its general training data. Sometimes it even contradicted the documents I provided. If it can’t obey a rule that simple, you can’t trust it to respect governance or compliance rules in production. 
  • Confident Fabrications – Worse, it invented things: entire frameworks I’ve never written about, and misquoted arguments from my own content—all delivered with total authority. Imagine this happening with compliance manuals or HR policies: the AI effectively rewrites policy on the fly. 

This wasn’t just friction. It was a control problem. And it raised bigger architectural questions about where these systems should really live.

I made a modified version available for you to play around with. It’s what I use to create initial blog post drafts.

🔍 Retrieval Isn’t Enough

Even when the system did follow instructions, retrieval alone wasn’t the silver bullet.

I grounded the GPT with a clean NDJSON file of my content and enforced constraints:

  • Minimize hallucinations and adhere strictly to provided sources
  • Only respond from verified material
  • Cite exact titles and URLs 
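
To make that concrete, here is a minimal sketch of what the setup looked like. The field names and URLs are illustrative, not my exact schema, and the system prompt is a paraphrase of the constraints above:

```python
# Minimal sketch: one NDJSON record per published piece, plus the grounding
# instructions expressed as a system prompt. Field names and values are
# illustrative, not the exact schema I used.
import json

record = {
    "title": "Example: Broadcom, VMware, and the KVM Question",  # hypothetical title
    "url": "https://example.com/broadcom-vmware-kvm",            # hypothetical URL
    "type": "blog",
    "published": "2024-01-15",
    "body": "Full text of the article goes here...",
}

with open("corpus.ndjson", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")  # NDJSON: one JSON object per line

SYSTEM_PROMPT = """You are an assistant grounded ONLY in the attached corpus.
- Answer strictly from the provided documents; if the answer is not there, say so.
- Do not draw on general training knowledge.
- Cite the exact title and URL of every source you use."""
```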

This setup reduced fluff, but it didn’t eliminate issues. Retrieval mechanisms—especially those relying on vector search—are good at finding semantically similar text. But they often misfire when the strategic context doesn’t match exact phrasing.

For example, I asked: “What are some alternatives to VMware?”

The model returned a technically correct answer. But it missed one of my cornerstone analyses—the strategic fallout of the Broadcom acquisition and the rising viability of KVM. Why? Because my article didn’t literally use the phrase “alternatives to VMware.” It discussed the broader ecosystem implications. The model prioritized literal phrasing over nuance.

That’s not just a retrieval miss—it’s a business risk when precision is your differentiator.
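
You can see the failure mode yourself with a rough, directional demo. This sketch uses the open-source sentence-transformers library rather than whatever embeddings OpenAI uses under the hood, and the passages are invented, so treat it as an illustration of the phrasing gap, not a reproduction of my results:

```python
# Rough illustration of the phrasing gap: a strategically relevant passage can
# score lower than a literal one. Uses the open-source sentence-transformers
# library; OpenAI's retrieval stack uses its own embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What are some alternatives to VMware?"
passages = [
    "Here is a list of alternatives to VMware for server virtualization.",   # literal match
    "Broadcom's acquisition reshapes licensing economics, making KVM-based "
    "stacks a strategically viable escape path for many enterprises.",       # strategic match
]

scores = util.cos_sim(model.encode(query), model.encode(passages))[0]
for passage, score in zip(passages, scores):
    print(f"{float(score):.3f}  {passage[:60]}...")
# The literal phrasing typically scores higher, even though the second passage
# is the analysis a reader actually needs.
```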

⚖️ Fine-Tuning: A Tempting, But Complex Upgrade

I considered training a model to emulate how I think—my reasoning patterns around AI infrastructure tradeoffs, repatriation narratives, and multicloud strategy.

Here’s the tradeoff matrix:

| Tradeoff | Retrieval (what I used) | Fine-Tuning |
|---|---|---|
| Setup time | Minutes | Hours to prep + train |
| Cost | Free (in ChatGPT) | Training + inference fees |
| Hallucination risk | Low with good data | Lower—but still needs validation |
| Style fidelity | Instruction-based | Emulated tone & structure |
| Corpus updates | Upload & go | Retrain required |
| Ease of deployment | End-user level | Custom deployment/complex |

Fine-tuning could help the model not just mimic my tone, but emulate how I structure arguments and synthesize complex tradeoffs. But it adds complexity: infrastructure, validation, dev time, and application-layer integration.
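
If I went down that road, the prep work would look roughly like this: hundreds of curated question-and-answer pairs written in my voice, expressed in the chat-style JSONL format OpenAI’s fine-tuning API expects. The content below is invented for illustration:

```python
# Sketch of fine-tuning prep: each training example pairs a question with an
# answer written in my voice and argument structure, in the chat-style JSONL
# format OpenAI's fine-tuning API expects. Content is illustrative.
import json

example = {
    "messages": [
        {"role": "system", "content": "You are The CTO Advisor: pragmatic, infrastructure-first, tradeoff-driven."},
        {"role": "user", "content": "Should we repatriate our VMware workloads after the Broadcom deal?"},
        {"role": "assistant", "content": "Start with the licensing economics, then the operational tradeoffs..."},
    ]
}

with open("training.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```

Every one of those examples has to be authored, reviewed, and versioned—which is exactly the prep-and-validation cost captured in the table above.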

And that brings up a bigger decision: where does this thing live?

📏 The Architectural Crossroads: Local Control vs. Integrated Cloud Intelligence

My experiment is currently hosted on OpenAI’s infrastructure—great for rapid prototyping. But once you start thinking like an architect about deploying this at enterprise scale, the logic for moving the model to local infrastructure becomes undeniable.

Imagine hosting this on dedicated AI hardware (like an NVIDIA DGX system), using an open-source framework like Ollama (a popular tool for running models locally), and a well-scoped 20-billion-parameter model (powerful enough for expert tasks without the overhead of massive frontier models). The case is strong:

  • Cost control: Token-based inference adds up quickly. A local model avoids the unpredictability of usage-based billing, especially as context windows expand.
  • Privacy & control: In enterprise settings with NDA-protected data, local hosting ensures full control over data handling and auditability.
  • Performance: Properly optimized local infrastructure can deliver comparable latency with greater customizability.
  • Domain alignment: This isn’t a generalist chatbot. It’s a focused expert system. A 20B model can handle that without excessive compute requirements.
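
As a sketch of how lightweight the serving side can be, here is what local inference might look like against Ollama’s REST API (default port 11434). The model tag is a placeholder for whichever ~20B open-weight model you pull, and the grounding prompt mirrors the constraints from my hosted experiment:

```python
# Minimal sketch of local inference against Ollama's REST API.
# "my-20b-advisor" is a placeholder tag for a pulled open-weight model.
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"

def ask_local_advisor(question: str, context_chunks: list[str]) -> str:
    payload = {
        "model": "my-20b-advisor",  # placeholder model tag
        "stream": False,
        "messages": [
            {"role": "system", "content": "Answer only from the provided excerpts. Cite titles and URLs."},
            {"role": "user", "content": "Excerpts:\n" + "\n---\n".join(context_chunks) + f"\n\nQuestion: {question}"},
        ],
    }
    resp = requests.post(OLLAMA_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["message"]["content"]
```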

My initial experiment points towards two distinct architectural paths forward, each with its own philosophy:

Path A: The Local-First Approach for Ultimate Control. This path, championed by infrastructure providers like Dell & HPE, prioritizes privacy, cost predictability, and granular control by bringing models on-prem. Using dedicated AI hardware and open-source frameworks like Ollama gives an organization full sovereignty over its AI stack. VMware’s Private AI initiative—especially when deployed on VMware Cloud Foundation (VCF) in partnership with NVIDIA and Dell—represents this strategy in action. It brings AI inferencing directly to environments already standardized on vSphere, offering enterprises a path to integrate LLMs into their existing infrastructure with confidence. For many high-trust workloads, this is a powerful and increasingly viable strategy.

Path B: The Integrated Cloud Platform for Advanced Intelligence. This is where a platform like Google Cloud’s Vertex AI presents a compelling alternative. Instead of treating retrieval as a simple file lookup, Vertex AI is engineered to tap into an enterprise’s entire data estate. Its integrated Enterprise Search and Vector Search capabilities are designed to understand context across diverse sources—databases, documents, and applications.

This approach could potentially resolve the very strategic-vs-literal retrieval issue I encountered with the VMware query. The hypothesis is that a platform deeply integrated with the data layer can deliver superior contextual relevance, bypassing the limitations of my initial, simpler setup. Similarly, AWS offers a modular path with Bedrock and SageMaker, giving teams the flexibility to build their own sophisticated RAG pipelines using a marketplace of models.

The fundamental tradeoff is clear: do you prioritize the self-contained control of local models, or do you bet on the integrated intelligence of a data-centric cloud platform to solve the hardest retrieval problems at scale?

Update: Since writing, I have chosen this path and will be releasing it to a larger audience for testing. 

👥 Beyond the Architect: What This Means for Talent and Teams

It’s one thing for a single architect to build a functional RAG system grounded in personal content. But scaling this approach across an enterprise requires a deliberate shift in how we think about roles, skills, and organizational structure.

The CIO’s Talent Mandate

As enterprises move from experimentation to production with LLMs, CIOs face a new kind of talent challenge—building interdisciplinary teams that blend:

  • AI platform engineers to manage infrastructure, model hosting (cloud or local), and GPU orchestration
  • Data engineers to curate, tag, and update the corpus used in RAG pipelines 
  • Prompt and retrieval specialists who understand how to optimize prompts, search relevance, and embeddings
  • Domain experts embedded into the loop to validate outputs and shape model behavior over time

This isn’t a pure AI/ML play. It’s a socio-technical system, and it demands tight alignment between IT, business, and content stakeholders. 

Note: My friends at Kamiwaza would note that having data engineers doesn’t mean you have to massage the data before ingesting it; that should be a function of your RAG pipeline, with a heavy assist from AI.

From Experiment to Enterprise Playbook

What works for an individual (like me, with a well-defined personal corpus) doesn’t scale without abstraction and repeatability. Enterprises need:

  • Governance models for corpus access, updates, and validation
  • CI/CD for knowledge—treating corpus curation like software lifecycle management
  • Hybrid architecture patterns that account for cloud + edge + on-prem AI workloads
  • Training and upskilling pathways for existing staff—especially infra, data, and content roles
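
To make “CI/CD for knowledge” concrete, here is a deliberately simple example of a validation gate that could run before any corpus update ships to the RAG pipeline. The required fields follow the hypothetical NDJSON schema sketched earlier in this post:

```python
# Toy corpus validation gate: run in CI before an updated NDJSON corpus is
# published to the RAG pipeline. Fields follow the hypothetical schema above.
import json
import sys

REQUIRED_FIELDS = {"title", "url", "type", "published", "body"}

def validate_corpus(path: str) -> int:
    errors = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                print(f"line {lineno}: not valid JSON")
                errors += 1
                continue
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                print(f"line {lineno}: missing fields {sorted(missing)}")
                errors += 1
    return errors

if __name__ == "__main__":
    sys.exit(1 if validate_corpus("corpus.ndjson") else 0)
```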

🧠 Takeaways for IT Leaders Testing GPTs

If you’re a CIO, enterprise architect, or infrastructure leader exploring AI integration, here’s what I’ve learned:

  • Metadata matters: Clean, well-structured, and phrase-aligned content is critical for accurate retrieval.
  • Citations break easily: Loose links or inconsistent titles result in missed insights.
  • RAG + fine-tuning = better scale: Hybrid strategies balance flexibility with depth.
  • Local isn’t fringe anymore: For high-trust, narrow-domain applications, local LLMs are no longer experimental—they’re strategic.
  • Local isn’t just about hardware; it’s about talent: Successfully deploying local models requires a new, interdisciplinary team structure blending platform, data, and domain expertise.
  • Advanced RAG is the real prize: The next frontier isn’t just a choice between local or cloud, but a test of which architecture—a controlled local model or a sophisticated, data-aware cloud platform—can best solve complex contextual retrieval.

We’re approaching a future where every serious IT advisory firm or enterprise CoE will need to operate its own AI stack. The drivers are clear: economics, compliance, and strategic control.

My initial experiment makes a compelling case for local models on the grounds of control. But it leaves a crucial, tantalizing question unanswered: can an enterprise-grade cloud AI platform solve the deep contextual retrieval problem that simple RAG cannot?

The next logical step is to run a head-to-head comparison. Testing this same ‘CTO Advisor’ workload on a platform like Google Cloud’s Vertex AI is the only way to determine which architecture truly delivers the highest fidelity at enterprise scale. That’s the experiment I’m designing next.

If you’re on a similar journey and want to shortcut the learning curve, I’m happy to share more.

Of course, if you need help on your journey, please reach out to [email protected]
