RAG Alone Won't Fix Healthcare AI Hallucinations — Here's What Will
RAG helps with unstructured knowledge, but structured clinical data needs a different approach. MCP tool-use gives LLMs direct access to authoritative sources.
RAG vs MCP: Two Architectures for Grounding Healthcare AI

Vladislav Vashkevich made a sharp observation in a recent LinkedIn post: AI hallucinations cost healthcare teams up to 40% of their validation and launch time. His core argument — that the fix is system architecture, not prompt tuning — deserves more attention than it's getting. You can fine-tune prompts all day, adding guardrails, chain-of-thought reasoning, and self-verification loops. But if the model is generating clinical facts from its parametric memory, you're patching symptoms instead of fixing the root cause.
This is an architecture problem. The model doesn't need to know less or be more careful — it needs access to authoritative data at inference time. Two architectural patterns exist for this: Retrieval-Augmented Generation (RAG) and tool-use via the Model Context Protocol (MCP). Both ground the model in external knowledge. They do it in fundamentally different ways, and each has a distinct sweet spot.
Most teams building healthcare AI are familiar with RAG. Far fewer have explored MCP tool-use, and almost nobody is talking about when to use which. That gap matters, because choosing the wrong grounding architecture for your data type is one of the fastest ways to ship a system that hallucinates with confidence.
RAG: Strong for Unstructured Knowledge, Weak for Structured Data
RAG works by embedding your knowledge base into a vector store, then retrieving the most semantically relevant chunks at query time and injecting them into the model's context window. For unstructured knowledge — clinical practice guidelines, systematic reviews, formulary policy documents — this is genuinely powerful. The model gets relevant context it wouldn't otherwise have, and the retrieval step means it's working from your curated sources rather than its training data.
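As a minimal sketch of that pipeline, here is the retrieve-and-inject step with a toy bag-of-words counter standing in for a real embedding model. The corpus, query, and scoring are illustrative assumptions, not a production setup:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

corpus = [
    "Guideline: hypertension management in elderly patients",
    "Formulary policy: statin coverage criteria for primary prevention",
    "Systematic review: anticoagulation after ischemic stroke",
]
index = [(chunk, embed(chunk)) for chunk in corpus]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank chunks by cosine similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# The retrieved chunks get injected into the model's context window.
context = retrieve("managing hypertension in older adults")
prompt = "Answer using only this context:\n" + "\n".join(context)
```

For semantic queries like this one, nearest-neighbor ranking is exactly the right behavior; the trouble starts when the query is an exact identifier rather than a topic.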
The problems show up when you point RAG at structured clinical data.
Approximate matching is the wrong paradigm for exact identifiers. Vector similarity search finds semantically nearby content. That's great for "find me guidelines related to hypertension management in elderly patients." It's terrible for "look up NDC 58151-155." An NDC code isn't semantically similar to anything — it's an exact identifier that maps to exactly one product. Nearest-neighbor search over embeddings can return a close-but-wrong code, and in drug data, close-but-wrong is dangerous.
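A toy illustration of that failure mode, using a string fuzzy matcher as a stand-in for vector similarity. The neighboring codes in the index are made up for the example:

```python
import difflib

# Hypothetical index of known codes; the queried code itself is absent.
known_codes = ["58151-156", "58151-255", "68151-155"]
query = "58151-155"

# Approximate matching happily returns a near-miss neighbor...
match = difflib.get_close_matches(query, known_codes, n=1)

# ...while exact-lookup semantics reject the query loudly.
exact_hit = query in known_codes  # False
```

In drug data, that near-miss neighbor is a different product. An exact lookup that fails loudly is the safer behavior.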
Stale embeddings create silent failures. The FDA's NDC Directory, CMS fee schedules, and SNOMED CT are updated on their own cadences — some daily, some quarterly. Re-embedding a large drug database is computationally expensive and operationally complex. Most teams batch this process, which means there's always a lag between when data changes and when your vector store reflects it. In healthcare, that lag can mean returning a discontinued drug or missing a new black box warning.
Retrieved chunks lose provenance. When you chunk a document and embed it, the relationship between a specific fact and its authoritative source gets muddled. You might retrieve a paragraph that mentions a drug interaction, but the embedding doesn't carry metadata about which FDA label revision it came from or when it was last verified. For clinical audit trails, this is a serious gap.
The interpretation layer reintroduces hallucination risk. Even with perfect retrieval, the LLM still has to read the retrieved text chunks and synthesize an answer. This is the last-mile hallucination problem — the model might misinterpret a negation, conflate two retrieved passages, or fill gaps between chunks with plausible-sounding but fabricated details. You've reduced hallucination surface, but you haven't eliminated it.
MCP Tool-Use: Direct Access to Authoritative Sources
The Model Context Protocol (MCP) is an open standard that gives LLMs direct access to external tools. Instead of searching a pre-built index, the model decides which tool to call and with what parameters, then receives structured data back from the authoritative source. Think of it as the difference between searching a photocopy of a reference book and calling the reference desk directly.
The mechanics are straightforward. The LLM receives a tool manifest describing available functions, their parameters, and what they return. When a query requires external data, the model generates a tool call. The MCP server executes that call against the real data source and returns structured results. The model then uses those results — not its own knowledge — to formulate the response.
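A stripped-down sketch of that loop, with a hypothetical in-memory table standing in for the MCP server's live data source. The manifest shape is simplified (real MCP tool definitions use JSON-RPC with JSON Schema parameters), and the record contents are illustrative:

```python
import json

# 1. Tool manifest the model sees: name, parameters, what it returns.
MANIFEST = [{
    "name": "ndc_get",
    "description": "Look up an FDA NDC Directory record by exact NDC code.",
    "parameters": {"ndc": "string, exact NDC identifier"},
}]

# 2. Stand-in for the server's authoritative data source.
NDC_TABLE = {
    "58151-155": {"generic_name": "atorvastatin calcium", "brand_name": "Lipitor"},
}

def execute_tool(name: str, args: dict) -> str:
    # 3. The server executes the call and returns structured JSON,
    #    or an explicit error — never a plausible-sounding guess.
    if name == "ndc_get":
        record = NDC_TABLE.get(args["ndc"])
        return json.dumps(record if record else {"error": "NDC not found"})
    return json.dumps({"error": f"unknown tool {name}"})

# 4. The model generates a tool call, then formats the structured result.
tool_call = {"name": "ndc_get", "arguments": {"ndc": "58151-155"}}
result = json.loads(execute_tool(tool_call["name"], tool_call["arguments"]))
```

The key property is step 3: a missing record comes back as an explicit error, so the model has nothing to confabulate around.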
For structured clinical data, the advantages are significant.
Exact lookups replace approximate matching. A call to ndc_get("58151-155") returns the precise FDA record for that identifier. There's no similarity threshold to tune, no risk of returning a neighboring code. The query either matches or it doesn't, which is exactly the semantics you want for identifiers, codes, and structured reference data.
Queries hit live data sources. There's no embedding pipeline to maintain, no batch reprocessing window, no staleness gap. When CMS updates a fee schedule or the FDA issues a new safety communication, the next tool call reflects that change. The freshness of your data is bounded by the source's update frequency, not your ETL pipeline's.
Every response carries full provenance. Tool responses include structured metadata — source authority, effective dates, version identifiers, direct citations. This isn't bolted on after the fact; it's intrinsic to the response format. When a clinician or compliance officer asks "where did this come from," the answer is specific and verifiable.
Structured responses eliminate the interpretation layer. The model receives JSON with named fields, not paragraphs of text it needs to parse. generic_name is a field in the response, not a fact the model extracts from prose. This collapses the hallucination surface for the structured data portion of the answer to essentially zero.
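The difference is easy to see side by side. Here a deliberately brittle regex stands in for the LLM's extraction step, and the prose snippet is illustrative:

```python
import json
import re

# A retrieved RAG chunk (free text) vs. a tool response (named fields).
prose_chunk = "LIPITOR (atorvastatin calcium) tablets, for oral use."
tool_response = json.loads(
    '{"brand_name": "Lipitor", "generic_name": "atorvastatin calcium"}'
)

# RAG-style: something must *extract* the fact from prose. The regex
# stands in for the LLM's interpretation step, and is similarly brittle.
match = re.search(r"\(([^)]+)\)", prose_chunk)
extracted = match.group(1) if match else None

# Tool-style: the fact is already a named field; nothing to interpret.
field = tool_response["generic_name"]
```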
Same Question, Two Architectures
Let's make this concrete. A clinical decision support system receives the query: "What's the generic name and black box warnings for NDC 58151-155?"
The RAG path:
- The FDA drug database has been chunked, embedded, and loaded into a vector store
- The user query gets embedded into the same vector space
- Vector similarity search retrieves the top-k most relevant chunks
- Those chunks — fragments of FDA label text, possibly from multiple documents — get injected into the model's context
- The LLM reads the retrieved text, identifies the relevant information, and synthesizes an answer
- You hope it correctly extracted the generic name, didn't confuse label sections, and accurately represented the warnings
At each step, there's potential for information loss or distortion. The embedding might not prioritize the NDC code as a key identifier. The retrieved chunks might include partial label sections. The synthesis step might paraphrase a warning in a way that changes its clinical meaning.
The MCP path:
```
// Step 1: LLM calls the NDC lookup tool
ndc_get("58151-155")

// Response:
{
  "ndc": "58151-155",
  "brand_name": "Lipitor",
  "generic_name": "atorvastatin calcium",
  "labeler_name": "Viatris Specialty LLC",
  "dosage_form": "TABLET, FILM COATED",
  "route": ["ORAL"],
  "strength": "ATORVASTATIN CALCIUM TRIHYDRATE 10 mg/1",
  "rxcui": ["259255", "617310"],
  "meta": {
    "source_name": "FDA NDC Directory",
    "citation": "FDA NDC Directory. Accessed 2026-02-23 via FHIRfly."
  }
}

// Step 2: LLM calls the FDA label safety tool
fda_label_safety("58151-155")

// Response:
{
  "brand_name": ["Lipitor"],
  "generic_name": ["ATORVASTATIN CALCIUM"],
  "effective_time": "20240415",
  "sections": {
    "contraindications": [
      "Acute liver failure or decompensated cirrhosis...",
      "Hypersensitivity to atorvastatin or any excipient in LIPITOR..."
    ],
    "warnings_and_cautions": [
      "Myopathy and Rhabdomyolysis: Risk factors include age 65 years or greater...",
      "Immune-Mediated Necrotizing Myopathy (IMNM)...",
      "Hepatic Dysfunction: Increases in serum transaminases have occurred..."
    ]
  },
  "meta": {
    "source_name": "FDA DailyMed SPL",
    "set_id": "a60cc18b-0631-4cf0-b021-9f52224ece65",
    "version": "7"
  }
}
```
The model now formats its response from authoritative structured data. The generic name isn't interpreted from a text chunk — it's a field value from the FDA's own database. The absence of a boxed warning is an explicit null, not something the model inferred from not finding warning text in retrieved passages. Every fact is traceable to a specific source and version.
The hallucination surface for the structured data portion of this response is zero. The model isn't generating clinical facts; it's formatting them.
The Real Answer: Use Both
RAG and MCP tool-use aren't competing approaches — they solve different parts of the grounding problem. The choice depends on the nature of the data you're working with.
| Data Type | Best Approach | Why |
|---|---|---|
| Clinical guidelines, protocols | RAG | Unstructured text, needs semantic search |
| Research papers, evidence | RAG | Long-form, needs relevance ranking |
| Drug codes, identifiers | MCP tools | Exact lookup, changes frequently |
| Provider directories | MCP tools | Exact lookup, real-time data |
| Billing/compliance rules | MCP tools | Structured rules, needs precision |
| Patient notes | RAG | Unstructured, facility-specific |
The pattern is consistent: unstructured knowledge that benefits from semantic similarity search belongs in a RAG pipeline. Structured reference data with exact identifiers, frequent updates, and auditability requirements belongs behind MCP tools.
The strongest production architecture layers both. A clinical AI assistant might use RAG to retrieve relevant treatment guidelines and research evidence, then use MCP tools to verify drug codes, check for interactions against the live FDA database, and validate billing codes against current CMS schedules. The RAG layer provides clinical context and reasoning support. The tool-use layer provides factual precision and provenance.
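One way to implement the split is a thin router in front of both layers. The identifier patterns and return labels below are illustrative assumptions, not a production classifier:

```python
import re

# NDC codes: 4-5 digit labeler, 3-4 digit product, optional 1-2 digit package.
NDC_RE = re.compile(r"\b\d{4,5}-\d{3,4}(?:-\d{1,2})?\b")
# HCPCS Level II codes: one letter followed by four digits.
HCPCS_RE = re.compile(r"\b[A-Z]\d{4}\b")

def route(query: str) -> str:
    """Send exact-identifier queries to MCP tools, everything else to RAG."""
    if NDC_RE.search(query) or HCPCS_RE.search(query):
        return "mcp_tools"  # exact lookup against the live authoritative source
    return "rag"            # semantic retrieval over the guideline corpus
```

A real system would route many queries to both layers (guidelines for context, tools for verification), but even this crude split keeps exact identifiers away from approximate retrieval.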
This layered approach also maps cleanly to validation requirements. RAG-sourced information can be flagged as "retrieved from [source corpus], verify with authoritative reference." Tool-sourced information carries its own provenance and can be presented with higher confidence. The system can be transparent about where each piece of information came from and how it was obtained.
FHIRfly MCP: 28 Tools for Clinical Data
Building MCP tool integrations from scratch means normalizing across multiple federal data sources, handling rate limits and downtime, mapping between terminology systems, and keeping everything current. FHIRfly MCP provides this as a ready-made layer.
The server exposes 28 tools across five categories: terminology lookup, semantic search, FDA label retrieval, cross-terminology mapping, and claims validation. These tools draw from 12 authoritative data sources including the FDA NDC Directory, CMS HCPCS and fee schedules, NIH/NLM's RxNorm and SNOMED CT, and CDC vaccine databases.
Setup with Claude Desktop takes about two minutes — add the server configuration, and the model immediately has access to the full tool suite. No vector store to build, no embeddings to maintain, no chunking strategy to optimize.
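For reference, Claude Desktop reads MCP servers from its claude_desktop_config.json file. The entry below is a sketch of the general shape; the server key, command, and package name for FHIRfly are assumptions to be checked against the FHIRfly documentation:

```json
{
  "mcpServers": {
    "fhirfly": {
      "command": "npx",
      "args": ["-y", "fhirfly-mcp"]
    }
  }
}
```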
In practice, the tools handle the queries that structured clinical data demands. A drug safety check calls ndc_get followed by fda_label_safety and returns FDA-sourced warnings with full citation metadata. A claims compliance workflow calls hcpcs_get to validate procedure codes against current CMS definitions. A cross-terminology mapping calls rxnorm_get and walks the concept relationships to connect NDC codes to RxCUIs to ingredient-level classifications — all from NLM's authoritative data.
Each tool response includes source attribution, effective dates, and version identifiers. The provenance chain is built into the protocol, not added as an afterthought.
Key Takeaways
- Architecture matters more than prompt engineering for reducing hallucinations. System design determines the ceiling on accuracy; prompt tuning optimizes within that ceiling.
- RAG and MCP tool-use solve fundamentally different parts of the grounding problem. Treating them as interchangeable leads to using the wrong tool for the job.
- Structured clinical data should come from authoritative APIs, not vector stores. Drug codes, billing identifiers, and safety data need exact lookup, real-time freshness, and full provenance — none of which are strengths of embedding-based retrieval.
- FHIRfly MCP provides 28 ready-made tools for the most common clinical data needs, backed by 12 federal data sources, with provenance built into every response.
For implementation details and the full tool reference, see the FHIRfly MCP documentation.