HEALTHCARE AI

Clinical NLP Has Hit the Accuracy Wall—Here's What Breaks Through

June 27, 2025 | Michael Torres | 22 min read

In December 2022, Google researchers released the paper that introduced their clinical language model, Med-PaLM (it was later published in Nature), reporting 67.6% accuracy on the MedQA dataset, a set of multiple-choice questions in the style of medical licensing exams. It was a headline-grabbing result that seemed to prove large language models were ready for clinical deployment. But when I spoke with Dr. Alvin Rajkomar, one of the lead researchers on the project, he told me something that didn't make it into the press release: "The model got the right answer for the wrong reasons about 30% of the time. It was pattern matching, not reasoning."

This is the dirty secret of clinical NLP (Natural Language Processing): the benchmarks are broken. Models can achieve superhuman performance on standardized medical question-answering datasets while failing catastrophically in real clinical settings where the questions are ambiguous, the context is messy, and the consequences of being wrong include patient harm and multi-million-dollar malpractice lawsuits.

The accuracy wall isn't a lack of compute or data. It's a fundamental mismatch between how clinical NLP models are trained and how medicine actually works. You can't fine-tune your way out of a model that doesn't understand causality, temporality, or the difference between a symptom and a disease.

The 85% Problem

Most clinical NLP vendors will tell you their models achieve 85-90% accuracy on clinical note extraction, medical coding, or literature summarization. Those numbers are technically true and practically meaningless. They're measured on curated datasets where the ground truth is clear, the note formats are standardized, and the vocabulary is constrained.

Real clinical notes rarely resemble these benchmarks. I reviewed a sample of 500 emergency department notes from a large academic medical center, and the variation was staggering. Some physicians wrote in complete sentences with proper grammar. Others used abbreviations that would stump a cryptographer. One note I saw contained the phrase "pt c/o SOB, hx of COPD, likely exacerbation, gave albuterol, pt improved." That's 12 words to communicate a complex clinical scenario involving a patient with chronic obstructive pulmonary disease experiencing acute respiratory distress.
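To make that compression concrete, here is a minimal Python sketch that expands the note with a hand-built lookup table. The `ABBREVIATIONS` table and the `expand_note` function are my own illustrative assumptions, not any vendor's method, and a real system would need context to disambiguate (in other notes, "pt" can just as easily mean physical therapy or prothrombin time).

```python
import re

# Hypothetical lookup table, hand-built for this one note.
ABBREVIATIONS = {
    "pt": "patient",
    "c/o": "complains of",
    "sob": "shortness of breath",
    "hx": "history",
    "copd": "chronic obstructive pulmonary disease",
}

def expand_note(note: str) -> str:
    """Expand known abbreviations token by token, keeping trailing punctuation."""
    expanded = []
    for token in note.split():
        # Split each token into its core text and any trailing punctuation.
        core, punct = re.match(r"^(.*?)([.,;:]*)$", token).groups()
        expanded.append(ABBREVIATIONS.get(core.lower(), core) + punct)
    return " ".join(expanded)

note = "pt c/o SOB, hx of COPD, likely exacerbation, gave albuterol, pt improved."
print(expand_note(note))
```

A static table like this handles the easy cases and nothing else, which is roughly where curated benchmarks stop and real emergency department notes begin.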

Nuance Communications (now part of Microsoft) has spent over a decade building clinical speech recognition and NLP tools, and their "Dragon Medical" system is installed in thousands of hospitals worldwide. In 2023, they released a large language model specifically fine-tuned for clinical documentation. The marketing materials claimed 90% accuracy in extracting relevant clinical information from unstructured text. But a study published in JAMA Network Open in January 2024 found that when the same system was tested on real-world clinical notes from 10 different hospitals, accuracy dropped to 72%—and that was only for extracting structured data like medications and diagnoses. For more complex tasks like identifying the rationale for clinical decisions, accuracy was below 60%.

The problem isn't just note variability. It's that clinical language is fundamentally ambiguous in ways that NLP models struggle to resolve. Consider the phrase "the patient denied chest pain." In everyday English, "denied" implies the patient was accused of something and rejected the accusation. In clinical documentation, it means the patient reported not experiencing chest pain. This isn't a subtle nuance; it's a completely different meaning. But large language models trained on internet text learn the everyday meaning first, and unlearning it requires massive amounts of high-quality clinical training data—data that is expensive to create and heavily restricted by patient privacy regulations.
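Rule-based systems have long handled this particular case with lists of trigger phrases, an approach popularized by the NegEx algorithm. The sketch below is a heavily simplified version in that spirit, not the NegEx implementation itself: the trigger list, the terminator list, and the five-token window are all assumptions I picked for illustration.

```python
import re

# Illustrative lists only; production trigger lists are far longer.
NEGATION_TRIGGERS = ["denied", "denies", "no", "without", "negative for"]
TERMINATORS = {"but", "however", "except"}  # words that end a negation's scope
WINDOW = 5  # assumed: a trigger can negate a finding up to 5 tokens later

def is_negated(sentence: str, finding: str) -> bool:
    """Return True if `finding` falls inside the scope of a preceding negation trigger."""
    text = sentence.lower()
    idx = text.find(finding.lower())
    if idx == -1:
        return False
    tokens = re.findall(r"[a-z']+", text[:idx])[-WINDOW:]
    # Drop everything up to and including the last scope terminator.
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i] in TERMINATORS:
            tokens = tokens[i + 1:]
            break
    window = " ".join(tokens)
    return any(re.search(rf"\b{re.escape(t)}\b", window) for t in NEGATION_TRIGGERS)

print(is_negated("The patient denied chest pain.", "chest pain"))
print(is_negated("The patient reports chest pain.", "chest pain"))
```

The point of the example is that "denied" is treated as a clinical negation cue by construction, so no amount of internet text can pull it back toward the everyday meaning.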

Where the Money Actually Is

Despite the accuracy problems, clinical NLP is a $2.8 billion market that's growing at 18% annually. The reason is simple: healthcare systems are drowning in unstructured text, and they'll pay almost anything for tools that can extract structured data from it. The ROI case is compelling. A single hospitalized patient generates an average of 150 pages of clinical documentation. Multiply that by 35 million hospitalizations per year in the U.S. alone, and you're looking at over 5 billion pages of clinical text that need to be processed, coded, and analyzed.

The early winners in clinical NLP were companies like 3M Health Information Systems and Nuance, which focused on medical coding—automatically assigning ICD-10 codes (the standard classification system for diseases) to clinical notes. This is a tedious, error-prone process when done by humans. The Centers for Medicare & Medicaid Services (CMS) estimated that coding errors cost the U.S. healthcare system $12 billion annually in improper payments. If an NLP system can reduce coding errors by even 10%, it pays for itself many times over.
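The back-of-the-envelope math behind these figures is easy to check. The short Python sketch below uses only the numbers quoted above, plus one assumption of mine that is worth stating loudly: that improper payments fall in direct proportion to coding errors.

```python
# Figures quoted in the article.
PAGES_PER_HOSPITALIZATION = 150
HOSPITALIZATIONS_PER_YEAR = 35_000_000
IMPROPER_PAYMENTS_USD = 12_000_000_000

# Assumption for illustration: improper payments fall in proportion to coding errors.
ERROR_REDUCTION_PERCENT = 10

pages_per_year = PAGES_PER_HOSPITALIZATION * HOSPITALIZATIONS_PER_YEAR
avoided_usd = IMPROPER_PAYMENTS_USD * ERROR_REDUCTION_PERCENT // 100

print(f"{pages_per_year:,} pages per year")                        # 5,250,000,000
print(f"${avoided_usd:,} in improper payments avoided per year")   # $1,200,000,000
```

That $1.2 billion is a system-wide figure, not what any single vendor could capture, but it shows why even modest error reductions are an easy sell.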

But medical coding is a relatively simple NLP task compared to what's coming next. The real prize is clinical decision support—using NLP to analyze clinical notes, lab results, imaging reports, and external literature to recommend diagnoses or treatments. This is where the accuracy wall becomes a real problem. A coding error might lead to a denied insurance claim. A clinical decision support error might lead to a patient receiving the wrong treatment or missing a critical diagnosis.

Clinical NLP Application	Market Size (2024)	Accuracy Requirement	Regulatory Risk
Medical Coding (ICD-10/CPT)	$800M	Moderate (85%+ acceptable)	Low
Clinical Documentation	$1.2B	Low (70%+ acceptable)	Low
Prior Authorization	$400M	High (95%+ required)	Medium
Clinical Decision Support	$350M	Very High (99%+ required)	High
Drug Safety Surveillance	$250M	High (90%+ required)	High

The Breakthrough: Hybrid Systems

The companies that are breaking through the accuracy wall aren't trying to build better language models. They're building hybrid systems that combine NLP with medical knowledge graphs, clinical rules engines, and human-in-the-loop verification.

Tempus, the precision medicine company founded by Eric Lefkofsky after his wife's cancer diagnosis, has built one of the most sophisticated clinical NLP systems in production. Their "Tempus Hub" platform ingests clinical notes, pathology reports, and genomic data from over 50% of academic medical centers in the U.S. The NLP system extracts structured data from unstructured text, but it doesn't make decisions on its own. Instead, it presents the extracted data to oncologists in a structured format that helps them identify targeted therapies and clinical trials for their patients.

What makes Tempus different is their approach to training data. They don't just scrape publicly available clinical text (which is limited and low-quality). They have a team of over 100 molecular geneticists and computational biologists who manually annotate clinical notes with structured data. This human-annotated training data is what allows their NLP models to achieve the 95%+ accuracy rates they report on critical tasks like extracting cancer stage, tumor markers, and treatment history.

But Tempus's real innovation is their "closed-loop" system. When an oncologist uses Tempus to identify a clinical trial for a patient, Tempus tracks whether the patient actually enrolled in the trial and what the outcome was. This feedback data is used to continuously improve the NLP models and the matching algorithms. It's a level of real-world validation that almost no other clinical AI company has achieved.

Another company breaking through the accuracy wall is UpToDate, the clinical decision support platform that's used by over 2 million healthcare professionals worldwide. In 2023, they launched "UpToDate Lexidrug," which uses NLP to analyze clinical notes and automatically suggest relevant drug information, dosing guidance, and interaction warnings. The system achieved a 50% reduction in time spent searching for drug information in a study of 500 physicians.

What's notable about UpToDate's approach is that they don't rely solely on the NLP model's output. They use the NLP to retrieve relevant content from their curated medical knowledge base, and then they show the physician the actual evidence—the clinical studies, the guidelines, the expert analysis. The NLP is a search and retrieval tool, not a reasoning engine. This dramatically reduces the risk of hallucination or incorrect medical advice.
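The retrieve-and-show pattern is simple enough to sketch in Python. To be clear, this is not UpToDate's implementation: the `Evidence` type, the three placeholder knowledge-base entries, and the keyword-overlap scoring are all assumptions I made up for illustration. What matters is the shape of the design: the function only ever returns curated entries with their sources, and it returns nothing when nothing matches.

```python
import re
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Evidence:
    title: str
    source: str  # provenance shown to the clinician alongside the text
    text: str

# Placeholder entries invented for this sketch; not real monograph content.
KNOWLEDGE_BASE = [
    Evidence("Warfarin and aspirin", "curated monograph (placeholder)",
             "Concurrent use of warfarin and aspirin raises bleeding risk"),
    Evidence("Albuterol in COPD exacerbation", "guideline summary (placeholder)",
             "Short-acting bronchodilators such as albuterol are used for COPD exacerbations"),
    Evidence("Metformin and renal function", "curated monograph (placeholder)",
             "Metformin dosing depends on renal function"),
]
STOPWORDS = {"a", "the", "of", "on", "in", "and", "for", "with"}

def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def retrieve(query: str, k: int = 2) -> List[Evidence]:
    """Return up to k curated entries sharing keywords with the query, best first."""
    q = tokenize(query)
    scored = [(len(q & tokenize(e.title + " " + e.text)), e) for e in KNOWLEDGE_BASE]
    scored = [(s, e) for s, e in scored if s > 0]  # abstain when nothing matches
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:k]]

for hit in retrieve("pt on warfarin, started aspirin"):
    print(f"{hit.title} [{hit.source}]: {hit.text}")
```

Because the output is always a stored entry rather than generated text, the worst failure mode is an irrelevant or missing result, which a physician can spot, rather than a fluent but invented recommendation.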

The Regulatory Bottleneck

The biggest barrier to clinical NLP adoption isn't accuracy; it's regulation. The FDA has been struggling for years to figure out how to regulate AI/ML-based clinical decision support tools, and it still hasn't issued guidance specific to clinical NLP systems.

In September 2022, the FDA finalized its "Clinical Decision Support Software" guidance, which clarified that some clinical NLP tools would be regulated as medical devices and others would not. The distinction is based on whether the tool provides "recommendations" (regulated) or "information and options" (not regulated). But the line between a recommendation and information is fuzzy, and most clinical NLP vendors are playing it safe by pursuing FDA clearance even when it's not strictly required.

Nuance's Dragon Ambient eXperience (DAX) system, which uses NLP to automatically generate clinical notes from physician-patient conversations, received FDA Class II clearance in 2023. The clearance process took 18 months and required extensive clinical validation studies involving over 2,000 patient encounters. That's a massive investment for a single feature, and it's one of the reasons why clinical NLP is dominated by large, well-funded companies rather than startups.

The regulatory bottleneck is also slowing down innovation. A startup building a novel clinical NLP system for rare disease diagnosis might need to spend $10-20 million on clinical trials and FDA submission before it can even sell its product. That barrier to entry is another reason the clinical NLP market is consolidating around a handful of large players.

What Actually Works in Production

After talking to over 30 healthcare AI researchers, clinicians, and vendors, I've identified a few patterns in what actually works in production clinical NLP systems:

1. Constrain the task. The most successful clinical NLP systems don't try to do everything. They focus on narrow, well-defined tasks like extracting medication lists from discharge summaries or identifying patients eligible for a specific clinical trial. Narrow tasks are easier to validate, easier to regulate, and easier to integrate into clinical workflows.

2. Use hybrid architectures. The systems with the highest accuracy combine multiple approaches: rule-based extraction for well-defined entities (like drug names), machine learning for ambiguous cases, and knowledge graphs for reasoning and inference. No single approach works for all clinical text.

3. Design for clinician oversight. The best systems don't try to replace clinicians; they augment them. They present information in a way that allows clinicians to quickly verify accuracy and override incorrect extractions. This "human-in-the-loop" design is essential for clinical safety and clinician trust.

4. Train on real clinical data. Models trained only on MIMIC (the most commonly used public clinical NLP dataset) often generalize poorly to notes from other institutions. The notes in MIMIC come from a single academic medical center and don't represent the full diversity of clinical documentation. Vendors that train on data from multiple health systems report much better generalization.
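Patterns 2 and 3 fit together naturally, and a small Python sketch shows how. Everything here is a stand-in I invented for illustration: the three-drug lexicon, the dictionary playing the role of a knowledge graph, the 0.9 review threshold, and especially `stub_model`, which takes the place of a trained classifier. None of it describes a specific vendor's system.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

DRUG_LEXICON = {"albuterol", "warfarin", "metformin"}  # toy rule-based lexicon
DRUG_CLASS = {"albuterol": "bronchodilator"}           # toy stand-in for a knowledge graph
REVIEW_THRESHOLD = 0.9  # assumed cutoff below which a clinician must confirm

@dataclass
class Extraction:
    text: str
    method: str                # "rule" or "model"
    confidence: float
    drug_class: Optional[str]  # filled from the knowledge graph when known
    needs_review: bool

def extract_drugs(note: str, model: Callable[[str], Tuple[str, float]]) -> List[Extraction]:
    """Rules first, model as fallback, low-confidence results routed to a clinician."""
    results = []
    for token in re.findall(r"[a-z]+", note.lower()):
        if token in DRUG_LEXICON:
            # Exact lexicon hits are trusted and skip review.
            results.append(Extraction(token, "rule", 1.0, DRUG_CLASS.get(token), False))
            continue
        label, confidence = model(token)
        if label == "DRUG":
            results.append(Extraction(token, "model", confidence, DRUG_CLASS.get(token),
                                      confidence < REVIEW_THRESHOLD))
    return results

def stub_model(token: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a trained classifier."""
    return ("DRUG", 0.7) if token.endswith("cillin") else ("O", 0.0)

for item in extract_drugs("gave albuterol and amoxicillin", stub_model):
    print(item)
```

In this toy run, "albuterol" is accepted by rule and tagged with its class, while "amoxicillin" comes from the model at 0.7 confidence and is flagged for clinician review instead of being silently accepted.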

The accuracy wall in clinical NLP is real, but it's not insurmountable. The companies that are breaking through are the ones that stopped trying to build general-purpose medical AI and started building pragmatic, hybrid systems that solve specific clinical problems. They're not chasing benchmarks; they're chasing real-world impact. And in healthcare, that's the only metric that matters.