Lessons learned building natural language processing systems in health care

NLP systems in health care are hard—they require broad general and medical knowledge, must handle a large variety of inputs, and need to understand context.

By David Talby
March 7, 2019

We’re in an exciting decade for natural language processing (NLP). Computers will get as good as humans in complex tasks like reading comprehension, language translation, and creative writing. Language understanding benefits from every part of the fast-improving ABC of software: AI (freely available deep learning libraries like PyText and language models like BERT), big data (Hadoop, Spark, and Spark NLP), and cloud (GPUs on demand and NLP-as-a-service from all the major cloud providers).

In health care, several applications have already moved from science fiction to reality. AI systems have passed the medical licensing exams in both China and England—doing better than the average doctor. A new system diagnoses 55 pediatric conditions better than junior doctors do. These systems are harder to build than some of the first deep learning applications in computer vision (which analyze a single image): they require broader general and medical knowledge, must handle a wider variety of inputs, and must understand context.


I’ve been lucky to be involved in building NLP systems in health care for the past seven years. The goal of this article is to share key lessons I learned along the way to help you build similar systems faster and better.

Meet the language of emergency room triage notes

Many people, myself included, make the mistake of assuming that clinical notes in the U.S. are written in English. It’s an easy mistake to make, because that’s how doctors answer if you ask them what language they use. However, consider these three de-identified triage notes from emergency room visits:

Triage Notes

  • states started last night, upper abd, took alka seltzer approx 0500, no relief. nausea no vomiting
  • Since yesterday 10/10 “constant” Tylenol 1 hr ago. +nausea. diaphoretic. Mid abd radiates to back
  • Generalized abd radiating to lower x 3 days accompanied by dark stools. Now with bloody stool this am. Denies dizzy, sob, fatigue.

Most people without a medical education do not understand the meaning of these typical sentences. Here are a few things to note:

  • None of these sentences are grammatically correct in English.
  • None of them use the words “patient” or “pain.” They don’t have a subject.
  • They use a lot of jargon: 10/10 refers to the intensity of pain. “Generalized abd radiating to lower” refers to general abdominal (stomach) pain that radiates to the lower back.

ER doctors I’ve shown these notes to, though, consider them useful—they’re concise and focus on what matters. To them, these are common, not “bad,” examples of ER triage notes.

Yes, emergency rooms have their own language

As a philosopher or linguist, you might argue that this still does not constitute a “different language” in the typical sense of the word. However, if you’re a data scientist or NLP practitioner, there shouldn’t be any doubt that it is:

  • It has a different vocabulary. The Unified Medical Language System (UMLS) includes more than 200 vocabularies for English alone, covering more than three million terms. In contrast, the 1989 Oxford English Dictionary had 171,476 words (a figure you would have to roughly triple to include the derivatives that UMLS lists directly).
  • It has a different grammar. The text has its own definition of what sentences are and what parts of speech are. Statements like “+nausea” and “since yesterday 10/10” are grammatical structures that don’t exist anywhere else.
  • It has different semantics. “Sob” means “shortness of breath” (and not the other meaning you had in mind). “Denies” means the patient says they don’t have the symptom, although the clinician thinks they might.
  • It goes beyond jargon. Jargon refers to the 100-200 new words you learn in the first month after you join a new school or workplace. In contrast, understanding health care language takes people as long as it takes to master day-to-day Italian or Portuguese.

Lesson #1: Off-the-shelf NLP models don’t work

In practice, off-the-shelf NLP libraries and algorithms built for English fail miserably on this “different language” of the health care industry. Not only do named entity recognition and entity resolution models fail, but even basic tasks such as tokenization, part-of-speech tagging, and sentence segmentation fail for the majority of sentences.

If you don’t believe me, feel free to test it yourself with the six popular NLP cloud services and libraries listed below. All but Amazon Comprehend provide a web user interface, so you can copy and paste sentences and see how each service analyzes them:

  1. Google Cloud Natural Language
  2. IBM Watson NLU
  3. Azure Text Analytics
  4. spaCy Named Entity Visualizer
  5. Amazon Comprehend (offline)
  6. Stanford Core NLP

In a test run in December 2018, the only medical term any of the six engines recognized was Tylenol, which just two of them tagged—and only as a product.
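If you would rather script this test than paste text into web UIs, here is a minimal sketch using spaCy, one of the libraries above. It assumes the general-purpose en_core_web_sm English model is installed, and simply prints how that pipeline segments, tags, and recognizes entities in a triage note:

```python
import spacy

# General-purpose English model, not trained on clinical text
# (assumes `python -m spacy download en_core_web_sm` has been run).
nlp = spacy.load("en_core_web_sm")

note = ("Generalized abd radiating to lower x 3 days accompanied by "
        "dark stools. Now with bloody stool this am. Denies dizzy, sob, fatigue.")
doc = nlp(note)

# See where the pipeline thinks sentences begin and end...
print([sent.text for sent in doc.sents])
# ...which parts of speech it assigns...
print([(token.text, token.pos_) for token in doc])
# ...and which entities (if any) it recognizes.
print([(ent.text, ent.label_) for ent in doc.ents])
```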

Health care has hundreds of languages

The next mistake I made, like many others, was building models that “solve health care.” Amazon’s Comprehend Medical is now taking this approach with a universal medical-NLP-as-a-service. This assumes that health care is one language. In reality, every sub-specialty and form of communication is fundamentally different. Here’s a handful of de-identified examples:

Need more examples? Take some time to learn about deciphering your lab reports. Or consider that medical students starting a specialty in dermatology need to master the aptly titled “Dermatology: learning the language.” Even identifying patient smoking status from medical discharge records is complex enough to be an active area of academic research.

Then, there are many variants within each medical specialty. For example, deciding whether or not to approve a pre-authorization request for an MRI versus, say, an implantable spinal cord stimulator requires extracting completely different items from the pre-authorization forms. As another example, within pathology, different terms are used to discuss different types of cancer. This has a real-world impact: the company I work for is undertaking a project that requires training separate NLP models for extracting facts about lung, breast, and colon cancer from pathology reports.

Amazon’s Comprehend Medical has, so far, only focused on normalizing medication values (see that last “aspirin” example in the above table). The service also comes with standard medical named entity recognition—which doesn’t address any specific application’s needs. Please do not take my word for it—try it yourself on the examples above or on your own text. Such NLP services are mostly used nowadays as a means to attract customers into professional services engagements. Other companies like 3M and Nuance that sell “health care NLP” are more up front about this in their marketing.
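To try Comprehend Medical programmatically rather than through the console, a minimal sketch with boto3 might look like the following (it assumes you have AWS credentials configured and the service available in your region):

```python
import boto3

# Sketch only: assumes AWS credentials are configured and that
# Comprehend Medical is available in us-east-1.
client = boto3.client("comprehendmedical", region_name="us-east-1")

response = client.detect_entities(
    Text="Since yesterday 10/10 constant Tylenol 1 hr ago. +nausea. diaphoretic."
)
for entity in response["Entities"]:
    print(entity["Text"], entity["Category"], entity["Type"], entity["Score"])
```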

Lesson #2: Build trainable NLP pipelines

If you need to build NLP systems in health care yourself, you’ll need to train NLP models that are specific to the application you’re building. This doesn’t mean you cannot reuse existing software—there is a lot you can reuse:

Reuse:

  • Medical terminologies
  • Medical embeddings
  • Neural network graphs
  • NLP pipeline APIs
  • Training & inference framework

To build:

  • What medications is this patient taking?
  • Does this patient require a chest CT scan?
  • What’s the right E/M billing code for this visit?
  • Has this patient been pregnant before?
  • Do they have known allergies?

When we built Spark NLP for Healthcare—an extension of the open source NLP library for Apache Spark—the goal was to provide as many reusable out-of-the-box components as possible. These include, for example, production-grade implementations of the state-of-the-art academic papers for clinical named entity recognition and de-identification, biomedical entity normalization, and assertion status (i.e., negation) detection. Using these implementations doesn’t require learning to use TensorFlow (or any other framework), since the deep learning framework is embedded in the library under easy-to-use Python, Java, and Scala APIs. The library itself is a native extension of Spark ML and reuses its Pipeline class for building, serializing, and extending NLP, ML, and DL flows.
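As an illustration of that pipeline reuse, here is a minimal sketch with the open source Spark NLP Python API. It uses only freely available components (the clinical models in Spark NLP for Healthcare are licensed separately) and builds a standard Spark ML Pipeline:

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Each annotator reads and writes named columns, like any Spark ML stage.
document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentences = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokens = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

# Spark NLP reuses Spark ML's Pipeline for building and serializing flows.
pipeline = Pipeline(stages=[document, sentences, tokens])

data = spark.createDataFrame(
    [["upper abd, took alka seltzer approx 0500, no relief. nausea no vomiting"]],
    ["text"],
)
model = pipeline.fit(data)
model.transform(data).selectExpr("token.result").show(truncate=False)
```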

Making this library perform in real-world projects taught us a lot about just how different “health care languages” are from human ones. Here are some of the things we had to build:

  • Deep learning-based sentence segmentation. While splitting sentences in Wikipedia articles often can be done just using regular expressions, handling multi-page clinical documents was a bigger challenge. In particular, the algorithms had to deal with headers and footers, lists, enumerations, call-outs, two-column documents, and other formatting.
  • Health care-specific part-of-speech tagging. Not only was a different model required, but the health care models also use additional parts of speech, because doing so improves the accuracy of clinical named entity recognition.
  • Health care-specific entity normalization algorithms. Named entity recognition by itself is often useless in practice: annotating that “eye” and “infection” in “both eyes seem to be infected” are medical terms doesn’t help much. In contrast, marking the whole chunk of text as code 312132001 from the standard SNOMED-CT clinical terminology, while normalizing over the different ways to describe the same finding, is much more useful. It lets your application base business logic on this code, no matter how, exactly, the finding was expressed in the free-form text it came from—see the sketch after this list.
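A hypothetical sketch of what that normalization buys you: downstream logic keys off the SNOMED-CT code rather than the surface wording. The code below reuses 312132001 from the example above; the action names and mapping are made up for illustration:

```python
# Hypothetical mapping from normalized SNOMED-CT codes to downstream actions.
# 312132001 is the code from the example above; the actions are made up.
ACTION_BY_SNOMED = {
    "312132001": "flag_for_clinical_review",
}

def route(finding_code: str) -> str:
    # "both eyes seem to be infected", "bilateral ocular infection", etc.
    # all normalize to the same code, so one rule covers every phrasing.
    return ACTION_BY_SNOMED.get(finding_code, "no_action")

print(route("312132001"))  # -> flag_for_clinical_review
```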

In short: the deeper we go into treating health care texts as different languages, the closer we get to matching and exceeding human accuracy on the same tasks.

Lesson #3: Start with labeling ground truth

So, how do you start your own project? How do you know how far off you are and whom to trust? One way is to start by building a labeled validation set. For example, if you are interested in automating ICD-10 coding from outpatient notes, have clinicians define a representative sample of such records, de-identify them, and have professional clinical coders label them (by assigning the correct codes). If you are interested in extracting critical events from radiology reports or missed safety events from in-patient notes, have clinicians define the sample and label them correctly first.

This will often uncover blockers you need to address before involving (and wasting the time of) your data science team. If you don’t have access to enough data, or can’t de-identify it at scale, there’s no way to build a reliable model anyway. If clinicians cannot consistently agree on the correct labels, the first problem to solve is agreeing on clinical guidelines, not bringing in data scientists to automate a disagreement. Finally, if you find you’re facing highly unbalanced classes (i.e., you’re looking for something that happens to a handful of patients per year), it may be wise to redefine the problem before calling in the data scientists.
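One quick way to check the “clinicians agree” condition is to have two coders label the same sample and measure inter-annotator agreement. A minimal sketch with scikit-learn, using made-up ICD-10 labels:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical example: two clinical coders assign ICD-10 codes to the
# same five de-identified outpatient notes (labels made up for illustration).
coder_a = ["I10", "E11.9", "J45.909", "I10", "M54.5"]
coder_b = ["I10", "E11.9", "J45.40",  "I10", "M54.5"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Inter-annotator agreement (Cohen's kappa): {kappa:.2f}")
# Low agreement is a signal to fix the clinical guidelines first,
# not a problem for data scientists to automate away.
```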

Once you have a representative, agreed-upon, and correctly labeled validation set, you can start testing existing libraries and cloud providers. Most likely, the first test will immediately uncover the gaps between each offering and your needs. The smartest teams we’ve worked with have set up one- or two-week test projects, in which the goal is to use a library or service to reach the maximum level of accuracy for your specific needs. Doing this lets you evaluate how easily each service can train custom models, define the domain-specific features and pipeline steps your solution requires, and explain the results back to you.
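Scoring each candidate against the validation set can be as simple as the following sketch (again with made-up ICD-10 labels; a real project would also weigh per-code precision and recall):

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical ICD-10 labels: clinician-approved ground truth vs. the codes
# predicted by one candidate library or service on the same notes.
ground_truth = ["I10", "E11.9", "J45.909", "I10", "M54.5"]
predicted    = ["I10", "E11.9", "R07.9",   "I10", "M54.5"]

print("Accuracy:", accuracy_score(ground_truth, predicted))
print(classification_report(ground_truth, predicted))
```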

Such an approach can be a great education opportunity for your team. It tests both the packaged software and the support/consulting aspects of the services you’ll evaluate. It will show you how far you are from achieving a level of accuracy that’s in line with your business needs. Finally, this third lesson enables you to validate lessons #1 and #2 on your own, without taking my word for them.

Best of luck and success in your projects. Since this is health care we’re talking about, the world needs you to succeed!
