<img height="1" width="1" style="display:none" src="https://www.facebook.com/tr?id=799546403794687&amp;ev=PageView&amp;noscript=1">

What is Natural Language Processing (NLP)?

Understanding the basics of NLP and how NLP is making biomedical data reusable at scale.


by David DeBonis | January 28, 2026

When a research study is conducted, the data it produces are typically analyzed once by the study’s investigators and then submitted to a public repository. There, they join hundreds of thousands of existing datasets, gathering dust, almost never to be revisited or reused by another researcher. Despite the extensive time and money invested in each study, this massive collection of valuable data, generated by labs across the world over the past 25 years, is left sitting on the shelf. Yet these data hold untapped opportunities for biomedical researchers to reuse them in new ways, make novel discoveries and advance biomedical research.

But a major challenge remains: how does another researcher quickly and comprehensively discover the studies relevant to their question of interest from the ocean of existing studies? 

Biomedical research data is not neatly tagged, categorized, described and indexed in a way that lets other researchers easily find and use it. Instead, it is accompanied by “metadata”: plain-language descriptions of the study’s goals and design and of what the data in the study’s dataset are about. This metadata is often messy, unstructured and incomplete, and it rarely paints a full and accurate picture of what the study was examining.

Therefore, despite the promise these public biomedical data hold, the poor quality of their metadata makes it surprisingly hard for a scientist to simply find the right datasets for a question, let alone reuse them reliably. This is why Arjun Krishnan, PhD, associate professor of biomedical informatics at the University of Colorado Anschutz School of Medicine (CU Anschutz School of Medicine) and leader of the Krishnan Lab, conducts research on Natural Language Processing (NLP).

NLP is a broad area of computer science focused on analyzing and working with natural language data. “Natural language” refers to the spoken and written forms of communication humans use every day (for example, the English language is a form of natural language). NLP uses computational methods to identify patterns in this language and to interpret human communication as accurately as possible. Krishnan and his lab implement advanced NLP tools that help standardize metadata and make data discoverable for further research.

We sat down with Krishnan to discuss NLP and how it is being leveraged in biomedical research.

Q&A Header

How would you define Natural Language Processing (NLP)?

Humans write and speak to one another using natural language, which is simply the way in which we organically use words and sentences to communicate with each other. 

Natural language processing (NLP) is a field in computer science where we train computers to understand and extract meaning from human language—written or spoken—by identifying statistical patterns in how words appear together and relate to each other.

What are the different types of natural language processing?

There are a lot of different types of NLP, and they have come in waves of different tools. 

A lot of what the field and my lab started out with was the kind of NLP that is closer to statistics than machine learning, which means we would look at word frequencies. One of the most common techniques is to say: if a word (e.g., “that”) appears very frequently in general across various documents, it’s probably not an informative word; on the other hand, if a word (e.g., “psoriasis”) appears rarely in general but frequently in a specific document, it’s probably informative.
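To make that concrete, here is a minimal sketch of this frequency-based weighting using scikit-learn’s TF-IDF implementation; the metadata snippets are made up, and the snippet illustrates the general technique rather than the lab’s actual pipeline.

```python
# A minimal TF-IDF sketch: words common across documents get low weight,
# while words concentrated in one document (like "psoriasis") get high weight.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Gene expression profiling of psoriasis skin lesions",              # made-up metadata snippets
    "Gene expression profiling of healthy skin tissue",
    "RNA sequencing of liver tissue from patients with liver disease",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Weights for the first document: "psoriasis" (unique to this document)
# outscores words that appear in several documents, such as "expression".
terms = vectorizer.get_feature_names_out()
weights = dict(zip(terms, tfidf.toarray()[0]))
for term in ("psoriasis", "expression", "gene"):
    print(term, round(float(weights[term]), 3))
```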

Then the second wave of tools came along, which is what I would call embedding models. An embedding model basically says: I’m going to take a large corpus of text, and figure out a way to ‘represent’ every word as a set of numbers so that if two words have very similar sets of numbers, they likely mean very similar things. Because words are very hard to use as input to statistical or machine learning models, but numbers are natural inputs for these models, the embedding model’s ability to turn words into numbers is immensely valuable for analyzing text.  
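As a rough illustration of that idea (assuming the sentence-transformers package and a small general-purpose model, not anything specific to the lab’s work), two different phrasings of the same concept end up with very similar sets of numbers:

```python
# A minimal embedding sketch: phrases with similar meanings get similar vectors.
# Assumes the sentence-transformers package; the model choice is illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

phrases = ["myocardial infarction", "heart attack", "skin biopsy"]
vectors = model.encode(phrases)  # one numeric vector per phrase

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[0], vectors[1]))  # high: the two phrases mean the same thing
print(cosine(vectors[0], vectors[2]))  # lower: unrelated concepts
```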

Then the next wave, in early 2022 and 2023, is when Large Language Models (LLMs) started coming out. These are what the NLP field calls ‘foundation models’ because they are trained—without any task in mind—on massive amounts of text primarily from the internet, then they are fine-tuned for specific language tasks like question-answering, translation, and summarization. The ‘large’ in LLM refers to billions of knobs and switches (‘parameters’) inside these models that get calibrated during the training process. 
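As a sketch of what one such language task can look like in this setting (using the openai Python package; the model name and prompt are placeholders, an API key is assumed, and this is not a description of the lab’s workflow), an LLM can be asked to pull structured annotations out of a free-text dataset description:

```python
# Illustrative only: ask an LLM to extract tissue and disease annotations
# from an unstructured dataset description. Assumes the openai package and
# an API key in the environment; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

description = (
    "Skin samples were collected from 12 patients with plaque psoriasis "
    "and 10 healthy controls, then profiled with RNA sequencing."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system",
         "content": "Extract the tissue and disease from the dataset description. "
                    "Answer as JSON with keys 'tissue' and 'disease'."},
        {"role": "user", "content": description},
    ],
)

print(response.choices[0].message.content)  # e.g. {"tissue": "skin", "disease": "psoriasis"}
```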

In our research, we now use traditional statistical NLP, embedding models and LLMs.

How do you use NLP in biomedical research?

The primary reason we use NLP is to solve the pervasive metadata problem and improve data discovery and data reuse. We didn’t start out with an interest or expertise in NLP. Instead, we had to leverage NLP to address a problem we encountered in our research. 

The main focus of our lab is to develop machine learning methods and tools that can leverage massive collections of publicly available data to study mechanisms related to diseases. Scientists have been generating giant biomedical datasets of various kinds over the past 25 to 30 years. Altogether, these data contain millions and millions of samples across hundreds of thousands of studies that are already out there. These public data are actually super valuable, but they’re just sitting there with almost no one using them. The NIH has already invested a lot of money in all these studies to generate these data. It would be a pity to spend so much money on a study and then use its data only once.

If a researcher wanted to use these existing data for their research, simply finding the right datasets is complicated, because these databases are an ocean of data, and they are growing at an exponential rate. What’s more, many times the metadata that describes the datasets is unstructured and/or incomplete.

For example, when a researcher enters metadata to annotate a dataset, it is not standardized or structured; it’s just plain English. On top of that, scientists use jargon and terminology in very different ways. One researcher might upload a dataset and annotate it as data studying “myocardial infarction,” whereas another researcher might upload a dataset and annotate it as data studying “heart attack.” But both datasets are discussing the same thing. This lack of standardization is a challenge.
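One minimal way to picture the fix is a lookup that maps free-text synonyms onto a single standardized term. The tiny hand-written table below is purely illustrative; real pipelines map terms to curated ontologies such as MeSH or MONDO rather than to a hard-coded dictionary.

```python
# Illustrative only: map different free-text disease mentions to one standardized label.
# Real pipelines attach identifiers from curated ontologies (e.g., MeSH, MONDO).
SYNONYMS = {
    "myocardial infarction": "Myocardial Infarction",
    "heart attack": "Myocardial Infarction",
}

def standardize(annotation: str) -> str:
    key = annotation.strip().lower()
    return SYNONYMS.get(key, f"UNMAPPED: {annotation}")

print(standardize("Heart attack"))           # Myocardial Infarction
print(standardize("myocardial infarction"))  # Myocardial Infarction (same standardized term)
```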

As another example, a researcher studying psoriasis might upload skin tissue data but never explicitly label it as "skin tissue" in the metadata because they may assume it's obvious from context. But this makes it impossible for other researchers to find all skin tissue datasets when planning a new study.

NLP tools allow us to take the metadata records of existing datasets and annotate the datasets in a standardized way. These annotations then allow any researcher to search old data in an unambiguous way and find the datasets relevant to their biomedical question of interest.

A solution my lab developed is Txt2Onto, a tool that uses machine learning to automatically read the unstructured text descriptions of datasets and assign standardized labels for tissue type, disease, cell type and experimental protocols. We've now applied Txt2Onto to annotate over a million publicly available samples, making them discoverable for the first time.

How does NLP benefit biomedical research?

This NLP research benefits the entire biomedical research field, because it makes data accessible that might have just gone to waste otherwise. 

Our sustained work in this area was recognized in 2022 when my lab received the NIH DataWorks! Prize: Significant Achievement Award for Data Reuse. This award highlights how our methods and tools are helping the broader biomedical community leverage existing public data more effectively.

What are some challenges with NLP in biomedical research?

The first challenge is that the metadata of these biomedical datasets is extremely messy. You can’t simply take 'off-the-shelf' NLP tools, apply them to biomedical data and expect that they will extract meaningful signals from the text. They need to be infused with our own knowledge of biomedical studies and biomedical research design.

This is where Txt2Onto comes in. It works by training machine learning models, informed by biomedical knowledge, on patterns in biomedical text that are extracted using NLP tools. Txt2Onto models trained this way can automatically extract tissue types, diseases and other key attributes even when they're buried in messy, unstructured descriptions.
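As an illustrative sketch only (not the actual Txt2Onto code), the general recipe of learning a standardized label from unstructured descriptions can be mimicked with a TF-IDF-plus-classifier pipeline; the training examples and the single "skin tissue" label below are made up.

```python
# Not Txt2Onto itself: a minimal sketch of training a classifier to assign a
# standardized label ("skin tissue" vs. not) from unstructured metadata text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Psoriasis lesional and non-lesional skin biopsies profiled by microarray",
    "Epidermal keratinocytes from healthy donors, bulk RNA sequencing",
    "Whole blood transcriptomes of sepsis patients",
    "Liver tissue from a mouse model of fatty liver disease",
]
is_skin = [1, 1, 0, 0]  # the standardized label we want to assign

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, is_skin)

# A new description that never says "skin" explicitly, like the psoriasis example above:
new_record = ["Lesional biopsies from plaque psoriasis patients, paired with controls"]
print(model.predict_proba(new_record)[0][1])  # estimated probability the sample is skin tissue
```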

The second challenge is scale. For each one of these datasets, we are not just looking at the metadata records, because the metadata is highly incomplete. So, for each dataset, we also must find the paper that was written and published along with the dataset, find the part of the paper that’s relevant for the dataset, then merge them to extract as much information as possible. Completing these workflows across millions of records is a challenge. We used this approach to add high-quality annotations to studies in the Human Microbiome Compendium, a huge resource for the community.

What are possible directions of NLP in biomedical research?

We are currently building partnerships to help expand the use of these tools. We are part of an international consortium called the Generalist Repository Ecosystem Initiative (GREI) which works with data repositories across the world where scientists deposit their data. We're also part of a large international team that was selected as a Finalist in Phase 1 of the NIH Data Sharing Index ("S-Index") Challenge. This initiative aims to create metrics and frameworks to improve how biomedical data is shared and described, which will make the metadata problem we're solving even more tractable at scale.

We are now working to apply the metadata NLP tools to all of the datasets in these repositories to make sure they are all completely standardized and harmonized so that any researcher can find the datasets that are relevant to their question of interest. By making decades of research data discoverable and reusable, these tools have the potential to accelerate biomedical discoveries without requiring a single new experiment—we're just finally able to use what we already have.

Featured Experts

Arjun Krishnan, PhD