Department of Biomedical Informatics

What is Natural Language Processing (NLP)?

Written by David DeBonis | January 28, 2026

When a research study is conducted, the data it produces are analyzed once by the study’s investigators and then submitted to a public repository. There, they join hundreds of thousands of existing datasets, gathering dust and almost never revisited or reused by other researchers. Although extensive time and money were invested in each study, this massive collection of valuable data, generated by labs across the world over the past 25 years, is left sitting on the shelf. Yet these data hold unforeseen opportunities for biomedical researchers to reuse them in new ways, make novel discoveries and advance biomedical research.

But a major challenge remains: how does another researcher quickly and comprehensively discover the studies relevant to their question of interest from the ocean of existing studies? 

Biomedical research data are not neatly tagged, categorized, described and indexed in a way that lets other researchers easily find and use them. Instead, they are accompanied by “metadata,” plain-language descriptions of a study’s goals and design and of what the data in its dataset are about. These metadata are often messy, unstructured and incomplete, and they rarely paint a full and accurate picture of what the study was examining.

Therefore, despite the promise these public biomedical data hold, the poor quality of their metadata makes it surprisingly hard for a scientist to simply find the right datasets for a question, let alone reuse those datasets reliably. This is why Arjun Krishnan, PhD, associate professor of biomedical informatics at the University of Colorado Anschutz School of Medicine (CU Anschutz School of Medicine) and leader of the Krishnan Lab, conducts research on Natural Language Processing (NLP).

NLP is a broad area of computer science focused on analyzing and working with natural language data. “Natural language” refers to the spoken and written forms of communication humans use every day (English, for example, is a natural language). NLP uses computational methods to identify patterns in this language and to interpret human communication as accurately as possible. Krishnan and his lab implement advanced NLP tools to help standardize metadata so that the underlying data become discoverable for further research.

We sat down with Krishnan to discuss NLP and how it is being leveraged in biomedical research.