Department of Biomedical Informatics

Health AI in 2026: CU Researchers Are Implementing Trustworthy Tools to Support Clinicians

Written by David DeBonis | January 16, 2026

In the past three years, Large Language Models (LLMs) have become increasingly relevant across the globe. With the advent of chatbots like OpenAI’s ChatGPT, Google’s Gemini and Microsoft’s Copilot, along with many other tools, this kind of AI is accessible (and often user-friendly) to practically anyone with a computer and internet access.

Researchers in the Department of Biomedical Informatics (DBMI) at the University of Colorado Anschutz (CU Anschutz) are leveraging LLMs to improve healthcare, seeking a better experience for both patients and clinicians.

What are Large Language Models (LLMs)?

LLMs are a form of AI software that processes and produces natural-language data (the data of spoken and written human languages such as English, Chinese or Spanish). These models are built and “trained” by first “reading” and processing a massive amount of text data; they then use statistical modeling to predict the next word in a sentence given the context of the previous words.

Once these LLMs can reliably predict the next word, they are fine-tuned to perform specific tasks such as chatting with a user (e.g. ChatGPT), translating text (e.g. Google Translate), or assisting with code (e.g. GitHub Copilot).
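
To make that idea concrete, here is a minimal sketch of the “predict the next word” step, using a tiny hand-built bigram model rather than a real neural network; the example sentences and function name are invented purely for illustration.

```python
from collections import Counter, defaultdict

# A toy corpus standing in for the massive amount of text a real LLM reads.
corpus = [
    "the patient has a fever",
    "the patient has a cough",
    "the patient has pneumonia",
]

# Count how often each word follows the one before it (a bigram model;
# real LLMs condition on much longer contexts using neural networks).
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

def next_word_probabilities(prev_word):
    """Turn raw counts into probabilities for the next word, given the previous word."""
    counts = following[prev_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# After "has", the model predicts "a" two-thirds of the time and "pneumonia" one-third.
print(next_word_probabilities("has"))
```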

How LLMs fit into healthcare

LLMs are making their way into healthcare, too. For example, LLMs are being used to summarize patients’ electronic health records (EHRs) to save time for clinicians. LLMs are also being used to help clinicians with diagnostics. However, this introduction of LLMs into healthcare raises a pressing question: can we trust these AI tools to support clinicians and patients?

Yanjun Gao, PhD, assistant professor of biomedical informatics at CU Anschutz, thinks the answer is more nuanced than a simple “yes” or “no.” Gao and her team in the Language, Reasoning, and Knowledge Lab (LARK Lab) are researching how these tools can be effectively deployed in medical settings. The central question guiding Gao’s research is: how can we make AI systems that are not only helpful for clinicians, but also safe, trustworthy and aligned with human needs?

How can LLMs help clinicians make better decisions?

Much of the research Gao and her team are currently working on revolves around helping clinicians make better decisions, especially in diagnosis. Gao explained that “diagnostic errors have been a national priority in healthcare. In fact, large national studies suggest that diagnostic errors occur in roughly 20–25% of patient records. The reason we have so many diagnostic errors is because medicine—by itself—is very difficult.

“Clinicians are asked to make life-critical decisions under uncertainty using probabilistic thinking. A 30% chance of sepsis is not the same as a 30% chance of diabetes, but estimating and acting on those risks is incredibly hard, even for experts.” - Yanjun Gao, PhD, assistant professor of biomedical informatics at CU Anschutz

When a patient enters a hospital and a clinician starts to screen them, the clinician will come up with an initial hypothesis of what might be going on. For example, if a patient has sepsis, it could be caused by a number of things: pneumonia, a skin infection or even COVID-19. Among all of these possible causes—along with all of the other variables involved—the clinician must answer:

  • What is the most likely path that led to the patient’s current signs and symptoms? and
  • What next actions (tests, treatments, etc.) are most likely to lead the patient to a good outcome?

“It’s a very difficult problem to solve,” said Gao. “So, we are trying to bring LLMs into this space—not to replace doctors, but to help them better diagnose patients.”

One example of how LLMs can support diagnosis is consolidating and listing all of the relevant evidence from a patient’s EHR, allowing the clinician to consume the information quickly. Although EHRs were originally designed to improve efficiency in the clinical workflow, they can actually become a source of information overload for clinicians. These records often contain thousands of pages of lab results, vital signs, clinical notes and more. It’s often very difficult for clinicians to navigate this overwhelming amount of information and find the relevant evidence. Surfacing that evidence is a problem clinicians are still trying to solve, and LLMs may be a strong candidate for the job.

Challenges with LLMs in healthcare

Navigating uncertainty with LLMs

As mentioned previously, when a patient enters a hospital, there are countless possibilities for what could be going on. Clinicians, with their extensive training, are used to working through this uncertainty to find the most likely hypothesis. LLMs deployed in healthcare settings have to navigate that same uncertainty.

To make LLMs reliable and trustworthy for clinicians, then, the models need two capabilities: first, they must process clinical evidence accurately and base their output on that evidence; second, they must present their output in a way that lets humans understand why the model reached the conclusion it did.

“We want to assure that when LLMs present a predicted diagnosis, they can accurately convey why they made this prediction and how confident they are in their prediction,” said Gao.   

This process of quantifying the confidence of predictions is a form of uncertainty estimation (or uncertainty quantification), and it is part of what Gao and the LARK Lab are researching. For example, in one recent work published at the Conference on Empirical Methods in Natural Language Processing (EMNLP), the team proposed a novel way to help LLMs quantify uncertainty more accurately. In the paper, the team introduced MUSE (Multi-LLM Uncertainty via Subset Ensemble), one of the first multi-LLM uncertainty algorithms grounded in information-theoretic principles. Rather than relying on a single model’s output, MUSE identifies a trusted subset of models whose predictions are most reliable for a given clinical question, leading to better-calibrated confidence estimates than those produced by any single LLM alone. The team further explores Bayesian approaches—a statistical framework widely used in machine learning—to transfer MUSE’s calibrated uncertainty into the fine-tuning of a single LLM, improving its reliability and calibration.
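
The paper itself develops an information-theoretic way to pick that trusted subset of models, but the general intuition behind multi-LLM uncertainty, namely combining several models’ predicted probabilities and measuring how unsure the combination is, can be sketched roughly as follows. The numbers below are invented for illustration and are not outputs of MUSE.

```python
import math

# Hypothetical probabilities three different LLMs assign to candidate
# diagnoses for the same patient (made-up numbers, for illustration only).
model_predictions = [
    {"sepsis": 0.70, "pneumonia": 0.20, "covid-19": 0.10},
    {"sepsis": 0.55, "pneumonia": 0.35, "covid-19": 0.10},
    {"sepsis": 0.65, "pneumonia": 0.25, "covid-19": 0.10},
]

def ensemble_average(predictions):
    """Average the probability each model assigns to each diagnosis."""
    diagnoses = predictions[0].keys()
    return {d: sum(p[d] for p in predictions) / len(predictions) for d in diagnoses}

def entropy_bits(distribution):
    """Shannon entropy in bits: higher means the combined prediction is less certain."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

combined = ensemble_average(model_predictions)
print(combined)                                   # averaged diagnosis probabilities
print(f"uncertainty: {entropy_bits(combined):.2f} bits")
```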

Tools like MUSE can support the use of LLMs in diagnostic decision making by helping clinicians understand how confident an LLM’s prediction really is. An AI system could sound very confident while being profoundly wrong, which is dangerous in medicine. Calibrated uncertainty estimates could help clinicians know when to trust AI, and when not to.

AI Hallucinations

Another challenge that arises with bringing LLMs into healthcare is what’s often referred to as AI hallucinations. These “hallucinations” are errors that appear when an LLM does not have enough evidence to predict something, so instead it fabricates information that is not drawn from its source input. “Trying to mitigate these hallucinations has been a very large issue in the field of AI,” said Gao.

Biases in LLMs

In addition to uncertainty and AI hallucinations, models can inherit biases from the massive collections of data used in their pretraining. Much of this data is pulled from public internet sources, which carry biases of their own. These biases—both explicit and implicit—become encoded in the models’ internal representations and can influence downstream predictions.

For example, Gao and the LARK Lab found that adding demographic data such as ethnicity or sex can actually flip an LLM's predicted diagnosis for the same patient, even though the vital signs, laboratory results and the rest of the clinical evidence remain unchanged. This reveals how LLMs can unintentionally prioritize demographic cues over vital signs, laboratory data or other medically relevant information—posing serious risks when models are used for clinical decision support.

“It would be extremely dangerous if an LLM were to ignore important clinical evidence such as the laboratory results and the vital signs, and instead focus on potentially biased cues such as a patient’s ethnicity,” emphasized Gao.
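
The lab’s experiments are far more extensive, but the basic counterfactual check behind this kind of finding, changing only a demographic field and seeing whether the predicted diagnosis flips, can be sketched as below. The query_llm function and the note template are hypothetical placeholders, not the lab’s actual evaluation code.

```python
# Hypothetical stand-in for calling whichever LLM is being audited;
# in a real study this would return the model's predicted diagnosis.
def query_llm(clinical_note: str) -> str:
    raise NotImplementedError("plug in the model under evaluation")

NOTE_TEMPLATE = (
    "Patient sex: {sex}. "
    "Temp 39.1 C, HR 118, BP 92/60, WBC 16.2, lactate 3.4. "
    "What is the most likely diagnosis?"
)

def demographic_flip_test(attribute_values):
    """Vary only a demographic field and check whether the diagnosis changes."""
    predictions = {
        value: query_llm(NOTE_TEMPLATE.format(sex=value))
        for value in attribute_values
    }
    flipped = len(set(predictions.values())) > 1
    return predictions, flipped

# Example usage (commented out because query_llm is a placeholder):
# predictions, flipped = demographic_flip_test(["male", "female"])
# If flipped is True, the demographic cue changed the diagnosis even though
# every piece of clinical evidence was identical.
```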

Designing AI systems that are safe, trustworthy and aligned with human needs

Despite the challenges of integrating LLMs into healthcare environments, Gao and the LARK Lab recognize that the adoption of these tools is inevitable, and that, done carefully, it can genuinely improve the medical field. Their work therefore focuses on developing approaches to ensure these systems are introduced in ways that are safe, trustworthy and aligned with human needs.

Addressing bias in LLMs

To address the bias that can show up in large language models, the LARK Lab is studying what’s happening inside these systems when they make decisions. Instead of retraining the entire model from scratch—which is extremely expensive and time‑consuming—they look at which parts of the model “light up” when it sees biased information versus unbiased information. “It’s kind of like we are doing brain surgery to correct these biases,” said Gao. 

In a recent preprint, Gao’s team took this mechanistic approach to bias in LLMs, treating the model like a brain and opening it up to examine individual neurons. They successfully pinpointed specific neurons that encode stereotype-related demographic information inherited from pretraining. They also attempted an “LLM surgery,” manually suppressing those neurons to see whether biased behaviors were reduced. They found that this intervention does lessen certain biases, but the effects are incomplete. This underscores how deeply embedded these biases are in LLMs, and how difficult they are to fully remove.
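
For readers curious what “suppressing specific neurons” can look like in practice, here is a rough sketch using a PyTorch forward hook to zero out a chosen set of units in one layer of a model. The layer path and neuron indices are placeholders for illustration, not the ones identified in the preprint.

```python
import torch

# Placeholder indices for neurons found to encode stereotype-related
# information; a real study identifies these empirically.
SUPPRESSED_NEURONS = [12, 87, 403]

def make_suppression_hook(neuron_indices):
    """Build a forward hook that zeroes the chosen units in a layer's output."""
    def hook(module, inputs, output):
        patched = output.clone()
        patched[..., neuron_indices] = 0.0   # "surgically" silence those neurons
        return patched
    return hook

# Usage sketch: attach the hook to one layer of a causal language model,
# re-run the prompt that produced biased behavior, and compare the outputs
# with and without the intervention.
# layer = model.transformer.h[10].mlp                  # layer choice is illustrative
# handle = layer.register_forward_hook(make_suppression_hook(SUPPRESSED_NEURONS))
# ...run the model and compare predictions...
# handle.remove()
```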

Emphasizing multidisciplinary collaboration

Although there is a lot of computer science involved in integrating LLMs into healthcare, the work requires much more than software and mathematics. For this reason, Gao and the LARK Lab emphasize multidisciplinary collaboration, especially between researchers and clinicians.

Clinicians can provide crucial context about real‑world patient care, workflows, and the practical constraints that shape medical decision‑making—insight that researchers would not have access to on their own. 

“If we want to make positive change with AIs, we need all of the stakeholders involved to also collaborate in developing these tools,” Gao said. “I myself cannot be working alone without the help of my clinical partners. It's only because of them that I understand the problem better, and it's only because of our work together that we can come up with solutions that are actually targeted to the problem at the bedside.”