There’s a lot of buzz around artificial intelligence (AI) these days, with claims like, “AI will replace doctors,” “AI understands all clinical texts perfectly,” “large language models (LLMs) can diagnose like physicians,” and “LLMs will eliminate administrative tasks instantly.”
While AI certainly has the potential to transform healthcare, many of these claims are exaggerated. The truth is, AI is here to enhance the work of clinicians, not replace them. AI struggles with complex, long-term patient histories, and while LLMs can support decision-making, they still require careful handling of uncertainty. In short, LLMs can be helpful, but they still need validation and safety checks to ensure accuracy and reliability.
The University of Colorado Anschutz Medical Campus (CU Anschutz) recently hosted its inaugural conference on AI, titled "Engaging with AI." This event provided a platform to explore how AI is transforming research, education, and collaboration across various disciplines. Casey Greene, PhD, founding chair of the Department of Biomedical Informatics (DBMI), remarked, "At DBMI, we take great pride in being the academic foundation at CU Anschutz, pioneering initiatives that seamlessly integrate cutting-edge technology with an unwavering commitment to advancing clinical excellence."
During the event, Yanjun Gao, PhD, assistant professor of biomedical informatics at the University of Colorado School of Medicine, presented research from the Language, Reasoning, Knowledge (LARK) Lab. Her presentation focused on the evaluation of LLMs in clinical practice, highlighting their potential to improve medical diagnostics and patient care. Gao also shared insights into ongoing initiatives aimed at integrating AI advances into clinical workflows, emphasizing the impact these technologies could have on the future of healthcare.
Streamlining Physician-Patient Communication
Physicians often receive a large volume of patient messages, and replying to them can be incredibly time-consuming. To address this, Microsoft and OpenAI partnered with Epic in 2023 to integrate GPT-4 into Epic's EHR system. One of the first uses of this technology was to automatically draft responses to patient messages. This integration came with its own set of challenges, however, particularly around "prompt engineering." Crafting effective prompts is key to getting accurate and reliable responses from an LLM, but it is not intuitive: it takes trial and error, experimenting with different phrasings and formats to find the best way to guide the model toward useful answers. A prompt that reads well to a human does not always work for AI, since LLMs interpret language differently. Moreover, communication style varies across medical specialties, making it difficult to create a one-size-fits-all prompting strategy in healthcare.
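To make that trial-and-error process concrete, here is a minimal sketch of a prompt-iteration loop. Everything in it is hypothetical: the `call_llm` stub, the prompt templates, and the review step are invented for illustration and do not reflect the actual Epic integration.

```python
# Illustrative only: a trial-and-error loop for comparing prompt phrasings.
# `call_llm` is a stand-in for whatever model endpoint an integration exposes.

def call_llm(prompt: str) -> str:
    """Placeholder model call; returns a canned draft so the sketch runs."""
    return f"[draft reply generated from a {len(prompt)}-character prompt]"

# Candidate phrasings for the same task: drafting a reply to a patient message.
PROMPT_VARIANTS = [
    "Draft a brief, empathetic reply to this patient message:\n{message}",
    ("You are a primary care nurse. Reply to the patient below in plain "
     "language, and suggest follow-up if symptoms worsen:\n{message}"),
    ("Summarize the patient's concern in one sentence, then draft a reply "
     "for a clinician to review before sending:\n{message}"),
]

def try_variants(message: str) -> list[tuple[str, str]]:
    """Run every variant and collect drafts for side-by-side human review."""
    return [(t, call_llm(t.format(message=message))) for t in PROMPT_VARIANTS]

for template, draft in try_variants("Can I get a refill on my lisinopril?"):
    print(template.splitlines()[0], "->", draft)
```

The point of the loop is that no single phrasing is known in advance to be best; a human still compares the drafts and picks the template that works for their specialty.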
To address this, Gao and her team developed Cliniciprompt, a software framework designed to help non-technical users, particularly in healthcare, automatically generate effective prompts. Cliniciprompt offers a step-by-step guide for creating prompts and includes built-in metrics to ensure compatibility with the LLMs being used. It also uses retrieval-augmented generation (RAG), a technique that stores successful prompt examples and retrieves them for reuse. Users simply describe their desired outcome, and Cliniciprompt generates the appropriate prompts to guide the LLM.
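Cliniciprompt's internals are not spelled out here, but the retrieval idea can be sketched: keep a library of prompts that worked, and when a user describes a new task, surface the closest stored example. The sketch below is illustration only, not Cliniciprompt's code; the prompt library is invented, and simple word overlap stands in for the embedding-based similarity search a real system would likely use.

```python
# Illustrative sketch of retrieval-augmented prompt reuse (not Cliniciprompt's
# actual implementation): store prompts that worked, then retrieve the closest
# match for a new task description.

PROMPT_LIBRARY = {
    "reply to patient portal message about medication refill":
        "Draft a courteous reply confirming the refill request was received...",
    "summarize new lab results for a patient":
        "Explain these lab results in plain language at an 8th-grade level...",
}

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two task descriptions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def retrieve_prompt(task: str) -> str:
    """Return the stored prompt whose task description best matches `task`."""
    best_task = max(PROMPT_LIBRARY, key=lambda t: jaccard(t, task))
    return PROMPT_LIBRARY[best_task]

# Retrieves the refill-reply prompt, the closest match in the library.
print(retrieve_prompt("patient message asking about a medication refill"))
```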
Since its rollout in February 2024, Cliniciprompt has had a substantial impact. Usage among nurses has climbed to around 90%, and usage among physicians to about 75%. These gains point to a significant boost in the effectiveness of AI-drafted message replies. Cliniciprompt has also gained national recognition, with invitations to present at major conferences like AMIA and the Epic User Group Meeting.
Predicting the Unpredictable
One major challenge in medicine is dealing with uncertainty. Physicians constantly navigate uncertainty when making diagnoses, weighing factors such as pretest probability (the likelihood a patient has a condition before testing) and post-test probability (the likelihood after test results). Navigating this uncertainty is critical to avoiding misdiagnoses. Gao’s research delves into how LLMs can aid in predicting pretest diagnosis probability, evaluating their reliability in supporting diagnostic decision-making. The question is whether LLMs can accurately estimate the likelihood of a patient having a specific condition based on their medical history and other factors. Gao explained, “We’re investigating whether the internal representations inside LLMs, such as the probability distributions over possible words, can be used to estimate uncertainty.”
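To make the pretest/post-test distinction concrete, here is the standard Bayesian update a clinician's reasoning follows. The sensitivity, specificity, and pretest probability below are invented for the example, not numbers from Gao's research.

```python
# Worked example of pretest vs. post-test probability (illustrative numbers).
# A positive test result updates the pretest probability via Bayes' theorem.

def post_test_probability(pretest: float, sensitivity: float,
                          specificity: float) -> float:
    """Probability of disease given a positive test result."""
    true_pos = pretest * sensitivity               # diseased, test positive
    false_pos = (1 - pretest) * (1 - specificity)  # healthy, test positive
    return true_pos / (true_pos + false_pos)

# Hypothetical: 10% pretest probability, test with 90% sensitivity and
# 80% specificity.
p = post_test_probability(pretest=0.10, sensitivity=0.90, specificity=0.80)
print(f"Post-test probability after a positive result: {p:.1%}")  # ~33.3%
```

Even a fairly accurate test only raises a 10% pretest probability to about 33%, which is why getting the pretest estimate right matters so much for the downstream decision.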
The team has compared LLM predictions with state-of-the-art machine learning algorithms for diagnosing conditions like sepsis, arrhythmia, and congestive heart failure. Gao noted, “We’ve found correlations between the LLM’s internal representations and the predictions from traditional machine learning models, but there are instances where LLMs struggle with uncertainty estimation.” Moreover, bias in LLM predictions remains a critical concern, particularly when demographic factors such as race and sex influence AI predictions. Addressing these challenges is essential before deploying AI for high-stakes medical decision-making.
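One hedged way to picture such a comparison: take risk estimates derived from an LLM and predictions from a traditional model for the same patients, then measure their agreement and calibration. Every number below is fabricated for illustration; this is not the lab's data or evaluation protocol.

```python
import numpy as np

# Illustrative comparison of LLM-derived risk estimates against a traditional
# model's predictions (made-up values, not results from Gao's lab).
llm_probs = np.array([0.82, 0.15, 0.60, 0.05, 0.70])  # e.g., from token probabilities
ml_probs  = np.array([0.90, 0.10, 0.55, 0.08, 0.40])  # e.g., logistic regression
labels    = np.array([1,    0,    1,    0,    1])     # observed outcomes

# Pearson correlation between the two sets of estimates.
corr = np.corrcoef(llm_probs, ml_probs)[0, 1]

# Brier score (mean squared error of probabilities): lower = better calibrated.
brier_llm = np.mean((llm_probs - labels) ** 2)
brier_ml  = np.mean((ml_probs - labels) ** 2)

print(f"correlation={corr:.2f}  Brier(LLM)={brier_llm:.3f}  Brier(ML)={brier_ml:.3f}")
```

A high correlation with the traditional model is encouraging, but calibration is the safety-relevant property: a model whose 30% predictions are wrong far more or less than 30% of the time will mislead exactly the pretest reasoning described above.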
Data Summarization and Safe Clinical Applications
Beyond diagnostic support, Gao’s lab is pursuing two additional projects aimed at enhancing AI in healthcare. The first explores how LLMs can summarize extensive and complex patient data, both to reduce cognitive overload for clinicians and to bridge communication gaps between providers and patients through tailored summaries. Despite being trained on vast amounts of text and designed to process long inputs, LLMs still struggle to summarize longitudinal medical records effectively, making errors such as hallucination and omission (a deliberately naive version of an omission check is sketched below). Gao’s lab is exploring novel approaches to improve this capability and ensure critical medical insights are accurately captured and meaningfully presented. You can learn more about that study here.

The second project focuses on evaluating the safety of LLMs in clinical tasks, particularly in generating clinical text. Many successful LLMs today are trained through a method called reinforcement learning from human feedback (RLHF), which aligns the AI with human preferences. Gao explained, “We are assessing whether these methods can ensure that LLMs align with human values when generating and evaluating clinical text, especially in healthcare contexts.” She added, “While we don’t have all the answers yet, we are actively working on these critical questions.” You can learn more here.
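As an illustration of the omission problem mentioned above, here is a naive check that flags source facts missing from a generated summary. The record facts and summary are invented, and plain string matching is far weaker than what a real clinical safety evaluation would require (it misses paraphrase, negation, and context), but it shows why automated checks on LLM summaries matter.

```python
# Naive illustration of an omission check for clinical summaries: flag key
# facts from the record that never appear in the generated summary.

RECORD_FACTS = [
    "penicillin allergy",
    "type 2 diabetes",
    "metformin 500 mg twice daily",
]

summary = ("Patient with type 2 diabetes, managed on metformin 500 mg "
           "twice daily. No acute complaints today.")

missing = [fact for fact in RECORD_FACTS if fact.lower() not in summary.lower()]
if missing:
    print("Possible omissions:", missing)  # -> ['penicillin allergy']
```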
Gao emphasized, “AI in healthcare is not just hype. There are many exciting research opportunities in this field, but to realize their full potential, collaboration is essential.” While Gao’s lab is relatively new, they are bridging the gap between technical expertise and clinical practice.
Greene stated, "Dr. Gao’s work is a prime example of how responsible AI advances can enhance, rather than replace, the critical human elements of healthcare. By developing tools like Cliniciprompt and rigorously exploring uncertainty in AI-driven diagnostics, her team is helping shape the future of patient care."