
What is a Large Language Model (LLM)?

How large language models (LLMs) are transforming healthcare, natural language processing (NLP) and clinical documentation workflows.


by Melinda Lammert | August 5, 2025
Healthcare professional typing on a laptop with AI-powered large language model (LLM) software for clinical documentation and NLP tasks.

Quick Definition: What Is a Large Language Model? 

Large language models (LLMs) power today’s most advanced AI tools, from chatbots and search assistants to AI-powered clinical workflows and transformer-based generative AI models used in healthcare documentation. 

Large language models (LLMs) are a type of generative AI, built using advanced machine learning techniques within the field of natural language processing (NLP). These systems analyze vast amounts of text data to learn patterns and context, allowing them to generate coherent, human-like language across diverse applications, from chatbots to clinical documentation and AI-assisted decision-making. 

 “LLMs can imitate aspects of human thought, but they lack the emotional and experiential depth that defines human intelligence.” – Yanjun Gao, PhD, expert in generative AI and clinical NLP research 

LLMs are trained on massive amounts of text, typically scraped from the internet, so they can answer questions, summarize information, generate content, write stories and assist with complex tasks like coding or medical documentation. These models are transforming how we interact with technology, making it more natural, intuitive and AI-assisted.  

“AI-assisted LLM summarization tools help clinicians digest massive patient data quickly and reduce cognitive overload.” – Yanjun Gao, PhD, expert in generative AI and clinical NLP research 

Popular Large Language Models (LLMs) and How They Learn Language Patterns

Some of today’s most well-known state-of-the-art large language models include GPT-4o, Claude 3, Gemini 1.5, and LLaMA 3. These transformer-based generative AI models generate human-like text by learning patterns in massive datasets. 

What sets LLMs apart is their ability to learn from examples. They are trained on enormous datasets, which can include:

  • Books

  • Scientific articles

  • Websites

  • Medical literature

  • Public forums

Instead of “thinking” like humans, LLMs generate responses by recognizing language patterns in data and statistically predicting the next word in a sequence. This allows them to summarize information, answer questions, write stories, and assist with complex tasks across industries, from coding to healthcare documentation. 
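That "statistically predicting the next word" idea can be illustrated with a deliberately tiny sketch. Real LLMs use neural networks over subword tokens and web-scale data; this toy bigram counter only shows the underlying intuition of picking the most frequent continuation.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the web-scale text a real LLM is trained on.
corpus = (
    "the heart pumps blood through the body . "
    "the heart beats . the body needs blood ."
).split()

# Count how often each word follows each other word (a bigram model).
next_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_counts[prev][nxt] += 1

def predict_next(word):
    """Return the statistically most likely next word."""
    return next_counts[word].most_common(1)[0][0]

print(predict_next("pumps"))  # → blood
```

A real model replaces the raw counts with learned probabilities that also account for long-range context, but the prediction step is the same in spirit.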

How Do Large Language Models Work?

At the core of every modern large language model is the Transformer architecture, introduced in 2017. These deep learning models predict the next element in a sequence, which allows LLMs to process and generate human-like text efficiently.

The Transformer Architecture Explained

The Transformer architecture is the core technology behind tools like ChatGPT, and you can think of it like a brain made up of layers with billions of neurons that "light up" in different ways depending on the task, whether it's writing a story or solving a math problem.

These models learn by reading huge amounts of text, which helps them understand how language works and how to respond appropriately. Over time, they develop internal neural patterns, like mental circuits, that specialize in different types of thinking, such as answering questions or summarizing information. Just like different parts of the human brain handle different tasks, researchers have found that different parts of a Transformer model activate for different types of language tasks, suggesting internal specialization in NLP tasks.
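The key operation inside each Transformer layer is self-attention: every token compares itself against every other token to decide what context matters. Below is a minimal NumPy sketch of scaled dot-product self-attention with illustrative random weights, not any production model's actual parameters.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention, the core Transformer operation.

    Each token's query is compared against every token's key; the resulting
    weights decide how much of each token's value flows into the output.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])              # pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax per token
    return weights @ v                                   # blend of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # 4 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)                     # one updated vector per token
```

Stacking dozens of such layers, each with many attention "heads," is what gives the model the layered, specialized structure described above.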

How Are Large Language Models Trained?

Large language models are trained in three main stages, similar to teaching a student. This training pipeline ensures accuracy, safety, and alignment in generative AI applications. 

1. Pretraining

The model reads massive amounts of text and learns by predicting missing words.

Example:

"The heart pumps _____ through the body."

The model learns to predict "blood."

Through billions of such predictions, it learns grammar, context and patterns in natural language.
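Formally, each prediction is scored with cross-entropy: the model assigns a probability to every word in its vocabulary, and training nudges the probability of the correct word upward. The logit values below are hypothetical, chosen only to illustrate the objective.

```python
import numpy as np

# Toy vocabulary and hypothetical raw scores (logits) for the blank in
# "The heart pumps _____ through the body."
vocab = ["blood", "water", "air", "oxygen"]
logits = np.array([2.0, 0.1, -1.0, 0.5])

# Softmax turns the scores into a probability distribution over the vocabulary.
probs = np.exp(logits) / np.exp(logits).sum()

# Training minimizes cross-entropy: the negative log-probability the model
# assigns to the correct word ("blood"). Lower loss = more confident and right.
loss = -np.log(probs[vocab.index("blood")])
print(vocab[int(probs.argmax())])  # → blood
```

Repeating this update across billions of blanks is what teaches the model grammar, context and factual associations.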

2. Instruction Tuning

Next, the model practices following human directions by studying examples of questions and appropriate responses.

For example:

  • "Summarize this article."

  • "Explain diabetes to a 10-year-old."

  • "Write a discharge summary."

This stage improves the model's helpfulness and responsiveness in real-world AI tasks.
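Instruction-tuning datasets typically pair a directive with a desired response. The field names below follow a common community convention but are illustrative only; real datasets vary.

```python
import json

# Hypothetical instruction-tuning examples in a common prompt/response style.
examples = [
    {"instruction": "Summarize this article.",
     "input": "<article text>",
     "response": "<concise summary>"},
    {"instruction": "Explain diabetes to a 10-year-old.",
     "input": "",
     "response": "<simple explanation>"},
]

# Such datasets are often stored one JSON object per line (JSONL).
jsonl = "\n".join(json.dumps(ex) for ex in examples)
print(jsonl.splitlines()[0])
```

Fine-tuning on many such pairs shifts the model from merely continuing text to actually following directions.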

3. Reinforcement Learning

Human reviewers evaluate the model’s answers and give feedback to help it improve, ensuring it responds in ways that are helpful, safe and aligned with human values. This reinforcement learning from human feedback (RLHF) process is critical for AI safety and responsible deployment.
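In RLHF, reviewer feedback often takes the form of pairwise preferences: "answer A is better than answer B." A reward model is commonly trained on these comparisons with a simple pairwise loss, sketched below; the reward values are hypothetical.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss used to train reward models from human feedback.

    The probability that the 'chosen' answer beats the 'rejected' one is
    modeled as sigmoid(r_chosen - r_rejected); training minimizes its
    negative log, pushing the chosen answer's reward higher.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

# A reviewer preferred answer A (reward 1.5) over answer B (reward -0.5):
print(round(preference_loss(1.5, -0.5), 3))  # ≈ 0.127 — low loss, model agrees
```

The language model is then fine-tuned to produce answers that score highly under this learned reward, keeping humans in the loop about what "good" means.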

Together, these stages enable LLMs to generate responses that are context-aware, natural, and increasingly useful across industries, from coding to AI-powered clinical documentation. 

How Large Language Models Are Transforming Healthcare Research at CU Anschutz

Healthcare is one of the fastest-growing areas of large language model research due to the volume, complexity and sensitivity of clinical data.

Understanding how large language models work is only part of the story. Researchers are exploring how generative AI and NLP-powered tools can safely augment clinical workflows, providing actionable insights from complex electronic health record (EHR) data and reducing cognitive load for healthcare providers. 

At the University of Colorado Anschutz Medical Campus, scientists are exploring how LLMs can reduce clinician burnout, summarize patient records, generate clinical notes and assist in patient diagnosis, all while reducing cognitive load for providers.

To learn more, we spoke with Yanjun Gao, PhD, assistant professor of biomedical informatics. 

Q&A With Yanjun Gao, PhD

In your view, how are large language models different from traditional AI systems?

Confusion between AI and LLMs is common, Gao said.

"Many people often confuse LLMs with AI, but LLMs are a specific type within the broader AI category. AI encompasses a wide range of systems that vary based on their input types, tasks and underlying architectures. As discussed before, what sets LLMs apart is that they are primarily trained on massive amounts of text data. Their core training method involves predicting the next word in a sentence, which helps them learn language patterns and contextual meaning. LLMs, known as multimodal models, can also handle images, speech or video, but text is still their primary focus. For example, ChatGPT is great at writing and answering questions, while tools like DALL·E are designed to create images from text prompts."

Many people wonder about privacy. Does sharing personal information with tools like ChatGPT change how the model behaves or learns? Gao said it’s important to distinguish between personalization and retraining.

"While the exact inner workings of models like ChatGPT aren’t publicly disclosed, since companies like OpenAI treat them as proprietary, there’s a general understanding of how personalization might work."

She emphasized that individual users are not retraining the base system. "Although the core model itself isn’t retrained for each user, it can adapt to individual preferences through interaction."

She compared it to everyday human memory. "For example, your personal data doesn’t go into retraining the model itself. Instead, the system may remember your preferences during a session to make the conversation smoother, like a barista remembering your coffee order. This personalization mirrors how humans remember past conversations and tailor future ones, reinforcing the idea that these models mimic certain aspects of human cognition."

Can large language models truly think and reason like humans?

"Yes and no," said Gao.

"LLMs can perform certain tasks like solving some types of math problems or writing code at or above human levels, which suggests they have capabilities that superficially resemble aspects of human cognition, like pattern matching and composing solutions through multiple steps. However, whether they truly think like humans is still an open question."

She pointed to recent research that raises questions about reasoning depth.

"Recently, Apple published the Illusion of Thinking paper, showing that LLMs can struggle with symbolic reasoning under certain conditions. While I believe this critique highlights important limitations, I also see evidence that LLMs can learn useful reasoning heuristics from data."

Gao said the fundamental difference lies in lived experience.

"Crucially, what distinguishes humans from LLMs is that our thinking is shaped not only by logic but also by emotions, personal experiences, and embodied perceptions. While LLMs are trained on vast amounts of text and can mimic emotional language, they don’t actually feel emotions, because true emotion arises not just from the brain but from the entire human body, our nervous system, heart and physical sensations. So, while LLMs can imitate aspects of human thought, they lack the emotional and experiential depth that defines human intelligence."

What impact are large language models already having in healthcare?

"One of the most common real-world applications of LLMs in healthcare today is automated message drafting within electronic health record (EHR) systems, like Epic’s MyChart. The model, such as GPT-4, generates responses to patient messages in this setup. These AI-generated drafts are then reviewed by physicians or nurses, who either approve them as-is or make minor edits before sending. This helps streamline communication between patients and healthcare providers, saving time while keeping a human in the loop for oversight and accuracy."

How are you using large language models in your own research?

"My lab is exploring the application of LLMs in the summarization of EHRs. This is especially valuable in addressing clinician burnout, which is often caused by the overwhelming volume of patient data, particularly in complex cases like elderly patients in the ICU, where new information is generated constantly. Summarization tools help by condensing large amounts of structured data (like vitals and lab results) and unstructured data (like physician notes) into concise, readable summaries. These summaries highlight key diagnoses and developments over a specific time frame, such as the past 48 hours, making it easier for clinicians, especially those coming on shift, to understand a patient’s status quickly. Companies like Epic are actively developing LLM-based summarization systems to improve clinical workflow efficiency and reduce cognitive overload for healthcare providers."

How do you determine whether a large language model is ready to be deployed in a hospital setting?

Deployment decisions depend on both performance and a clear understanding of a model's limitations, Gao said. "In a recent study accepted to the Association for Computational Linguistics (ACL) 2025, my collaborator Dr. Ruizhe Li (University of Aberdeen, UK) and I were among the first to examine a subtle but important issue: anchoring bias in multiple-choice question answering."

The team found a surprising pattern.

"Despite their sophistication, we found that LLMs can make surprisingly simple errors, such as consistently favoring one answer option regardless of content. This behavior likely stems from patterns in the training data, where certain answer formats appeared more frequently."

Rather than retraining the entire model, they targeted specific internal components.

"Rather than retraining the entire model, which is resource-intensive, we developed a novel and lightweight approach akin to 'brain surgery' for LLMs. By identifying the specific neurons and layers responsible for the bias, we manually adjusted the model’s internal weights through coding to reduce the anchoring effect without harming overall performance."

Clinical validation remains essential, she added.

"This process uses inter-rater reliability, where multiple human reviewers assess whether the model’s outputs are accurate, clear, organized, and clinically useful. This human-in-the-loop evaluation is essential for confirming that the model’s performance aligns with real-world expectations and professional standards."

What excites you most about the future of large language models?

Gao is currently exploring whether LLMs could help detect violent tendencies in online speech among teenagers.

"With my colleague, Quintin Myers, I am exploring whether LLMs could be used to detect violent tendencies in online speech from teenagers. Before applying them in such sensitive contexts, we need to determine whether the models themselves exhibited any inherent bias or violent tendencies. In a recent study, we tested several well-known LLMs from different countries by prompting them with morally ambiguous scenarios framed by teenagers of various ethnicities and locations. On the surface, the models responded politely and non-violently. However, a deeper analysis of the models’ internal parameters revealed that many had underlying tendencies toward violent responses, especially when certain personas were used. These tendencies varied significantly across different ethnic and demographic prompts, raising concerns about bias and fairness. The findings suggest that while LLMs may appear neutral in their outputs, their internal decision-making processes can still reflect problematic patterns learned during training, highlighting the need for caution and further research before deploying them in real-world, high-stakes applications like violence detection." 

Frequently Asked Questions

What is the difference between AI and a large language model?
 Artificial intelligence refers to a broad category of computer systems that perform tasks requiring human intelligence. A large language model is a specific type of AI trained primarily on text data to understand and generate language.
How do large language models work?
 Large language models use transformer-based neural networks to predict the next word in a sequence, allowing them to generate coherent, context-aware responses. 
Are large language models used in healthcare?
 Yes. Large language models are being used to summarize electronic health records, draft patient messages and support clinical documentation — with human oversight. 
Can large language models think like humans?
 No. While they can mimic reasoning and emotional language, they do not possess consciousness, lived experience or true understanding. 
Featured Experts

Yanjun Gao, PhD