You’ve probably used tools like ChatGPT or Microsoft Copilot to help brainstorm an idea, plan an upcoming trip or view your horoscope, but have you ever wondered what powers these tools? They are built on something called a Large Language Model (LLM).
LLMs are a type of advanced artificial intelligence (AI) designed to understand and generate human-like text. They’re trained on massive amounts of text, typically scraped from the internet, so they can respond to questions, summarize information, write stories and even assist with complex tasks like coding or medical documentation.
Some of the most well-known LLMs today include GPT-4o, Claude 3, Gemini 1.5 and LLaMA 3, to name a few. These models are transforming how we interact with technology, making it more natural, intuitive and powerful.
At the University of Colorado Anschutz Medical Campus, researchers are using LLMs to summarize patient records, generate clinical notes and even assist in patient diagnosis. To better understand what an LLM is and how it works, we sat down with Yanjun Gao, PhD, assistant professor of biomedical informatics.
What is an LLM?
An LLM is a type of deep learning neural network built on a structure called the Transformer architecture, introduced in 2017. This architecture has become foundational across many areas of AI, including natural language processing (NLP), computer vision and robotics. What sets LLMs apart is that they learn by reading massive amounts of text, like websites, reviews, and social media posts, so they can understand and generate human-like language. Companies like OpenAI and Google use this approach to power tools like ChatGPT and Google AI, which generate responses based on the patterns they've learned from their training.
Can you explain the transformer architecture, especially as it relates to models like ChatGPT?
The Transformer architecture is the core technology behind tools like ChatGPT, and you can think of it like a brain made up of layers with billions of neurons that "light up" in different ways depending on the task, whether it's writing a story or solving a math problem.
These models learn by reading huge amounts of text, which helps them understand how language works and how to respond appropriately. Over time, they develop internal patterns, like mental circuits, that specialize in different types of thinking, such as answering questions or summarizing information. Much as different parts of the human brain handle different tasks, researchers have found that different parts of a Transformer model become active for different kinds of language challenges, showing that the model develops areas of specialization.
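To make the idea concrete, here is a minimal sketch of scaled dot-product attention, the core operation inside every Transformer layer. It uses NumPy with toy sizes and randomly initialized weights; real models stack many such layers with billions of learned parameters.

```python
# Minimal sketch of scaled dot-product self-attention, the operation at
# the heart of each Transformer layer. Toy sizes and random weights for
# illustration only.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """x: (seq_len, d_model) token embeddings for one sentence."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each token attends to every other token
    weights = softmax(scores, axis=-1)
    return weights @ V                        # each output mixes information from the whole sentence

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16                      # e.g., a 5-token sentence
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)  # (5, 16)
```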
How is an LLM different from regular AI?
People often confuse LLMs with AI in general, but LLMs are a specific type of system within the broader AI category. AI encompasses a wide range of systems that vary based on their input types, tasks and underlying architectures.
As discussed before, what sets LLMs apart is that they are primarily trained on massive amounts of text data. Their core training method involves predicting the next word in a sentence, which helps them learn language patterns and contextual meaning. Some LLMs, known as multimodal models, can also handle images, speech or video, but text is still their primary focus. For example, ChatGPT is great at writing and answering questions, while tools like DALL·E are designed to create images from text prompts.
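As an illustration of next-word prediction, the short sketch below asks a small, openly available model (GPT-2, via the Hugging Face transformers library) which tokens it considers most likely to come next. GPT-2 stands in here for the much larger models discussed in this article.

```python
# Hedged illustration of "predict the next word": ask GPT-2 for the
# most likely continuations of a short prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The patient was admitted to the", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # a score for every token in the vocabulary

next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
print([tokenizer.decode(int(i)) for i in top.indices])  # likely continuations, e.g., " hospital"
```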
How is an LLM trained?
LLMs are trained in three main stages, like teaching a student. First, they read massive amounts of text and predict missing words, a phase called pretraining. Next, during instruction tuning, they practice following human directions by studying examples of questions and appropriate responses. Finally, in reinforcement learning, humans review the model’s answers and give feedback to help it improve. This alignment process is crucial for ensuring the model responds in ways that are helpful, safe and aligned with human values in real-world use.
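The outline below sketches those three stages in placeholder Python; the function and method names are invented to show the flow of the pipeline, not a real training implementation.

```python
# Hedged outline of the three training stages described above.
# All names are illustrative placeholders, not a real pipeline.

def pretrain(model, web_text):
    """Stage 1: read huge amounts of text and learn to predict the next word."""
    for document in web_text:
        model.learn_next_word_prediction(document)

def instruction_tune(model, instruction_pairs):
    """Stage 2: study examples of (question, good answer) to learn to follow directions."""
    for question, good_answer in instruction_pairs:
        model.learn_to_imitate(question, good_answer)

def reinforce_from_feedback(model, prompts, human_raters):
    """Stage 3: generate answers, collect human ratings, adjust toward preferred behavior."""
    for prompt in prompts:
        answer = model.generate(prompt)
        reward = human_raters.score(prompt, answer)   # helpful, safe, aligned?
        model.update_from_reward(prompt, answer, reward)
```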
Does sharing personal information with ChatGPT influence the model’s overall training or behavior?
While the exact inner workings of models like ChatGPT aren’t publicly disclosed, since companies like OpenAI treat them as proprietary, there’s a general understanding of how personalization might work. The core model itself isn’t retrained for each user, and your personal data doesn’t go into retraining it. Instead, the system may remember your preferences during a session to make the conversation smoother, like a barista remembering your coffee order. This personalization is similar to how humans remember past conversations and tailor future ones, reinforcing the idea that these models mimic certain aspects of human cognition.
Can LLMs be trained to think and understand like humans?
Yes and no. LLMs can perform certain tasks like solving some types of math problems or writing code at or above human levels, which suggests they have capabilities that superficially resemble aspects of human cognition, like pattern matching and composing solutions through multiple steps. However, whether they truly think like humans is still an open question. Recently, Apple published the Illusion of Thinking paper, showing that LLMs can struggle with symbolic reasoning under certain conditions. While I believe this critique highlights important limitations, I also see evidence that LLMs can learn useful reasoning heuristics from data.
Crucially, what distinguishes humans from LLMs is that our thinking is shaped not only by logic but also by emotions, personal experiences and embodied perceptions. While LLMs are trained on vast amounts of text and can mimic emotional language, they don’t actually feel emotions, because true emotion arises not just from the brain but from the entire human body: the nervous system, the heart and physical sensations. So, while LLMs can imitate aspects of human thought, they lack the emotional and experiential depth that defines human intelligence.
What impact are LLMs having on the healthcare industry?
One of the most common real-world applications of LLMs in healthcare today is automated message drafting within electronic health record (EHR) systems, like Epic’s MyChart. In this setup, a model such as GPT-4 generates draft responses to patient messages. These AI-generated drafts are then reviewed by physicians or nurses, who either approve them as-is or make minor edits before sending. This helps streamline communication between patients and healthcare providers, saving time while keeping a human in the loop for oversight and accuracy.
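A hypothetical sketch of that draft-and-review loop might look like the following; draft_reply() is a stand-in for an LLM call inside the EHR, not a real Epic or OpenAI interface.

```python
# Hedged sketch of the draft-and-review workflow described above.
# draft_reply() is a placeholder, not a real clinical integration.
from typing import Optional

def draft_reply(patient_message: str) -> str:
    # In production this would send the message plus relevant chart
    # context to an LLM; here we return a canned placeholder draft.
    return f"Thank you for reaching out about: {patient_message} ..."

def send_with_clinician_review(patient_message: str,
                               clinician_edit: Optional[str] = None) -> str:
    draft = draft_reply(patient_message)
    # The clinician reviews the AI draft and either approves it as-is
    # or supplies an edited version; the human always has the final word.
    return clinician_edit if clinician_edit else draft

reply = send_with_clinician_review("Can I take ibuprofen with my new prescription?")
print(reply)
```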
How are you utilizing LLMs in your lab?
My lab is exploring the application of LLMs in the summarization of EHRs. This is especially valuable in addressing clinician burnout, which is often caused by the overwhelming volume of patient data, particularly in complex cases like elderly patients in the ICU, where new information is generated constantly. Summarization tools help by condensing large amounts of structured data (like vitals and lab results) and unstructured data (like physician notes) into concise, readable summaries. These summaries highlight key diagnoses and developments over a specific time frame, such as the past 48 hours, making it easier for clinicians, especially those coming on shift, to understand a patient’s status quickly. Companies like Epic are actively developing LLM-based summarization systems to improve clinical workflow efficiency and reduce cognitive overload for healthcare providers.
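To illustrate the general shape of such a system, here is a hedged sketch of how a summarization prompt might be assembled from structured and unstructured chart data. The records are invented and the model call is a placeholder, not the lab’s or Epic’s actual pipeline.

```python
# Hedged sketch: assembling a summarization prompt from structured and
# unstructured EHR data. The records below are invented examples and
# the model call is a placeholder, not a real clinical system.

vitals = [
    {"time": "Day 1 06:00", "hr": 112, "bp": "88/54", "temp_c": 38.9},
    {"time": "Day 1 18:00", "hr": 95,  "bp": "102/60", "temp_c": 37.4},
]
notes = [
    "06:30 MD note: suspected sepsis, broad-spectrum antibiotics started.",
    "19:00 RN note: patient more alert, tolerating oral fluids.",
]

prompt = (
    "Summarize this ICU patient's last 48 hours for the oncoming clinician. "
    "Highlight key diagnoses, interventions and trends.\n\n"
    "Structured data:\n" + "\n".join(map(str, vitals)) + "\n\n"
    "Clinical notes:\n" + "\n".join(notes)
)

# summary = some_llm.generate(prompt)   # placeholder for whichever model is used
print(prompt)
```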
How do you test whether an LLM is ready to be deployed in a hospital setting?
Determining when an LLM is ready for deployment depends on how well it performs in practice and whether its limitations are understood and managed. In a recent study accepted to the Association for Computational Linguistics (ACL) 2025, my collaborator Dr. Ruizhe Li (University of Aberdeen, UK) and I were among the first to examine a subtle but important issue: anchoring bias in multiple-choice question answering.
Despite their sophistication, we found that LLMs can make surprisingly simple errors, such as consistently favoring one answer option regardless of content. This behavior likely stems from patterns in the training data, where certain answer formats appeared more frequently. Rather than retraining the entire model, which is resource-intensive, we developed a lightweight approach akin to “brain surgery” for LLMs. By identifying the specific neurons and layers responsible for the bias, we directly adjusted the model’s internal weights to reduce the anchoring effect without harming overall performance.
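The sketch below shows the general idea of editing a model in place rather than retraining it: once the components linked to an unwanted behavior have been identified, their weights can be scaled down directly. The layer and neuron indices are made up for illustration and are not the ones from the study; GPT-2 is used only because its weights are openly available.

```python
# Hedged sketch of weight-level model editing: scale down the output
# weights of a few specific neurons instead of retraining the model.
# The layer and neuron indices are illustrative, not from the study.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

layer_idx = 7                        # hypothetical layer implicated in the bias
neuron_ids = [42, 108, 311]          # hypothetical MLP neurons in that layer

# In GPT-2, each row of mlp.c_proj.weight is one intermediate neuron's
# contribution to the layer output; damping those rows weakens the neurons.
proj = model.transformer.h[layer_idx].mlp.c_proj.weight
with torch.no_grad():
    proj[neuron_ids, :] *= 0.5

# The edited model would then be re-evaluated to confirm the bias shrinks
# while overall accuracy is preserved.
```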
To ensure these tools are truly effective and trustworthy, especially in high-stakes domains like healthcare, clinicians often validate their outputs, such as summaries. This process uses inter-rater reliability, where multiple human reviewers assess whether the model’s outputs are accurate, clear, organized, and clinically useful. This human-in-the-loop evaluation is essential for confirming that the model’s performance aligns with real-world expectations and professional standards.
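As a small, hedged example of what an inter-rater reliability check can look like, the snippet below computes Cohen’s kappa for two reviewers rating the same set of AI-generated summaries; the ratings are invented for illustration.

```python
# Hedged sketch of an inter-rater reliability check: two clinicians rate
# the same AI-generated summaries, and Cohen's kappa measures how much
# they agree beyond chance. The ratings below are invented examples.
from sklearn.metrics import cohen_kappa_score

# 1 = clinically acceptable, 0 = not acceptable
reviewer_a = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
reviewer_b = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa: {kappa:.2f}")   # values near 1 indicate strong agreement
```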
What excites you most about the future of LLMs?
With my colleague, Quintin Myers, I am exploring whether LLMs could be used to detect violent tendencies in online speech from teenagers. Before applying them in such sensitive contexts, we first needed to determine whether the models themselves exhibit any inherent bias or violent tendencies.
In a recent study, we tested several well-known LLMs from different countries by prompting them with morally ambiguous scenarios framed by teenagers of various ethnicities and locations. On the surface, the models responded politely and non-violently. However, a deeper analysis of the models’ internal parameters revealed that many had underlying tendencies toward violent responses, especially when certain personas were used. These tendencies varied significantly across different ethnic and demographic prompts, raising concerns about bias and fairness. The findings suggest that while LLMs may appear neutral in their outputs, their internal decision-making processes can still reflect problematic patterns learned during training. That is why caution and further research are needed before deploying them in real-world, high-stakes applications like violence detection.