One in five patients arrives with a medical chart that rivals literary giants like Moby Dick, The Fellowship of the Ring, and Les Misérables, weighing in at somewhere around 200,000 words of clinical notes, test results, and history. It’s the narrative equivalent of diving into a sea of data before ever setting sail on treatment.
Large language models (LLMs) are emerging as a promising solution to combat note bloat and reduce physician workload. However, they still struggle to capture all relevant information, which can lead to critical errors in healthcare settings. Evaluation methods for LLM-generated summaries lag, with current automated metrics failing to assess clinical relevance and factual accuracy.
To address this gap, researchers developed the Provider Documentation Summarization Quality Instrument (PDSQI-9), building on the established PDQI-9 framework. Led by Yanjun Gao, PhD, assistant professor of biomedical informatics at the University of Colorado Anschutz Medical Campus, the collaboration included teams from the University of Wisconsin–Madison, UW Health, Epic Systems, and Memorial Sloan Kettering Cancer Center.
“Summarization transforms an overwhelming volume of patient data into clear, concise bullet points, each capturing the most critical updates in a patient’s care,” said Gao. “Our goal is to ensure these summaries support more efficient and streamlined clinical workflows.”
To build a tool for evaluating AI-generated clinical summaries, researchers began with 1.8 million patient visits from UW Health between March and December 2023. After applying privacy filters and selecting patients with at least three prior visits, they curated a diverse sample of 200 patients across 11 specialties. This dataset enabled the generation of over 750 summaries, offering the complexity needed to test AI performance.
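For readers who want a concrete picture, here is a minimal sketch of that kind of cohort filtering and sampling in Python. The table schema, thresholds, and sampling sizes are illustrative assumptions rather than the study’s actual pipeline, and the synthetic data stands in for the de-identified UW Health records.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a de-identified visit table (the real schema is an assumption).
rng = np.random.default_rng(0)
visits = pd.DataFrame({
    "patient_id": rng.integers(0, 5000, size=50_000),
    "specialty": rng.choice([f"specialty_{i}" for i in range(11)], size=50_000),
    "visit_date": pd.to_datetime("2023-01-01")
                  + pd.to_timedelta(rng.integers(0, 365, size=50_000), unit="D"),
})

# Keep the study window (March through December 2023).
window = visits[visits["visit_date"].between("2023-03-01", "2023-12-31")]

# Require at least three visits per patient within the window.
counts = window.groupby("patient_id").size()
eligible = window[window["patient_id"].isin(counts[counts >= 3].index)]

# Sample roughly 200 patients spread across the 11 specialties.
patients = eligible.drop_duplicates("patient_id")
sample = (patients.groupby("specialty", group_keys=False)
                  .apply(lambda g: g.sample(min(len(g), 18), random_state=0)))
print(len(sample), sample["specialty"].value_counts().to_dict())
```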
To evaluate the quality of AI-generated clinical summaries, the team assembled a diverse panel of clinicians, data scientists, and technologists. Together, they developed and refined structured evaluation criteria.
Clinicians then independently reviewed the summaries, answering more than 8,000 questions. These assessments were conducted without collaboration to ensure unbiased, individual judgments. Inter-rater reliability was measured to validate the consistency and robustness of the framework.
“Even trained clinicians don’t always agree on what’s important in a summary,” said Gao. “We need tools that can measure quality despite variability.”
This variability highlighted the importance of inter-rater reliability, which refers to the extent to which different reviewers agree in their evaluations. Ensuring consistent judgment across professionals with varied backgrounds became a cornerstone of the evaluation process.
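To make the idea concrete, here is a minimal sketch of one widely used agreement statistic, Cohen’s kappa, computed for two hypothetical reviewers. The ratings are invented, and the study’s own reliability analysis may rely on different statistics (such as intraclass correlation), so treat this only as an illustration of the concept.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters scoring the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater scored independently at their own base rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in set(rater_a) | set(rater_b))
    return (observed - expected) / (1 - expected)

# Invented five-point ratings of ten summaries by two hypothetical clinicians.
clinician_1 = [5, 4, 3, 5, 2, 4, 4, 3, 5, 1]
clinician_2 = [5, 4, 4, 5, 2, 4, 3, 3, 5, 2]

raw = sum(a == b for a, b in zip(clinician_1, clinician_2)) / len(clinician_1)
print(f"raw agreement: {raw:.2f}")
print(f"Cohen's kappa: {cohens_kappa(clinician_1, clinician_2):.2f}")
```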
Given the inherently subjective nature of summarization, the team focused on crafting evaluation questions that minimized ambiguity and encouraged consistent interpretation. These questions were used to assess summaries across four key dimensions: organization (how well the summary is structured, such as grouping diagnoses, medications, and instructions), clarity (how clearly the information is communicated), accuracy (whether the summary correctly reflects the patient’s clinical details), and usefulness (how helpful the summary is to a clinician preparing for a visit).
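As an illustration of how such a structured rubric might be recorded in software, the sketch below captures one hypothetical reviewer’s scores on the four dimensions. The schema and five-point scale are assumptions made for illustration; the exact PDSQI-9 items and scoring rules are not reproduced here.

```python
from dataclasses import dataclass, field

@dataclass
class SummaryEvaluation:
    """One reviewer's structured rating of a single AI-generated summary (hypothetical schema)."""
    summary_id: str
    reviewer_id: str
    scores: dict = field(default_factory=dict)  # dimension -> 1..5 Likert rating

    def mean_score(self) -> float:
        return sum(self.scores.values()) / len(self.scores)

evaluation = SummaryEvaluation(
    summary_id="summary-042",
    reviewer_id="clinician-07",
    scores={"organization": 5, "clarity": 4, "accuracy": 5, "usefulness": 4},
)
print(evaluation.mean_score())  # 4.5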
Even among experienced clinicians, perspectives can vary: what one expert sees as essential, another might overlook. Gao added, “If two LLMs with different ‘personalities’ can agree on a summary, that’s a strong signal of quality, just like two doctors reaching the same conclusion.”
To validate the tool, the team measured how consistently reviewers rated the same summaries. They aimed for strong alignment on straightforward cases while accepting minor differences on more nuanced ones. For example, a rating difference of four versus five on a five-point scale was acceptable, whereas a one versus five indicated a breakdown in consensus.
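One common way to encode that intuition is a quadratically weighted penalty, where the cost of a disagreement grows with the square of the gap between ratings: on a five-point scale, a four-versus-five split incurs only 1/16 of the maximum penalty, while a one-versus-five split incurs the full penalty. The sketch below uses that standard convention, which may differ from the exact statistic the study reports.

```python
def quadratic_penalty(r1: int, r2: int, k: int = 5) -> float:
    """Quadratically weighted disagreement on a k-point scale (0 = agreement, 1 = maximal)."""
    return (r1 - r2) ** 2 / (k - 1) ** 2

print(quadratic_penalty(4, 5))  # 0.0625 -> a near miss barely counts against agreement
print(quadratic_penalty(1, 5))  # 1.0    -> a full-scale split counts as total disagreement
```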
By refining the evaluation criteria to reduce ambiguity, the team was able to minimize extreme disagreements and ensure the tool could reliably distinguish between high- and low-quality summaries. This approach mirrors the complexity of real-world clinical practice, where perfect agreement is rare, but a shared understanding of what matters most is essential.
The PDSQI-9 performed exceptionally well. It showed strong agreement among different reviewers, consistency in how it measured quality, and a clear structure that captured the most important aspects of a good clinical summary, like clarity, accuracy, organization, and usefulness. Most importantly, it was able to tell the difference between high-quality and low-quality summaries, which makes it a valuable tool for health systems looking to evaluate how well AI models perform.
These results show that the PDSQI-9 is a reliable and valid way to assess AI-generated clinical summaries. That means it can help healthcare organizations safely and effectively bring LLMs into real-world settings, supporting doctors, improving workflows, and ultimately helping patients, all while keeping quality and safety front and center.
This study is a step toward reimagining how clinicians interact with information. As major tech companies like Epic and Google race to develop AI-driven tools for healthcare, clinical summarization has emerged as a key focus. These companies aim to streamline provider workflows by generating concise, accurate summaries of patient histories.
“The essential question is how do we know these systems are truly effective?” said Gao. “Before we can build the product, we need to be confident it works, and that’s exactly what this tool is designed to determine.”
By developing and validating the PDSQI-9 and creating a robust evaluation dataset, the team has laid the groundwork for assessing the quality of AI-generated clinical summaries in a standardized, reliable way. The hope is that this framework will not only guide future research but also inform the development of real-world products, tools that could one day be integrated into electronic health record systems like Epic’s, helping clinicians make faster, more informed decisions without losing the human touch.