For William Mundo, MD, MPH, adjoint clinical instructor in emergency medicine, the challenges of language barriers in healthcare are both professional and deeply personal.
“As a bilingual clinician, and as the son of two immigrants who did not speak English, this work is deeply personal,” said Mundo. “I saw firsthand how language barriers affected my own family’s ability to navigate healthcare, and I see those same challenges in my patients every day.”
Those experiences helped shape a new study examining how artificial intelligence translates emergency department discharge instructions for Spanish-speaking patients – and whether current evaluation methods can reliably determine if those translations are safe.
Led by Mundo and co-advised by Elizabeth Goldberg, MD, ScM, FACEP, associate professor of emergency medicine and associate vice chair of research at the University of Colorado Anschutz School of Medicine Department of Emergency Medicine, the study investigates discrepancies between automated evaluation metrics and clinician review when assessing large language model (LLM)-translated Spanish discharge instructions. It was presented at the 2026 American Medical Informatics Association (AMIA) Amplify Conference.
Why Discharge Instructions Matter
Discharge instructions are among the most important patient-facing documents provided at the end of an emergency department visit. They often include medication directions, follow-up recommendations, guidance on symptom monitoring, and warnings about when to seek immediate medical attention. For patients with limited English proficiency, accurately translated instructions can play a critical role in supporting safe outpatient care.
“In emergency medicine one of the highest risk moments is when patients are discharged,” said Goldberg. “The care transition from the ED to home or the ED to a facility is fraught with many problems even when clinicians and nurses try to optimize communication.”
Unlike inpatient settings, emergency departments focus on addressing acute medical concerns, providing stabilization, and ensuring patients can safely transition to the next stage of care. Patients are then expected to manage their care independently after discharge, frequently following brief, high-stress clinical encounters with limited opportunity for clarification.
Researchers say those challenges become even greater when language barriers are involved.
“Clinicians are trying to do the right thing by using translation tools like Google Translate, but we know these to be inaccurate at times,” Goldberg explained. “Patients may not comprehend them well enough to act on the clinician’s intended advice.”
The growing availability of LLMs has created new opportunities to improve language-concordant communication in healthcare, particularly in busy emergency departments where interpreter services and professional translation resources may be limited. But the study’s authors caution that translation quality must be evaluated carefully before such systems are widely deployed in clinical care.
Comparing Automated Metrics and Human Review
To assess translation quality, the research team compared automated scoring systems with structured human review. Automated metrics included BLEU, NIST, ROUGE, and concept-based embedding cosine similarity – tools commonly used in machine translation and natural language processing research.
According to Yanjun Gao, PhD, assistant professor of biomedical informatics and co-director of the CU Anschutz Center for Health AI, the metrics were selected to represent both traditional word-overlap approaches and newer semantic similarity methods.
BLEU, NIST, and ROUGE evaluate how closely a translated sentence resembles a reference translation by measuring shared words and phrases. NIST differs slightly by assigning greater weight to rarer, more informative word sequences, which can include clinically meaningful terms. Embedding cosine similarity attempts to measure meaning preservation by comparing translations in semantic vector space rather than relying solely on exact wording.
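As a rough illustration of the two families of metrics described above, the sketch below computes a clipped n-gram precision (the core ingredient of BLEU) and a cosine similarity over simple word-count vectors. It is a simplified stand-in, not the study's implementation: full BLEU combines several n-gram orders with a brevity penalty, and an embedding-based score would use a learned embedding model rather than word counts. The function names and example sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also appear in the reference.

    Counts are clipped so a repeated candidate n-gram is credited at most
    as many times as it occurs in the reference (as in BLEU).
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def cosine_similarity(a, b):
    """Cosine similarity between word-count vectors of two sentences."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical reference and candidate Spanish translations.
reference = "tome una tableta cada ocho horas"
candidate = "tome una pastilla cada ocho horas"

print(clipped_precision(candidate, reference, n=1))  # unigram overlap
print(clipped_precision(candidate, reference, n=2))  # bigram overlap
print(cosine_similarity(candidate, reference))       # bag-of-words cosine
```

In this toy case a single synonym swap ("tableta" versus "pastilla") lowers the bigram score more than the unigram score, even though the meaning is unchanged. That sensitivity to surface wording is what motivates pairing overlap metrics with a semantic measure.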
The team also conducted human evaluation using the Multi-Dimensional Quality Metrics framework, or MQM, which assesses dimensions such as accuracy, terminology, fluency, and audience appropriateness. Clinicians reviewed translations to determine whether the instructions preserved clinically important meaning and remained understandable for patients.
Where Metrics Fell Short
The researchers found that automated metrics and human judgments did not always align.
“We found that automated metrics and MQM often suggest high quality translations using AI, but clinician review identified disagreement in meaning-sensitive areas like accuracy and fluency,” said Mundo.
Many of the discrepancies involved errors with potentially serious clinical implications. Automated metrics frequently failed to identify mistranslated medication instructions, omitted return precautions, or mistakes involving negation and urgency.
“One example would be translating ‘take with food’ as ‘do not take with food,’” explained Gao. “Because there is substantial word overlap, the translation may still score acceptably despite representing a potentially harmful instruction.”
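The failure mode Gao describes can be reproduced with a toy overlap score. The sketch below computes a ROUGE-1-style unigram precision, recall, and F1 between a hypothetical Spanish reference and a candidate that adds a negation. The sentences and the function name are illustrative assumptions, not data from the study.

```python
from collections import Counter


def unigram_overlap(candidate, reference):
    """Return (precision, recall, f1) over clipped unigram matches."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matched = sum((cand & ref).values())  # multiset intersection
    if matched == 0:
        return 0.0, 0.0, 0.0
    precision = matched / sum(cand.values())
    recall = matched / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = "tome con alimentos"     # "take with food"
candidate = "no tome con alimentos"  # "do not take with food"

precision, recall, f1 = unigram_overlap(candidate, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Every word of the reference appears in the candidate, so recall is 1.0 and precision is 0.75, even though the single added word reverses the instruction.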
Researchers also noted that general-domain embedding models may struggle to distinguish subtle but clinically significant differences, such as “daily” versus “twice daily” or “mild discomfort” versus “severe pain.”
To better characterize the impact of these issues, clinician reviewers categorized errors according to their potential clinical risk, distinguishing between stylistic issues, ambiguous wording, and errors that could plausibly alter patient behavior or compromise safety.
Building Better Evaluation Frameworks
The study additionally examined factors influencing translation performance, including model selection, prompt design, and source-text complexity. Larger models with stronger instruction-following capabilities generally produced more coherent translations, especially for longer discharge instructions involving multiple medications or conditional recommendations. Explicit prompting about clinical context and patient audience also improved fluency and readability during human review.
Still, the researchers emphasize that automated evaluation alone is insufficient in high-stakes clinical settings.
“Our findings suggest these tools should not be deployed based on automated scores alone,” said Mundo. “Health systems need a layered evaluation approach: automated metrics for model selection, structured MQM review for domain-level assessment, and targeted human oversight for high-risk content where clinical meaning matters most.”
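One way to picture a layered approach of this kind is as a simple routing rule. The sketch below is purely illustrative and adapts the idea to per-translation routing, which the study does not describe: the threshold, the high-risk cue list, and the function and field names are all hypothetical choices made for this example.

```python
import re
from dataclasses import dataclass

# Hypothetical cues for high-risk content (negation, dosing, return precautions).
HIGH_RISK_CUES = {"no", "nunca", "mg", "dosis", "urgencias", "inmediatamente"}


@dataclass
class Translation:
    text: str
    automated_score: float  # assumed overlap or embedding score in [0, 1]
    mqm_passed: bool        # assumed outcome of a structured MQM review


def route(translation, min_score=0.5):
    """Return the next step for a translation under a layered review."""
    if translation.automated_score < min_score:
        return "reject"            # layer 1: automated screening
    if not translation.mqm_passed:
        return "revise"            # layer 2: structured MQM review
    words = set(re.findall(r"\w+", translation.text.lower()))
    if words & HIGH_RISK_CUES:
        return "clinician_review"  # layer 3: targeted human oversight
    return "accept"


print(route(Translation("no tome con alimentos", 0.86, True)))
```

Here a translation that scores well automatically and passes MQM review is still sent to a clinician because it contains a negation, reflecting the point that high automated scores alone do not establish safety.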
Ultimately, the study argues that evaluating clinical translation requires more than measuring textual similarity.
“A discharge instruction is not successful because it is similar to a reference translation,” said Gao. “It is successful if a patient with limited English proficiency can understand it, trust it, and act on it correctly.”