- CU Anschutz researchers trained open-source AI models using synthetic (realistic but fake) medical text to automate repetitive radiology tasks—saving time and protecting patient privacy.
- AIDA, a new AI-powered documentation tool, is now helping CU radiologists streamline their workflow.
- The team plans to apply this method to other specialties and test it in real hospital settings.
- Synthetic data could make healthcare AI faster, safer and more accessible—without compromising patient confidentiality.
Radiologists spend a significant portion of their day dictating and organizing medical reports, especially when evaluating thyroid nodules. These nodules are small lumps that form in the thyroid gland, and radiologists use a standardized system called the American College of Radiology Thyroid Imaging Reporting and Data System (ACR TI-RADS) to describe and classify them. TI-RADS helps determine whether a nodule is likely to be benign or cancerous, guiding decisions about follow-up care or biopsy.
The TI-RADS system requires radiologists to describe specific features of each nodule, such as its shape, size and composition, and then enter that information into a structured template. While this process is essential for patient care, it’s also highly repetitive and time-consuming.
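To make the scoring concrete: ACR TI-RADS assigns points across five feature categories (composition, echogenicity, shape, margin, echogenic foci), and the point total maps to a TR risk level. A minimal sketch of that mapping, based on the published ACR chart (the cutoffs here follow the chart; edge handling for rare 1-point totals varies between implementations):

```python
def tirads_level(total_points: int) -> str:
    """Map an ACR TI-RADS point total to its TR risk level."""
    if total_points == 0:
        return "TR1"  # benign
    if total_points <= 2:
        return "TR2"  # not suspicious
    if total_points == 3:
        return "TR3"  # mildly suspicious
    if total_points <= 6:
        return "TR4"  # moderately suspicious
    return "TR5"      # highly suspicious
```

The TR level, together with nodule size, then drives the follow-up recommendation (no action, follow-up ultrasound, or biopsy).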
Now, a team of researchers at the University of Colorado Anschutz (CU Anschutz) has developed a promising method to train artificial intelligence (AI) models to perform this task using synthetic data that mimics real medical information, without exposing any actual patient details. The study was published in Nature in July 2025.
Meet the Team Behind the Innovation
The research team includes Aakriti Pandita, MD, assistant professor of medicine and secondary faculty member in the Department of Biomedical Informatics at the CU Anschutz School of Medicine. Pandita and her colleagues saw an opportunity to leverage open-source large language models (LLMs) to streamline this process, without compromising patient privacy.
“Many of the repetitive tasks consuming clinicians’ time away from patients are ideal for automation,” said Pandita. “Training models safely requires large datasets, which are hard to obtain due to privacy concerns. Synthetic data offers a way forward.”
This research tackles a major bottleneck in radiology by automating repetitive documentation tasks, allowing clinicians to spend more time on direct patient care and complex decision-making.
How Synthetic Data Works and Why It Matters
To train their AI models, the team needed thousands of examples of thyroid nodule reports. However, using real patient data would raise significant privacy concerns. Instead, they programmatically generated 1,000 unique thyroid nodules using a Python script. Each nodule included different combinations of features from the TI-RADS system, like whether the nodule had calcifications or irregular borders.
To make the data more realistic, they randomly excluded certain features that often occur in real dictations and used GPT-4, a powerful AI model, to generate three different versions of the radiologist’s dictation for each nodule. These dictations varied in length, phrasing and even included occasional errors, just like real-world reports.
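The generation step described above can be sketched roughly as follows. This is an illustrative reconstruction, not the study's actual script: the feature values come from the TI-RADS lexicon, and the dropout rate used to mimic incomplete real-world dictations is an assumed parameter.

```python
import random

# Illustrative TI-RADS feature categories and example values (assumption:
# the study's script sampled from a similar, fuller lexicon).
TIRADS_FEATURES = {
    "composition": ["cystic", "spongiform", "mixed cystic and solid", "solid"],
    "echogenicity": ["anechoic", "hyperechoic", "isoechoic", "hypoechoic"],
    "shape": ["wider-than-tall", "taller-than-wide"],
    "margin": ["smooth", "ill-defined", "lobulated", "irregular"],
    "echogenic_foci": ["none", "comet-tail artifacts",
                       "macrocalcifications", "punctate echogenic foci"],
}

def make_nodule(rng: random.Random, dropout: float = 0.2) -> dict:
    """Sample one synthetic nodule: a random feature combination, with
    some features randomly omitted to mimic real-world dictations that
    leave certain findings unstated."""
    nodule = {name: rng.choice(values)
              for name, values in TIRADS_FEATURES.items()}
    return {k: v for k, v in nodule.items() if rng.random() > dropout}

rng = random.Random(42)  # fixed seed for reproducibility
nodules = [make_nodule(rng) for _ in range(1000)]
```

Each synthetic nodule would then be passed to a language model to produce several differently worded dictations, yielding the 3,000 training examples.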
In total, they created 3,000 synthetic dictations. A sample of these was manually reviewed to ensure they were clinically accurate and free of "hallucinations", a term used in AI to describe made-up or incorrect information. The model was also tested using real-world data from MIMIC-III, a publicly available, de-identified medical database.
By using synthetic data, the team demonstrated a privacy-preserving way to train AI models without ever exposing real patient information, an essential step toward safe and scalable healthcare AI.
Protecting Patient Privacy with AI
One of the biggest advantages of this approach is that it keeps patient data safe. Because the AI models are trained on synthetic data and hosted on secure hospital servers, there’s no need to send sensitive information to external cloud services. This gives hospitals more control over how data is used and stored.
“This approach can reduce reliance on third-party hosted proprietary models, thus eliminating patient privacy concerns,” added Pandita. “Building an army of fine-tuned expert models, operating in an agentic fashion, is one approach to AI being used in large, complex healthcare settings.”
The team has already put their research into practice by developing a tool called the Artificial Intelligence Documentation Assistant (AIDA), which is integrated into the hospital’s electronic health record system and helps radiologists complete documentation tasks more efficiently. It’s currently being tested in a clinical pilot funded by the Colorado SPARK award and the CU Office of Faculty Experience.
What’s Next: Scaling AI Across Medicine
The researchers believe this approach can be expanded to other areas of radiology, such as LI-RADS (used for liver imaging) and Bosniak scoring (used for kidney cysts), as well as other specialties like pathology and general clinical notes.
They also see exciting potential in using AI to convert old free-text reports into structured formats. This could help hospitals and researchers build large, searchable databases for studies, registries and quality improvement programs, without having to manually reprocess thousands of records.
"Once an open model is fine-tuned and validated, it can be used to batch-process historical free-text reports to produce structured datasets for research, registries and quality improvement. Human checks and sampling will be needed to ensure quality," said Pandita. "Our results suggest this is feasible, but real-world scale conversion would require site-specific validation."
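The workflow Pandita describes — batch conversion with human spot checks — could be sketched as below. This is a hypothetical outline, not AIDA's implementation: `extract_structured` is a placeholder for a call to the locally hosted, fine-tuned model, and the 5% review fraction is an assumed parameter.

```python
import random

def extract_structured(report_text: str) -> dict:
    """Placeholder: in practice this would call the validated,
    locally hosted fine-tuned model and parse its structured output."""
    return {"source_text": report_text, "fields": {}}

def batch_convert(reports, review_fraction=0.05, seed=0):
    """Convert free-text reports to structured records, routing a
    random sample to a human-review queue for quality assurance."""
    rng = random.Random(seed)
    structured, review_queue = [], []
    for report in reports:
        record = extract_structured(report)
        structured.append(record)
        # Sample a fraction of outputs for human quality checks.
        if rng.random() < review_fraction:
            review_queue.append(record)
    return structured, review_queue
```

The key design point is that automation and oversight travel together: every record is machine-generated, but a random slice always reaches a human reviewer, supporting the site-specific validation Pandita calls for.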
To scale this approach responsibly, institutions will need to invest in infrastructure for model deployment, validation workflows and governance frameworks that support continuous improvement.
"Synthetic data will be a major accelerant. It lowers data-sharing barriers, enables fine-tuning of open models for narrow clinical tasks, and helps create labeled datasets for supervised learning at scale," emphasized Pandita. "Combined with careful validation and governance, it can democratize AI development across healthcare systems while reducing privacy risk, but responsible practices and ongoing validation will remain essential."