<img height="1" width="1" style="display:none" src="https://www.facebook.com/tr?id=799546403794687&amp;ev=PageView&amp;noscript=1">

New Study Shows Pros vs. Cons of Using ChatGPT in Research Process

AI-generated annotated bibliographies may increase efficiency but are more likely to have errors, according to research by a CU Anschutz ophthalmology assistant professor.


by Tayler Shaw | April 13, 2026

When University of Colorado Anschutz Department of Ophthalmology researcher Riaz Qureshi, PhD, MS, sets out on a new project, a key step in the investigative process is reading through previous research to find relevant facts and information. When he finds a useful research article, he adds an entry to an annotated bibliography: a summary of the article’s main findings and a description of the article’s context and importance in relation to other studies.

“An annotated bibliography is a really useful tool, because when you read a research article, it’s unlikely you’ll be able to remember the key points later on. This helps me organize my thoughts about each paper that I can later reference when I’m conducting and writing my own research,” says Qureshi, an assistant professor of ophthalmology at CU Anschutz and assistant professor of epidemiology at the Colorado School of Public Health.

Annotated bibliographies are commonly used by both researchers and clinicians to help inform their work. A challenge, however, is how much time it takes to comb through and summarize these papers. When a student at the Johns Hopkins Bloomberg School of Public Health used ChatGPT to create an annotated bibliography rather than writing it themselves, the student’s mentor was prompted to investigate whether this artificial intelligence tool could be a time-saving alternative.

“A lot of people are already using AI for this purpose,” Qureshi says. “Everyone has limited time, but we don’t want to sacrifice validity for increased efficiency. There was a need to look at this and assess if there are causes for concern with using these tools in this way.”

Faculty at Johns Hopkins University asked Qureshi, who earned his PhD at the university, to help them compare the quality and accuracy of annotated bibliographies created by humans versus those created by large language models, a type of AI system trained on vast amounts of data to respond to prompts. The study, published in the journal JMIR Formative Research, ultimately found that AI-generated and human-generated annotations captured the main points of biomedical articles similarly well. However, there were some notable differences, including an increased risk of errors in the AI annotations.

AI vs. human work

For their comparative study, Qureshi and his co-investigators selected 15 biomedical articles from a variety of scientific fields, from biology to mathematical sciences.

“We wanted to have a systematic approach to this and not focus on just one area of science because it is so varied,” Qureshi says.

The investigators used three different versions of ChatGPT to create annotated bibliographies for each of the 15 papers. They also had two humans — a public health professor and a recent public health master’s degree graduate — develop their own annotated bibliographies on the same set of publications.

“We compared the two humans, who have different levels of experience writing these bibliographies, to the three different large language models by using quantitative and qualitative assessments,” he says.
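As a loose illustration of what generating an annotation like this might look like in script form (not the study’s actual procedure), here is a minimal sketch using the OpenAI Python SDK. The model name, prompt wording, and annotate helper are illustrative assumptions:

    # Illustrative sketch only: the study's exact prompts and ChatGPT
    # versions are not reproduced here. Assumes the OpenAI Python SDK
    # ("pip install openai") and an OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()

    PROMPT = (
        "Write an annotated bibliography entry for the article below. "
        "Summarize its main findings, comment on the quality of its methods, "
        "and describe how it fits into the wider literature on the topic.\n\n"
        "Title: {title}\n\nAbstract: {abstract}"
    )

    def annotate(title: str, abstract: str, model: str = "gpt-4o") -> str:
        """Draft an annotation for one article; a human must verify it."""
        response = client.chat.completions.create(
            model=model,  # placeholder model name, not one from the study
            messages=[{"role": "user",
                       "content": PROMPT.format(title=title, abstract=abstract)}],
        )
        return response.choices[0].message.content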

Based on their evaluations, the investigators found that the humans wrote shorter annotations that were easier to interpret than ChatGPT’s. However, when it came to assessing the context of the publications — meaning that the annotation provided insight into the quality of the research and described how the study fits into the bigger picture of other studies on the topic — the ChatGPT annotations proved slightly superior.

“That made sense to me, because with humans, we’re oftentimes not familiar with all the literature, so our ability to contextualize a paper is likely to be lower compared to large language models that have the whole internet available to them,” he says. “I’d expected AI to be able to bring in additional information from outside sources. But that also increases the possibility of the AI hallucinating or making factually incorrect statements, which is something we saw happen.”

The study showed that ChatGPT’s discussion of the quality and context of the publications was not always accurate, and there were more errors in the ChatGPT annotations. This can be problematic, given that researchers and clinicians may rely on the information in annotated bibliographies when conducting their own research or making decisions about patient care.

Recommendations moving forward

Given the study’s findings, Qureshi and his co-investigators propose that clinicians and researchers may use AI to generate annotated bibliographies, provided a person always checks the information that the AI generates.

“You don’t want to use any information in your work unless you have verified it,” he says.

In his work, Qureshi finds annotated bibliographies to be especially helpful when he is diving into an unfamiliar topic. As a methodologist, his research often focuses on exploring how clinical evidence is collected and informs patient care, specifically looking for ways to improve the research methods being used and evaluating the quality of other studies.

“Much of my work is looking at the effectiveness of interventions and the quality of systematic reviews related to the eyes and vision. I help assess the methodological limitations and provide recommendations for areas that can be improved,” he says. “I also work with the American Academy of Ophthalmology to help provide them with reliable, good-quality systematic reviews to help inform their preferred clinical practice patterns.”

When Qureshi is tasked with a project on a new topic, such as a systematic review for an ocular disease he hasn’t studied before, annotated bibliographies help him get oriented and create a “little library” of resources that he can reference.

“For me, I think large language models will be most useful in helping give a sense of which papers will offer the information I am looking for,” he says. “Having AI generate annotations about multiple different papers may help reveal which papers I should review and where to focus my time. It’s all about increasing efficiency.”
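Building on the hypothetical annotate helper sketched above, that triage step might amount to drafting annotations for a batch of candidate papers; the titles and abstracts below are placeholders:

    # Hypothetical batch triage: draft annotations for several candidate
    # papers to decide which ones deserve a full read. Each draft still
    # needs human verification before it informs any research decision.
    candidates = [
        {"title": "Placeholder trial of intervention X", "abstract": "..."},
        {"title": "Placeholder cohort study of outcome Y", "abstract": "..."},
    ]

    for paper in candidates:
        draft = annotate(paper["title"], paper["abstract"])
        print(f"--- {paper['title']} ---\n{draft}\n")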

In the future, Qureshi hopes to see researchers analyze whether AI could be a valuable tool for synthesizing multiple research papers at once.

“Summarizing a single research paper is one thing, but I think the next thing we need to explore is the ability of AI to summarize multiple papers and produce a useful synthesis. That would be the greatest value to us as systematic reviewers,” he says. “Currently, based on our research, the main takeaway is that AI can be helpful, but I don’t recommend relying on it. Use it, recognize the limitations it has, and be transparent about when and how you use it.”

Featured Experts

Riaz Qureshi, PhD, MS