Programming can be a language of its own.
“You can get two people speaking the same language but speaking two different dialects within programming, and they will not be able to communicate at all,” says Peter DeWitt, PhD, assistant research professor of biomedical informatics at the University of Colorado School of Medicine, who has written programs and worked with data in more languages and dialects than he can recall.
“And none of them play well with each other,” he jokes.
That difficulty of getting computer languages to communicate is one reason why sharing data, along with the infrastructure required to read it, has become so important to medical researchers like DeWitt and other faculty members in the Department of Biomedical Informatics (DBMI).
It can help connect researchers across projects and specialties, increase transparency and accountability, and lead to impactful discoveries.
Overcoming challenges together
Data sharing can make more analysis possible.
For DeWitt, a trained biostatistician, open data allows him to look for ways an outside dataset may jibe with the data his colleagues are working on, or for holes that could prevent the two sets of information from being a good fit.
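As a rough illustration of what that kind of fit-checking can look like in practice, the short Python sketch below compares two hypothetical datasets on their shared fields and missing values. The column names, thresholds, and example values are invented for the illustration, not drawn from DeWitt's own work.

```python
# Illustrative sketch of a dataset "fit" check; column names and data are invented.
import pandas as pd

def compatibility_report(outside: pd.DataFrame, local: pd.DataFrame) -> dict:
    """Compare an outside dataset with a local one on shared fields and missingness."""
    shared = sorted(set(outside.columns) & set(local.columns))
    missing_from_outside = sorted(set(local.columns) - set(outside.columns))
    # Heavily missing shared fields are the "holes" that can keep two datasets from fitting together.
    holes = {col: round(float(outside[col].isna().mean()), 2)
             for col in shared if outside[col].isna().mean() > 0.2}
    return {
        "shared_fields": shared,
        "missing_from_outside": missing_from_outside,
        "sparse_shared_fields": holes,
    }

# Invented example data: an outside dataset with gaps next to a small local one.
outside = pd.DataFrame({"age": [54, None, 61], "dx_code": ["E11", "I10", None]})
local = pd.DataFrame({"age": [47, 58], "dx_code": ["E11", "E11"], "a1c": [7.2, 6.8]})
print(compatibility_report(outside, local))
```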
But lingering hurdles, like protecting patient identity, still present challenges for researchers.
“We can censor data as much as we want, but the more you do in order to really protect it, the less usable the data can become,” DeWitt says.
Last year, DBMI faculty members reached a major milestone by developing standards and tools for creating phenopackets that can capture clinical data about abnormal traits, diagnoses, and treatments from electronic health records and make it sharable without compromising privacy.
More than 1 million phenopackets have now been created, expanding the field's power to advance medical innovation.
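As a concrete, simplified picture of what a phenopacket holds, the sketch below writes a phenopacket-like record out as a Python dictionary. It is illustrative only; it is not the exact published phenopacket standard the team implemented, and every identifier and value in it is invented for the example.

```python
# Illustrative only: a simplified, phenopacket-like record, not the exact standard.
# All identifiers and values below are invented for the example.
example_phenopacket = {
    "id": "example-phenopacket-001",
    "subject": {"id": "patient-anon-123", "sex": "FEMALE"},   # de-identified subject
    "phenotypicFeatures": [                                    # abnormal traits, recorded as ontology terms
        {"type": {"id": "HP:0001250", "label": "Seizure"}},
    ],
    "diseases": [                                              # diagnoses pulled from the health record
        {"term": {"id": "EXAMPLE:0001", "label": "Example epilepsy diagnosis"}},
    ],
    "medicalActions": [                                        # treatments
        {"treatment": {"agent": {"id": "EXAMPLE:drug-01", "label": "Example anticonvulsant"}}},
    ],
}
```

Bundling traits, diagnoses, and treatments into one structured, de-identified record like this is what lets the same patient-level information travel between institutions without exposing who the patient is.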
“We will be able to see people making discoveries that weren’t previously possible,” says Monica Munoz-Torres, PhD, associate professor of biomedical informatics who worked on the standard and tools. “They will be providing us with new knowledge.”
Building infrastructure to share and educate
The ability to effectively collaborate also goes beyond raw data. Developing workflows that can be reused or replicated in other areas of research is an important focus for many DBMI data researchers.
In a recent article published in Scientific Data, DeWitt, Meg Rebull, DBMI research program manager, and Tell Bennett, MD, PhD, professor of biomedical informatics and pediatric critical care, describe data sharing as a necessity to maximize the actionable knowledge generated from research data.
“Given the potential to develop scientific and clinical knowledge and the NIH emphasis on data sharing and reuse, there is a need for inexpensive and computationally lightweight methods for data sharing and hosting data challenges,” the DBMI researchers write. “To fill that gap, we developed a workflow that allows for reproducible model training, testing, and evaluation.”
The team leveraged public GitHub repositories, open-source computational languages, and Docker technology to complete the project. They also conducted a data challenge using the infrastructure they developed.
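As a minimal sketch of what the entry point of such a workflow could look like, the Python script below trains and evaluates a simple model with a fixed random seed so that a rerun inside the same container reproduces the same numbers. The file names, model choice, and metric are assumptions made for illustration, not the team's actual code.

```python
# Minimal sketch of a reproducible train/test/evaluate entry point meant to run
# unchanged inside a container. File names, model choice, and metric are
# illustrative assumptions, not the DBMI team's actual workflow.
import json
import os

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

SEED = 2024  # fixed seed so every run produces the same split and the same model


def main() -> None:
    data = pd.read_csv("data/shared_dataset.csv")  # dataset published alongside the repository
    X, y = data.drop(columns=["outcome"]), data["outcome"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=SEED, stratify=y
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    metrics = {"auroc": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])}
    os.makedirs("results", exist_ok=True)
    with open("results/metrics.json", "w") as fh:  # output that challenge entrants can compare
        json.dump(metrics, fh, indent=2)


if __name__ == "__main__":
    main()
```

Pairing a script like this with pinned dependencies and a Dockerfile in a public GitHub repository is what makes the training, testing, and evaluation repeatable by anyone who pulls the repository.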
The hope is that the workflow they created can be used by other researchers.
“These methods have the potential to increase the impact of shared research data,” they say. “In addition, this approach may be useful in computational training programs, as data challenge-type exercises are popular and effective components in many courses.”