If data were a traded commodity like corn and soybeans, its market price would be sky-high. Worldwide appetite for it is keen in every business sector and promises only to accelerate.
The health care world is no exception. Data is the foundation of evidence-based medicine and clinical research. So it surely follows that when it comes to advancing health care, preventing disease, and addressing public health issues, the more data we have, the better off we are.
Not so fast. Ever-deepening stores of data aren’t assets without efficient ways to access and analyze the information accurately. That requires harnessing technology that digs meaningful nuggets from mountains of data—helping clinicians make better-informed decisions about patients they treat, and assisting researchers investigating complex problems of disease and treatment, for example.
Those challenges are top of mind in the Department of Biostatistics and Informatics at ColoradoSPH. The department focuses on providing analytical and study design support for a wide range of medical and public health research. One key area is imaging analysis, said Debashis Ghosh, PhD, chair of the department.
“Every imaging modality—MRIs, CTs, PET scans—generates rich data,” Ghosh said. Rich, yes, but also raw, he added, and filled with “noise”: data that is not relevant to the researcher working on distinguishing lung nodules from healthy tissue, for example. Biostatisticians can offer help with software designed to visually enhance the relevant portions of an image—such as diseased tissue.
“That information can be used to compare groups of people with and without disease,” Ghosh said.
Sarah Ryan, a PhD student in the biostatistics and informatics program, works with Lisa Maier, MD, MSPH, a pulmonologist and chief of the Division of Environmental and Occupational Health Sciences at National Jewish Health, to improve diagnosis of pulmonary sarcoidosis, a disease that inflames and scars the delicate tissues of the lungs.
“Pulmonary sarcoidosis is considered underrecognized,” Ryan said.
Ryan set out with Maier to find ways to help radiologists find signs of the disease in lung images and make clinical judgments about it. That meant identifying the specific features of abnormal lung tissue and using software to highlight those visual clues in chest CT scans.
“The challenge is all about how we can quantify disease and find its location in the lungs using objective methods,” Ryan said. That work resulted in her master’s thesis, which addressed using the image features of pulmonary sarcoidosis to differentiate abnormal and healthy lung tissue.
With Ghosh’s help, Ryan then went on to work with an imaging group at the Johns Hopkins Bloomberg School of Public Health on a new problem: creating a pipeline for analyzing lung images. They did this by developing user-friendly software that segments lung images into the left and right lungs and registers, or aligns, images into the same coordinate space so they can be analyzed more easily across people. The software includes a third element: a 3D image of an “average” lung shape based on high-resolution CTs from healthy control individuals enrolled in the COPDGene Study. Together, these components help researchers compare images from study subjects to the average lung, Ryan said.
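The core averaging idea behind such a template can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions (made-up helper names, a naive midline split rather than the anatomical segmentation the actual software performs): once scans are registered into one coordinate space, an "average" lung is simply a voxel-wise mean.

```python
import numpy as np

def average_template(registered_scans):
    """Voxel-wise mean of CT volumes that have already been registered
    (aligned) into the same coordinate space and so share one shape."""
    stack = np.stack(registered_scans, axis=0)  # (n_subjects, x, y, z)
    return stack.mean(axis=0)

def split_left_right(volume):
    """Crude split of a volume at the sagittal midline. The real
    pipeline segments the left and right lungs anatomically, not by
    a fixed halfway cut; this stand-in only shows the output shape."""
    mid = volume.shape[-1] // 2
    return volume[..., :mid], volume[..., mid:]
```

A study subject's registered scan could then be compared to the template, for example by subtracting the two volumes to highlight where the subject deviates from the average.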
The tool is free, and all of the code and data are publicly available—a boon to research. “Being open-source advances science more quickly,” Ryan said. “By collaborating with others, we can make advances more quickly across universities.”
“Democratizing” access to data and code for image analysis is a goal that Ghosh stresses, said Pam Russell, MA, a research instructor with the biostatistics and informatics department. Russell explained that getting underlying information from imaging studies about data sources and methodology can be difficult, in part because of patient privacy concerns. In a bid for greater openness, she developed a program called TCIApathfinder that enables individuals to more easily navigate The Cancer Imaging Archive, a database of some 30,000 deidentified images spanning many types of cancer.
The archive is hard to navigate for someone pointing and clicking at a computer, Russell said, and that difficulty creates “bottlenecks to research.”
To address that, TCIApathfinder allows researchers to write lines of code in R, a free programming language frequently used by statisticians, to plumb the database. That makes it relatively easy to search for all data sets related to, say, stomach cancer. In addition, TCIApathfinder saves the commands that led to the relevant data, creating “a history of everything you’ve done in a script instead of having to do ad hoc work,” Russell explained.
The upshot for researchers is greater speed and efficiency and the ability to write shareable scripts that reproduce their work. “R is the common language of biostatisticians,” Russell said. “We’ve made a program that is accessible to a wider range of people who are interested in large-scale data analysis.”
Russell published her work in the August 2018 issue of Cancer Research. It’s been downloaded more than 1,000 times from CRAN (the Comprehensive R Archive Network) and has received favorable comments from the National Cancer Institute.
The challenges confronting clinicians and public health researchers in a data-rich world are also evident in the task of analyzing resected pancreatic tissue for signs of neuroendocrine tumors. Today, pathologists microscopically examine a field of at least 500 cells. They look for areas with high levels of mitosis, or cell division, signaled by the presence of a protein that is a marker for unchecked cell growth. The pathologist then counts the number of tumor cells in the sample that express the protein (immunopositive) and those that do not (immunonegative). They use the percentage of immunopositive cells in the entire sample to grade the tumor: 2 percent or less is grade 1; 3 percent to 20 percent is grade 2; and anything above 20 percent is grade 3, or a neuroendocrine carcinoma, the most serious form with the most dismal prognosis.
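The counting-and-thresholding arithmetic above reduces to a small function. This is a hypothetical helper written only to make the cutoffs concrete, not the pathologists' actual software, and real-world grading also weighs other criteria such as mitotic counts:

```python
def tumor_grade(immunopositive: int, immunonegative: int) -> int:
    """Grade a neuroendocrine tumor from cell counts using the cutoffs
    described above: <=2% immunopositive is grade 1, 3-20% is grade 2,
    and anything above 20% is grade 3 (neuroendocrine carcinoma)."""
    total = immunopositive + immunonegative
    if total == 0:
        raise ValueError("no cells counted")
    pct = 100.0 * immunopositive / total
    if pct <= 2:
        return 1
    elif pct <= 20:
        return 2
    else:
        return 3
```

For example, 10 immunopositive cells out of a 500-cell field is 2 percent, landing just inside grade 1, while 15 such cells (3 percent) tips the sample into grade 2. That sensitivity at the boundaries is exactly why the accuracy of the count matters so much.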
The accuracy of the count and the grading is very important, said Toby Cornish, MD, PhD, a pathologist with University of Colorado Hospital, because it establishes the prognosis a surgeon presents to a patient. The estimated five-year survival rate declines dramatically with a diagnosis of a grade 2 tumor as opposed to a grade 1 or a grade 3 as opposed to a grade 2.
“We’re most concerned about percentages at the fringes,” Cornish said. Can the “gold standard” of counting by a highly trained specialist be improved? Cornish wants to find out in a collaboration with Fuyong Xing, PhD, assistant professor in biostatistics and informatics. The idea: program computers to identify cells in a neuroendocrine tumor tissue sample (distinguishing immunopositive from immunonegative cells, and tumor from non-tumor cells) and grade the tumor. That could produce a more reliable result in a shorter amount of time—potentially seconds as compared to hours, Xing said.
“Doing the count manually takes a lot of time and effort and can result in significant variations between pathologists,” Xing added.
Cornish and Xing are trialing a form of machine learning dubbed “deep learning” to teach computers to pick through arrays of tissue cells and finger the tumor-causing rogues. Programmers give the system no predefined features. They present the objects—in this case, the various kinds of tumor and non-tumor cells—and the system proceeds through a series of algorithms to “teach itself” the differences between them. It’s more nuanced than traditional machine learning, in which the programmer predefines the features to look for and asks the computer to distinguish between them. Properly instructed, the learning system can identify the differences between a pen and a pencil or things far more complex.
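The "teach itself" idea can be sketched on a toy problem. In the classic XOR task below, no single predefined linear rule on the raw inputs separates the two classes, yet a tiny network learns its own intermediate features from labeled examples alone. This is only an illustration of the principle, not the histopathology system itself, which stacks many such layers over image pixels:

```python
import numpy as np

# Four labeled examples: the output is 1 exactly when the inputs differ.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

W1 = rng.normal(0.0, 1.0, (2, 4)); b1 = np.zeros(4)   # hidden layer
W2 = rng.normal(0.0, 1.0, (4, 1)); b2 = np.zeros(1)   # output layer

losses = []
for _ in range(3000):
    hidden = np.tanh(X @ W1 + b1)                  # learned "features"
    logits = hidden @ W2 + b2
    out = 1.0 / (1.0 + np.exp(-logits))            # sigmoid probability
    losses.append(float(((out - y) ** 2).mean()))  # track squared error
    d_logits = out - y                 # cross-entropy gradient at logits
    dW2 = hidden.T @ d_logits
    d_hidden = (d_logits @ W2.T) * (1.0 - hidden ** 2)  # backprop tanh
    dW1 = X.T @ d_hidden
    W2 -= 0.1 * dW2; b2 -= 0.1 * d_logits.sum(axis=0)
    W1 -= 0.1 * dW1; b1 -= 0.1 * d_hidden.sum(axis=0)
```

By the end of the loop the tracked error is lower than at initialization: the hidden layer has, in effect, invented the combination of inputs that the programmer never specified.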
Xing said he’s completed preliminary testing of the deep-learning model on sample tissue images with encouraging results, but there is still much work to be done. One essential point: the work requires annotated slides that accurately identify each type of cell in the sample.
“It’s a lot of effort, and you have to have a trained pathologist to do it,” Cornish said. “If you don’t, all you’ve done is train a computer to reproduce what a non-pathologist would do.”
The work also relies on graphics processing unit (GPU) circuitry capable of producing images rapidly. “We’re figuring out how to use powerful technology to tackle medical problems,” Xing added.
Cornish said he’s grateful for his partnership with Xing and with Ghosh’s team in public health.
“There are very few people in the country who have the deep-learning experience, specifically in histopathology, [Xing] has. I am impressed with the Colorado School of Public Health and that they are working in the area of machine learning. The fact that they are branching into these new computational areas with people who are at the forefront of computation and medicine speaks volumes to their foresight about where medicine and prevention is going.”
This story was originally written for the Colorado School of Public Health’s 10th anniversary magazine.