Clustermatch Correlation Coefficient (CCC): Uncovering Complex Patterns in Molecular Data

When researchers explore molecular datasets, they often rely on familiar statistical tools such as Pearson’s or Spearman’s correlation coefficients. These methods are widely used to build gene co-expression networks and identify relationships between variables. However, they are designed to capture linear or monotonic relationships and can miss more complex patterns that arise in biological systems.

To address this limitation, researchers in the Department of Biomedical Informatics (DBMI) at the University of Colorado Anschutz (CU Anschutz) developed the Clustermatch Correlation Coefficient (CCC), a statistical framework and open-source Python tool for detecting complex associations between variables, including patterns that go beyond simple linear relationships. The method was originally developed by the Greene Lab in DBMI and was published in Cell Systems in 2024.

Unlike traditional correlation metrics, CCC uses clustering to identify complex patterns in data. It still captures linear associations, but it can also detect relationships that standard correlation methods would miss.
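To make the clustering idea concrete, here is a minimal sketch in plain Python. It assumes a simplified version of the approach: each variable is split into rank-based (quantile) groups for several group counts, the two sets of groupings are compared with the adjusted Rand index, and the best agreement is kept. The function names (`quantile_partition`, `adjusted_rand_index`, `ccc_sketch`, `pearson`) are hypothetical and written for illustration only. This is not the code of the CCC package, and it skips details such as tie handling and categorical variables.

```python
from collections import Counter
from math import comb, sqrt


def quantile_partition(values, k):
    """Assign each value to one of k roughly equal-sized groups by rank.

    Ties are split by original position, a simplification for this sketch.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = rank * k // n
    return labels


def adjusted_rand_index(a, b):
    """Agreement between two groupings, corrected for chance."""
    total = comb(len(a), 2)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / total
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 0.0  # degenerate case, e.g. a single group on both sides
    return (sum_cells - expected) / (max_index - expected)


def ccc_sketch(x, y, max_k=10):
    """Best agreement over all pairs of groupings; never below 0."""
    ks = range(2, min(max_k, len(x)) + 1)
    parts_x = [quantile_partition(x, k) for k in ks]
    parts_y = [quantile_partition(y, k) for k in ks]
    best = 0.0
    for px in parts_x:
        for py in parts_y:
            best = max(best, adjusted_rand_index(px, py))
    return best


def pearson(x, y):
    """Standard Pearson correlation, included for comparison."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


# A symmetric U-shaped relationship: y depends entirely on x, but not linearly.
x = list(range(-50, 51))
y = [v * v for v in x]
print("Pearson:", pearson(x, y))
print("Clustering-based score:", ccc_sketch(x, y))
```

On this U-shaped example, Pearson's coefficient is essentially zero because the rising and falling halves cancel out, while the clustering-based score is clearly positive because groups of similar x values still line up with groups of similar y values.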

This is the fourth installment in our ongoing series spotlighting the DBMI Wall of Software, an interactive online hub showcasing the latest open-source software and data tools for researchers.

The DBMI researchers who developed CCC apply the tool to transcriptomic and genomic data to uncover relationships between genes with respect to specific traits. Their approach is inspired by modern models of the genetic architecture of complex traits, such as the omnigenic and stratagenic models, which highlight the key role of gene networks and biological pathways in disease predisposition.

“Analyzing large gene expression compendia with methods that detect nonlinear relationships could help clarify the biological roles of understudied genes.”

- Milton Pividori, PhD

Researchers in the Pividori Lab at CU Anschutz emphasize that CCC helps reveal how genes interact within broader regulatory networks, which could provide insight into both direct and indirect genetic influences on complex traits.

“For a given trait, the omnigenic model predicts that there will be some core genes that are directly impacting that trait. But there are also some peripheral genes that are regulating those core genes—therefore impacting the trait indirectly,” said Haoyu Zhang, MS, Research Services Senior Professional in the Pividori Lab at CU Anschutz. “Our lab seeks to infer any biologically meaningful relationship between two genes, but also takes a step further and looks at the whole network.”

By capturing more nuanced patterns of gene co-expression, the CCC can help researchers build a more complete view of molecular systems. For example, relationships between genes may be statistically significant across different biological contexts (such as sex, disease status or environmental exposure), but differ in their strength, leading to a complex, nonlinear pattern. CCC is designed to detect this type of structured complexity.

“The CCC finds complexity in molecular data that we miss in other methods,” said Milton Pividori, PhD, assistant professor of biomedical informatics at CU Anschutz. “It helps us build a comprehensive molecular understanding of a disease. When applied to multiple data modalities, this could help us figure out which genes might be more effective as a drug target.”

CCC-GPU: An accelerated implementation of the CCC

As molecular datasets continue to grow, computational efficiency becomes critical. To enable CCC analyses at larger scales, researchers in the Pividori Lab developed CCC-GPU, a GPU-accelerated implementation of the original method, published in Bioinformatics this month. A GPU (graphics processing unit) is computational hardware that speeds up specific types of data processing by running many calculations in parallel, which is why it is widely used for deep learning and large language models (LLMs). CCC-GPU uses the same mathematics as the CCC but runs up to 30 to 40 times faster than the original implementation, making it feasible to analyze massive datasets.

"Like other approaches such as the Maximal Information Coefficient (MIC), CCC can capture nonlinear patterns in data—but far more efficiently. With CCC-GPU, we can now perform large-scale analyses in a fraction of the time," said Zhang.

Applications of CCC and CCC-GPU

Although DBMI researchers have primarily applied the CCC to transcriptomic data, it is a general statistical coefficient and can be used wherever complex relationships between variables are expected. For example, CCC could support data analysis in fields ranging from natural language processing (NLP) and neuroscience to marketing and public health. Beyond exploratory analysis, CCC scores could also be used to select features before training a machine learning model. By expanding how researchers measure association, and by making those analyses computationally scalable, CCC and CCC-GPU provide new tools for uncovering hidden structures in large, complex datasets.
