Imagine making a traditional Sicilian pizza for dinner. You don’t want the same old flavor, so you experiment by adding a new ingredient like pears, chiles, or flavored olive oil. Sometimes, these new ingredients make the pizza taste even better, but other times, it might not taste as expected.
De novo mutations are similar. Everyone carries more than 100 de novo mutations, genetic changes that were not inherited from either parent. When DNA is copied, changes can occur, much like new ingredients slipping into the recipe, and create genetic variations. Some variations might help an organism adapt to its environment, while others could lead to health issues or disease. Most de novo mutations are neutral, like adding a few basil leaves to your pizza.
Studying de novo mutations across generations, from parents to children, is essential for understanding how these genetic changes develop and evolve. However, the current method for identifying these mutations, short-read data sequencing, presents several limitations.
To mitigate the limitations of short-read sequencing, Harriet Dashnow, PhD, assistant professor of biomedical informatics at the University of Colorado School of Medicine on the CU Anschutz Medical Campus, worked as part of the Platinum Pedigree Consortium to integrate long-read data sequencing technologies such as PacBio HiFi and Oxford Nanopore Technologies (ONT). These technologies span large repetitive regions of the genome, which allowed the team to call de novo mutations they could not detect before. The consortium included researchers from the University of Utah, University of Washington Genome Sciences, and PacBio.
Examining a larger portion of the genome enabled the team to gain insights into de novo mutations and trace their inheritance across generations. DNA samples were obtained from four generations of family members who were part of the International HapMap Project and the 1000 Genomes Project.
Long-read vs short-read data sequencing
Short-read data sequencing is a method that involves breaking the genome into small fragments, typically around 150 bases. While useful, this traditional method often leads to gaps in sequencing data because researchers can only analyze small portions of the genome at a time.
In contrast, long-read data sequencing reads much longer fragments, typically about 15,000 bases with PacBio HiFi. This capability allows researchers to assemble and analyze more repetitive and complex genome regions, providing a more complete picture. Dashnow explained, “Previous studies using short reads and focusing on non-repetitive regions found only 60-70 de novo variants per person. We saw more than double that when we could look at more of the genome. Part of that is because not only are we looking at more of the genome, but we’re also looking at the part of the genome that is more prone to mutations.” However, long-read sequencing has historically been less accurate than short-read sequencing, which raises the potential for errors.
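The difference in read length can be made concrete with a little arithmetic. The sketch below is an illustration only, not the consortium's method: it assumes a read must cover a whole repeat plus some unique sequence on each side to place it confidently, and the function name, the 50-base flank, and the repeat sizes are all hypothetical.

```python
def can_span(read_length: int, repeat_length: int, flank: int = 50) -> bool:
    """Return True if one read can cover the repeat plus `flank` bases on both sides.

    The 50-base flank is an assumed value chosen for illustration.
    """
    return read_length >= repeat_length + 2 * flank


SHORT_READ = 150     # typical short-read length cited in the article
LONG_READ = 15_000   # typical PacBio HiFi read length cited in the article

# Hypothetical repeat sizes, in bases
for repeat_length in (30, 120, 1_000, 10_000):
    print(
        repeat_length,
        can_span(SHORT_READ, repeat_length),
        can_span(LONG_READ, repeat_length),
    )
```

Under these assumptions, a 150-base read covers only the smallest repeat, while a 15,000-base read covers all four.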
Exploring the highly mutable but hard-to-sequence tandem repeats
Dashnow was responsible for genotyping the tandem repeats. A tandem repeat is a stretch of DNA in which a short segment of genetic code is repeated several times in a row. Researchers use tandem repeats to track how genes are passed down in families, which helps identify genetic diseases. Although tandem repeats make up only about 9% of the genome, they account for most of the de novo mutations found.
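To picture what a tandem repeat looks like in sequence data, here is a minimal sketch that counts back-to-back copies of a short motif. It is not how the team genotyped repeats; the function name and both sequences are made up for illustration.

```python
import re


def longest_repeat_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    if not motif:
        raise ValueError("motif must not be empty")
    # Match one or more consecutive copies of the motif
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


# Made-up sequences: the child carries two more copies of CAG than the parent
parent = "TTG" + "CAG" * 4 + "TT"
child = "TTG" + "CAG" * 6 + "TT"

print(longest_repeat_run(parent, "CAG"))  # 4
print(longest_repeat_run(child, "CAG"))   # 6
```

A change in copy number like this one, from four copies to six, is the kind of expansion researchers look for when comparing parents and children.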
PacBio HiFi sequencing technology can produce false positives, which complicates the identification of genuine mutations. To address this, the researchers collaborated with Tom Mokveld, bioinformatics scientist at PacBio, and his team to create TRGT-denovo, a computational tool designed to accurately detect all types of new tandem repeat (TR) mutations in family groups, whether the repeats expand, contract, or change.
The research team studied eight family trios with suspected rare diseases and successfully identified relevant mutations. The family structure helped confirm the findings: if a mutation was found in one family member, the researchers could also check that person's children to see whether it was present.
Because TRGT-denovo analyzes raw sequencing data directly, the team could spot slight changes that standard methods often miss. Not only did TRGT-denovo identify mutations, but it also eliminated false positives in the process. Using a larger catalog of repeat sequences uncovered new potential mutations, 95% of which were confirmed through additional experiments. This shows that TRGT-denovo effectively reduces false positives while reliably detecting true de novo mutations.
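The basic idea behind looking for a de novo change in a trio can be sketched in a few lines. This is the simple genotype comparison, not TRGT-denovo's algorithm, which works on the raw sequencing data. The function name and the repeat counts below are hypothetical, and a real analysis would also have to account for sequencing error.

```python
from typing import Tuple

Genotype = Tuple[int, int]  # repeat counts on a person's two copies of a locus


def is_candidate_de_novo(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """Return True if the child's genotype cannot be explained by inheriting
    one allele from the mother and one from the father."""
    a, b = child
    consistent = (a in mother and b in father) or (a in father and b in mother)
    return not consistent


# Hypothetical repeat counts at one tandem repeat
mother = (10, 11)
father = (12, 12)

print(is_candidate_de_novo((10, 12), mother, father))  # False: both alleles inherited
print(is_candidate_de_novo((10, 14), mother, father))  # True: 14 is in neither parent
```

A flagged genotype is only a candidate. A sequencing error in any of the three family members produces the same pattern, which is why filtering out false positives matters.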
Digging into the most mutable parts of the genome
Future studies will examine whether some sequences within the repetitive regions are more likely to mutate, whether researchers can predict which are more mutable than others, and why some mutations occur in almost every generation.
The researchers do not expect the mutations found in this family to be associated with disease. Still, Dashnow added, “We can use the individuals who don’t have a disease to understand how this is happening outside of the disease context and then apply that information to families who do potentially have a disease and use that to predict what might happen to them.”
While the final paper is several months away, the team’s preprint is available at bioRxiv. Dashnow expressed excitement that the data from 23 of 28 family members will be made public: “If we have genomes that are accessible and available to the community, we can replicate each other’s work. People who work in less well-funded or less well-connected institutions can get access just as easily as someone working in the NIH.” Making the data public will allow any researcher to use it to answer some of science’s most complex questions.