Charlotte Darby, Ben Langmead, Michael Schatz, Computer Science, Johns Hopkins University
In contrast to germline (inherited) variants, DNA mutations that occur during development are present in only some cells of the developed individual. A healthy human is thought to harbor many benign “somatic mutations” throughout the body, but certain additional ones can cause disease. Somatic mutations have been implicated in autism, in rare diseases (including those in which the skin shows a visible “mosaic” pattern), and in many forms of cancer.
Short-read sequencing of paired (affected/normal) samples or of a pedigree is currently used to identify somatic variants through statistical analysis of which variants are seen, or not seen, in the normal tissue or in the individual's parents. This approach is effective, but such data are not always available. Instead, we use “linked reads” from 10X Genomics, in which each linked read is a group of short reads sequenced from the same original long DNA molecule. This technology allows the individual short reads at a genomic position to be grouped into those from the maternal genome and those from the paternal genome, adding haplotype information to the simple counts of reads supporting each base.
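As a rough illustration of why haplotype grouping helps, the sketch below tallies bases at a single position separately for each haplotype. It is not the authors' pipeline: the function names, the `(base, haplotype)` input format, and the toy read lists are all assumptions made for this example, and the haplotype labels (1 or 2) are assumed to have been assigned upstream from the linked reads.

```python
from collections import Counter, defaultdict


def allele_counts_by_haplotype(reads):
    """Count bases per haplotype.

    `reads` is an iterable of (base, haplotype) pairs for one genomic
    position; haplotype is 1, 2, or None for unphased reads.
    (Hypothetical input format, chosen for illustration only.)
    """
    counts = defaultdict(Counter)
    for base, hap in reads:
        counts[hap][base] += 1
    return {hap: dict(c) for hap, c in counts.items()}


def alt_fraction_per_haplotype(reads, alt):
    """Fraction of reads carrying `alt` on each haplotype (None if no reads)."""
    counts = allele_counts_by_haplotype(reads)
    fractions = {}
    for hap in (1, 2):
        hap_counts = counts.get(hap, {})
        total = sum(hap_counts.values())
        fractions[hap] = hap_counts.get(alt, 0) / total if total else None
    return fractions


# Toy data (made up): a germline heterozygous site, where every read on
# haplotype 2 carries the alternate base G ...
germline_het = [("A", 1)] * 10 + [("G", 2)] * 10
# ... and a candidate somatic site, where only some haplotype-2 reads do.
somatic = [("A", 1)] * 10 + [("A", 2)] * 7 + [("G", 2)] * 3

germline_fractions = alt_fraction_per_haplotype(germline_het, "G")
somatic_fractions = alt_fraction_per_haplotype(somatic, "G")
```

In this toy setting the alternate base is present on every haplotype-2 read at the germline site but on only a minority at the somatic candidate, a distinction that pooled per-base counts alone would express only as a difference in overall allele fraction.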
We use decision-tree-based feature classification to prioritize candidate somatic mutations for further study. We apply the method to simulated 10X Genomics data and to real data with simulated somatic mutations. The results suggest that features based on linked-read quality and individual-read haplotypes enable the model to substantially outperform the same model applied to short reads alone.
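The abstract does not specify the tree model or the exact feature set, so the following is only a minimal sketch of the general idea: a single-split decision tree (a "stump") trained by minimizing Gini impurity. The two features (overall alternate-allele fraction, and alternate-allele fraction on the carrying haplotype), the toy values, and all function names are assumptions for illustration, not the authors' implementation.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def majority(labels):
    """Most common 0/1 label (ties go to 1)."""
    return int(2 * sum(labels) >= len(labels))


def fit_stump(rows, labels):
    """Find the single feature/threshold split with the lowest weighted Gini.

    Returns (feature_index, threshold, left_label, right_label), where
    rows with feature <= threshold are predicted as left_label.
    """
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted({r[j] for r in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, t, majority(left), majority(right))
    if best is None:
        raise ValueError("no feature has two distinct values to split on")
    return best[1:]


def predict(stump, row):
    j, t, left_label, right_label = stump
    return left_label if row[j] <= t else right_label


# Toy candidates (made up): [overall alt fraction, alt fraction on the
# haplotype that carries the alt allele]; label 1 = somatic, 0 = germline.
rows = [
    [0.50, 1.00], [0.45, 0.95], [0.30, 1.00],   # germline
    [0.15, 0.30], [0.25, 0.50], [0.35, 0.60],   # somatic
]
labels = [0, 0, 0, 1, 1, 1]

stump = fit_stump(rows, labels)
```

On this toy data the overall allele fraction does not cleanly separate the two classes, while the haplotype-level fraction does, so the stump splits on the second feature. A real analysis would use a deeper tree or an ensemble and many more features and training examples.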