Jonathan Ling1, Christopher Wilks2,3, Abhinav Nellore4,5, Ben Langmead2,3
1: Johns Hopkins University, Neuroscience, Baltimore, MD; 2: Johns Hopkins University, Computer Science, Baltimore, MD; 3: Johns Hopkins University, Center for Computational Biology, Baltimore, MD; 4: Oregon Health & Science University, Biomedical Engineering, Portland, OR; 5: Oregon Health & Science University, Surgery, Portland, OR
De novo identification of novel transcripts is an exceptionally challenging task, and researchers commonly rely on annotated transcript databases to quantify expression or alternative splicing. However, unannotated splicing events can be crucial to understanding disease and discovering new therapies. As an example, we recently developed a method for identifying unannotated cryptic exons that are linked to neurodegeneration (1-3) and neuronal differentiation (4). This method, however, requires extensive manual annotation and is difficult to scale across many samples.
Motivated by the vast amount of splicing data available in public, archived RNA sequencing datasets, we have extended the Snaptron software and web service (5) to enable rapid, large-scale screens for tissue- and cell type-specific splicing patterns. Snaptron is the query-answering portion of a larger search engine (5-8) for splice junctions observed in tens of thousands of RNA-seq samples from the Sequence Read Archive and other large projects such as GTEx and TCGA. Using this framework, we have identified hundreds of highly incorporated, previously unannotated, cell type-specific exons and the splicing factors that regulate these exons. Snaptron has also allowed us to screen cryptic exons found in human disease (1-3) across all published datasets, yielding surprising insights into disease etiology.
Finally, we demonstrate an intuitive web interface for visualizing a query exon’s “percent spliced in” frequency across various datasets of choice (cell types, tissues, cancer subgroups, gene knockdowns, etc.). Snaptron provides a framework that allows for extremely versatile queries and enables researchers to leverage vast datasets that would otherwise be too difficult to obtain or too computationally unwieldy to analyze from scratch. We hope that this ability to cross-reference all published datasets will accelerate interdisciplinary approaches in ways that have yet to be conceived.
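As a rough illustration of the "percent spliced in" quantity mentioned above, the following minimal sketch computes it from splice junction read counts for a single cassette exon. It assumes one common convention (average the two inclusion junctions, then divide by inclusion plus exclusion support); the abstract does not state the exact formula used, and the function name, sample names, and read counts here are hypothetical.

```python
from statistics import mean


def percent_spliced_in(inclusion_counts, exclusion_count):
    """Return PSI in [0, 1], or None if no reads support either isoform.

    Assumed convention (not taken from the abstract):
    PSI = mean(inclusion junction reads) / (mean(inclusion) + exclusion reads)
    """
    # Average the upstream and downstream junctions that include the exon.
    inclusion = mean(inclusion_counts)
    total = inclusion + exclusion_count
    if total == 0:
        return None  # PSI is undefined without any junction coverage
    return inclusion / total


# Hypothetical junction read counts for one exon in two samples.
samples = {
    "sample_A": {"inclusion": (30, 50), "exclusion": 10},
    "sample_B": {"inclusion": (0, 0), "exclusion": 25},
}

psi_by_sample = {
    name: percent_spliced_in(counts["inclusion"], counts["exclusion"])
    for name, counts in samples.items()
}

for name, psi in psi_by_sample.items():
    print(name, psi)
```

Returning None for samples without coverage keeps uncovered samples distinguishable from samples in which the exon is truly skipped (PSI of 0).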
1. Ling JP et al, Science (2015) PMID 26250685
2. Jeong YH et al, Mol Neuro (2017) PMID 28153034
3. Sun M et al, Acta Neuropath (2017) PMID 28332094
4. Ling JP et al, Cell Rep (2016) PMID 27681424
5. Wilks C et al, bioRxiv (2017) doi: 10.1101/097881
6. Nellore A et al, Bioinformatics (2016) PMID 27592709
7. Nellore A et al, Bioinformatics (2016) PMID 27153614
8. Collado-Torres L et al, Nat Biotech (2017) PMID 28398307