Christopher Wilks1,2, Jonathan Ling6, Phani Gaddipati4, Abhinav Nellore5,6, Ben Langmead1,2
1. Department of Computer Science, Johns Hopkins University
2. Center for Computational Biology, Johns Hopkins University
3. Department of Neuroscience, Johns Hopkins University
4. Department of Biomedical Engineering, Johns Hopkins University
5. Department of Biomedical Engineering, Oregon Health & Science University
6. Department of Surgery, Oregon Health & Science University
As more and larger genomics studies appear, there is a growing need for comprehensive and queryable cross-study summaries. These enable researchers to leverage vast datasets that would otherwise be too difficult to obtain or too computationally unwieldy to analyze from scratch. We present Snaptron, a search engine for summarized RNA sequencing data. It serves data from over 70,000 human RNA-seq samples, analyzed using Rail-RNA [2, 3] and also available in a less processed form from recount2. Snaptron’s computational core is a query planner that leverages R-tree, B-tree and inverted indexing strategies to rapidly execute queries over the 146 million exon-exon splice junctions found in these samples.
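To make the indexing idea concrete, the following is a minimal sketch of how a coordinate index and an inverted index over sample metadata can be combined to answer one query. It is not Snaptron's implementation: a sorted list with binary search stands in for the R-tree and B-tree indexes, and the junctions, sample descriptions, and the names `by_chrom`, `term_index` and `query` are all hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Toy junctions (hypothetical): (chromosome, start, end, {sample_id: read count}).
junctions = [
    ("chr1", 100, 250, {1: 12, 2: 3}),
    ("chr1", 300, 480, {2: 7}),
    ("chr1", 900, 1200, {1: 4, 3: 9}),
    ("chr2", 150, 400, {3: 5}),
]

# Toy sample metadata (hypothetical), keyed by sample ID.
samples = {1: "brain cortex", 2: "liver", 3: "brain cerebellum"}

# Coordinate index: per chromosome, junctions sorted by start position.
# A sorted list plus binary search is a simple stand-in for a tree index.
by_chrom = defaultdict(list)
for junction in junctions:
    by_chrom[junction[0]].append(junction)
for chrom in by_chrom:
    by_chrom[chrom].sort(key=lambda j: j[1])

# Inverted index: metadata term -> set of sample IDs whose description has it.
term_index = defaultdict(set)
for sample_id, description in samples.items():
    for term in description.split():
        term_index[term].add(sample_id)


def query(chrom, lo, hi, term):
    """Return junctions fully inside [lo, hi] on chrom that are supported
    by at least one read in a sample whose metadata contains term."""
    rows = by_chrom.get(chrom, [])
    starts = [j[1] for j in rows]
    first = bisect_left(starts, lo)    # first junction starting at or after lo
    last = bisect_right(starts, hi)    # one past the last junction starting by hi
    wanted = term_index.get(term, set())
    hits = []
    for _, start, end, counts in rows[first:last]:
        matched = {s: c for s, c in counts.items() if s in wanted and c > 0}
        if end <= hi and matched:
            hits.append((start, end, matched))
    return hits


print(query("chr1", 50, 1000, "brain"))
```

The coordinate lookup narrows the candidate junctions first, and the inverted index then restricts them to the samples of interest. Deciding which constraint to apply first is the kind of choice a query planner makes.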
The easiest way to use Snaptron is via its RESTful web service interface (http://snaptron.cs.jhu.edu), which lets researchers begin posing queries immediately (for example, starting from just a gene name) with little or no software installation. Most queries take only a few seconds and can be tailored by constraining which junctions and samples to consider. Snaptron can score junctions according to tissue specificity or other criteria. Importantly, Snaptron can also score samples according to alternative splicing patterns by calculating the “percent spliced in” (PSI) of individual exons. Using this framework, we have identified hundreds of previously unannotated cell type-specific exons and the splicing factors that regulate them. We further highlight several case studies relevant to human disease to illustrate the versatility of Snaptron.
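As an illustration of scoring samples by “percent spliced in”, the sketch below computes PSI for a single cassette exon from junction read counts and ranks samples by it. It assumes a commonly used definition (mean coverage of the two inclusion junctions divided by that mean plus the coverage of the exon-skipping junction); the formula Snaptron applies may differ in detail. The function names and the read counts are hypothetical.

```python
def percent_spliced_in(incl_upstream, incl_downstream, excl):
    """PSI (0-100) for a cassette exon from junction read counts.

    incl_upstream / incl_downstream: reads on the two junctions that join
    the exon to its flanking exons; excl: reads on the junction that skips it.
    Returns None when no reads cover any of the three junctions.
    """
    inclusion = (incl_upstream + incl_downstream) / 2
    total = inclusion + excl
    if total == 0:
        return None
    return 100.0 * inclusion / total


def rank_samples_by_psi(counts_by_sample):
    """Return (sample, PSI) pairs sorted from highest to lowest PSI,
    leaving out samples with no coverage."""
    scored = []
    for sample, counts in counts_by_sample.items():
        psi = percent_spliced_in(*counts)
        if psi is not None:
            scored.append((sample, psi))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical counts: (inclusion upstream, inclusion downstream, exclusion).
toy_counts = {
    "sampleA": (30, 50, 10),
    "sampleB": (2, 4, 27),
    "sampleC": (0, 0, 0),
}

for sample, psi in rank_samples_by_psi(toy_counts):
    print(sample, round(psi, 1))
```

Samples without reads on any of the three junctions are dropped rather than scored as zero, since an absent exon and an unexpressed gene would otherwise be indistinguishable.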