Exploring transcription in tens of thousands of samples with Snaptron2

Christopher Wilks* 1, Jonathan Ling 2, Rone Charles 1, Ben Langmead 1; [1] Department of Computer Science, Johns Hopkins University; [2] Department of Neuroscience, Johns Hopkins University

Poster

As more and larger genomics studies appear, there is a growing need for comprehensive, queryable cross-study summaries. Such summaries make it easy for researchers to reproduce past studies, combine data in new ways, and test hypotheses using vast datasets that would otherwise be too expensive or difficult to obtain. We previously created Snaptron[1], a resource and software tool that allows researchers to pose sophisticated mRNA-splicing-related queries and obtain results within seconds, summarized across tens of thousands of public run accessions. Here we present its successor, Snaptron2. Beyond the junction data indexed by Snaptron, Snaptron2 indexes gene-, exon-, and base-level coverage across 100,000 bulk samples and single-cell accessions. As in Snaptron, sample metadata is also indexed and queryable. We describe Snaptron2 and its indexing and query strategies, and show its utility by applying it in several scientific investigations that benefit from many archived datasets. First, we use the indexed base-level coverage data to confirm novel transcription start sites (TSS) found in a comparison of 5′ sequencing assays[2]. Second, we revisit genome-wide studies of intron retention in TCGA and GTEx. Third, we conduct a screen of all repetitive elements, tallying evidence of novel exon splicing associated with specific tissues or diseases. Finally, we target single-cell RNA (scRNA) studies, providing them as separate Snaptron2 compilations and making them available in the recount resource[3]. By combining per-base coverage resolution with the specificity of scRNA studies, Snaptron2 can aid the development of scRNA tools that quickly aggregate the expression of one or more genes across many, or even all, cells or cellular subtypes.

Overall, Snaptron2 is a unique resource and tool that can answer sophisticated transcription-related queries while leveraging tens of thousands of valuable public sequencing datasets.

1. Wilks, C, Gaddipati, P, Nellore, A, Langmead, B (2018). Snaptron: querying splicing patterns across tens of thousands of RNA-seq samples. Bioinformatics, 34, 1:114-116.

2. Adiconis, X, Haber, AL, Simmons, SK, Levy Moonshine, A, Ji, Z, Busby, MA, Shi, X, Jacques, J, Lancaster, MA, Pan, JQ, Regev, A, Levin, JZ (2018). Comprehensive comparative analysis of 5′-end RNA-sequencing methods. Nat. Methods, 15, 7:505-511.

3. Collado-Torres, L, Nellore, A, Kammers, K, Ellis, SE, Taub, MA, Hansen, KD, Jaffe, AE, Langmead, B, Leek, JT (2017). Reproducible RNA-seq analysis using recount2. Nat. Biotechnol., 35, 4:319-321.