Public sequencing data archives are growing by petabases (millions of billions of DNA letters) per year. But for typical researchers, using these data requires huge downloads and laborious re-analysis. Most researchers aren’t equipped for this, so valuable data go unused. The JHU team, led by IDIES affiliate Ben Langmead (Computer Science) and Jeffrey Leek (Biostatistics), is working to make public data easier to use by choosing valuable subsets of the archive (human RNA sequencing data in this case), analyzing them, and summarizing them with a carefully crafted, uniform bioinformatics pipeline. The team then makes the resulting resources, such as recount, available to the community in a form that allows researchers to ask and answer sophisticated questions.
For this last step, the project depends on the SciServer system and borrows from its philosophy. SciServer provides storage, hosting, and a sophisticated interface for computing on and visualizing the data “on-site” at JHU, so researchers do not have to download either the raw or the summarized data. This allows biological researchers to use powerful, flexible tools (Jupyter, R, and Python) to interact with concise summaries of many public datasets.
The recount resource is available here, and information about accessing recount via SciServer can be found here. A paper describing the effort is in press at Nature Biotechnology, and a preprint is available at this link.
The recount project was primarily the work of Leonardo Collado Torres and Abhinav Nellore. The recount team spans several departments, with most of the work taking place in the labs of Ben Langmead (Computer Science), Jeffrey Leek (Biostatistics), Kasper Hansen (Institute for Genetic Medicine & Biostatistics), and Andrew Jaffe (The Lieber Institute for Brain Development).