A multi-experiment resource of analysis-ready RNA-seq gene and exon count datasets

recount2 is an online resource consisting of RNA-seq gene and exon counts as well as coverage bigWig files for 2041 different studies. It is the second generation of the ReCount project. The raw sequencing data were processed with Rail-RNA, as described in the recount2 paper and in Nellore et al, Genome Biology, 2016, which created the coverage bigWig files. For ease of statistical analysis, we created count tables at the gene and exon levels for each study and extracted phenotype data, which we provide in their raw formats as well as in RangedSummarizedExperiment R objects (described in the SummarizedExperiment Bioconductor package). We also computed the mean coverage per study and provide it in a bigWig file, which can be used with the derfinder Bioconductor package to perform annotation-agnostic differential expression analysis at the expressed-region level, as described in Collado-Torres et al, Nucleic Acids Research, 2017. The count tables, RangedSummarizedExperiment objects, phenotype tables, sample bigWigs, mean bigWigs, and file information tables are ready to use and freely available here. We also created the recount Bioconductor package, which allows you to search and download the data for a specific study. By taking care of several preprocessing steps and combining many datasets into one easily accessible website, we make finding and analyzing RNA-seq data considerably more straightforward.


Main publication


Related publications


The Datasets


Download list of studies matching search results. Note that GTEx is separated from this list.

Authors


Support

If you need help with the recount Bioconductor package please get in touch via the Bioconductor support website (remember to use the recount tag). Please check this post on how to ask for help for Bioconductor packages. For support on reproducing the recount2 project, please get in touch via the recount-website repository.

This tab shows the information for the GTEx project. Due to its size, we also provide ranged summarized experiment objects (RSE) by tissue at the gene and exon levels.

This tab shows the information for the TCGA project. Due to its size, we also provide ranged summarized experiment objects (RSE) by tissue at the gene and exon levels.

recount2 objects documentation

All columns of the table below are sortable and searchable. The columns are as follows:

accession

The SRA accession identifier for the study. The link points to SRA for a full description of the study.

number of samples

The total number of samples available for the given study. Note that in some exceptional cases not all samples for a given study were analyzed with Rail-RNA.

species

The species of the samples under study.

abstract

The abstract describing the study as provided by SRA via the SRAdb Bioconductor package.

RSE gene

The RangedSummarizedExperiment object for the counts summarized at the gene level using the Gencode v25 (GRCh38.p7, CHR) annotation as provided by Gencode. Note that the GRanges object recount::recount_genes includes bp_length as a metadata column which is the sum of the exon widths. If two exons are overlapping, the overlapped bases are only counted once.
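To illustrate how overlapping bases are counted only once, here is a minimal sketch using the GenomicRanges package. The exon coordinates are made up for illustration and do not come from recount::recount_genes; the sketch only mirrors the description above and is not necessarily the code used to build bp_length.

```r
library("GenomicRanges")

## Two overlapping exons and one separate exon from a hypothetical gene
exons <- GRanges("chr1", IRanges(start = c(1, 4, 20), end = c(8, 10, 25)))

## A naive sum of the exon widths counts bases 4 to 8 twice
naive_length <- sum(width(exons))
naive_length

## Merging overlapping exons first counts each exonic base only once
bp_length <- sum(width(reduce(exons)))
bp_length
```

In this toy gene the naive sum is larger than bp_length because the two overlapping exons share five bases.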

counts gene

A tsv file with the count matrix used to create the RangedSummarizedExperiment object at the gene level. Version 2 files include the gene ids in an extra column.

RSE exon

The RangedSummarizedExperiment object for the counts summarized at the exon level using the Gencode v25 (GRCh38.p7, CHR) annotation as provided by Gencode. The GRangesList object recount::recount_exons has 1 element per gene. For each gene, this object contains the reduced exons (version 1) or disjoint exons (version 2) such that they are non-overlapping within a gene. See the version section below for more information.

counts exon

A tsv file with the count matrix used to create the RangedSummarizedExperiment object at the exon level.

RSE junctions

The RangedSummarizedExperiment object for the counts summarized at the junction level. This GRanges object has 1 element per junction. For each junction this object contains the transcript names for the Gencode v24 junctions, transcript names and gene ids matching those used in the RSE objects at the gene or exon levels (based on Gencode v25 CHR regions), junction class, proposed gene ids and symbols. The junction ids match those used in the jx_cov and jx_bed files explained below. This file is present only if the project has at least one junction detected.

counts jx

A tsv file with the count matrix used to create the RangedSummarizedExperiment object at the junction level. This file is present only if the project has at least one junction detected.

RSE transcript

A RangedSummarizedExperiment object with the transcript quantifications as done by Fu et al, bioRxiv, 2018.

Junction raw coverage file

The link jx_cov points to the raw junction coverage file that contains the junction ids (comma-separated), the sample ids (comma-separated), and the actual coverage value for the junction. Sample ids can be matched to run and project accession numbers using sample_ids.tsv (junction id, project accession, run accession).

Junction BED file

The link jx_bed points to the BED file with one entry per junction present in the given project. The name of the junction includes the junction id, the donor, acceptor, and overall (if present in both) transcript names based on Gencode v24. See the known issues for an important detail on these BED files.

phenotype

The phenotype information (sample metadata) in a tsv file used for both RangedSummarizedExperiment objects. The table includes the SRA study id, the SRA sample id, the SRA experiment id, the SRA run id, the read counts as reported by SRA, the number of reads aligned by Rail-RNA, the proportion of reads reported by SRA that aligned, whether the sample was paired-end or not, whether we think SRA misreported the paired-end label, the mapped read count by Rail-RNA, the coverage AUC, the SHARQ prototype tissue, the SHARQ prototype cell type, the biosample submission date, the biosample publication date, the biosample update date, the average read length, the GEO accession id, the sample title as extracted from GEO, the sample characteristics as extracted from GEO, and the name of the coverage bigWig file.

files info

A tsv file with the names of the uploaded files, the md5sum, the size in bytes and the url to download the actual files. Version 2 includes the md5sum for the transcript files.
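As a hedged illustration of how the md5sum values can be used, the sketch below checks a local file against a recorded checksum with tools::md5sum() from base R. The file content and the column names file_name and md5sum are assumptions made for this example; check the header of the actual files info table before adapting it.

```r
## Stand-in for a downloaded file (hypothetical content)
local_file <- tempfile(fileext = ".tsv")
writeLines(c("gene_id\tsample1", "gene1\t10"), local_file)

## Stand-in for the files info table; the column names are assumptions.
## In practice the md5sum column would come from the downloaded tsv file.
files_info <- data.frame(
    file_name = basename(local_file),
    md5sum = unname(tools::md5sum(local_file)),
    stringsAsFactors = FALSE
)
expected <- files_info$md5sum[files_info$file_name == basename(local_file)]

## Recompute the checksum locally and compare it to the recorded value
checksum_ok <- identical(unname(tools::md5sum(local_file)), expected)
checksum_ok

## Simulate a corrupted download: the checksum should no longer match
cat("extra bytes", file = local_file, append = TRUE)
corrupted_ok <- identical(unname(tools::md5sum(local_file)), expected)
corrupted_ok
```

A mismatch suggests the download was incomplete or altered and should be repeated.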

Annotation used

The Gencode v25 GFF3 file with the comprehensive gene annotation (CHR regions) is available at the Gencode website.

FANTOM-CAT/recount2 RSE file

A RangedSummarizedExperiment object with the FANTOM-CAT/recount2 annotation counts. See Imada, Sanchez et al, bioRxiv, 2019 for more information on how this annotation was determined. The rowRanges() slot is a GRangesList with one element per gene, containing the exon coordinates for that gene. For the md5sum and file size information for these files, please check the fc_rc_files_info.tsv file.


Getting started with RangedSummarizedExperiment objects

Check out the recount Bioconductor package for how to download data from the recount2 project and get started with your own analysis. The quick start vignette is particularly helpful. For more details, check the SummarizedExperiment Bioconductor package for an overview of RangedSummarizedExperiment objects. Most RNA-seq differential expression Bioconductor packages use them. In particular, the vignette of the DESeq2 Bioconductor package shows in detail how to perform an analysis with them. You might also want to use the DEFormats Bioconductor package for converting these objects to other formats.


Difference between versions

Version 1: reduced exons

Exon counts are derived from reduced exons, such that each exonic base is only counted once.

Version 2: disjoint exons, released January 12, 2018

Exon counts are derived from disjoint exons, which also result in each exonic base being counted just once. However, disjoint exons are more useful than reduced exons because it is possible to reconstruct the counts for the actual exons from the disjoint exon counts. The code under "Raw code" below might help to understand the difference.

Version 2: transcript RSE files, May, 2018

Transcript RSE files were re-calculated by Fu et al. For details check the second version of the pre-print.

Raw code:

library("GenomicRanges")
exons <- GRanges("seq", IRanges(start = c(1, 1, 13), end = c(5, 8, 15)))
exons

## Results in 2 reduced exons. Cannot get the counts for exons 1 or 2.
reduce(exons)

## Results in 3 disjoint exons. The sum of disjoint exon 1 and 2 is equal to exon 2.
disjoin(exons)

Output:

exons
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]      seq  [ 1,  5]      *
  [2]      seq  [ 1,  8]      *
  [3]      seq  [13, 15]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

## Results in 2 reduced exons. Cannot get the counts for exons 1 or 2.
reduce(exons)
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]      seq  [ 1,  8]      *
  [2]      seq  [13, 15]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

## Results in 3 disjoint exons. The sum of disjoint exon 1 and 2 is equal to exon 2.
disjoin(exons)
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]      seq  [ 1,  5]      *
  [2]      seq  [ 6,  8]      *
  [3]      seq  [13, 15]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

We realized the advantage of disjoint exons during the peer review of the recount workflow paper. For backward compatibility, we provide both versions of the files. The disjoint exon count files are larger than the reduced exon count files. The gene counts did not change between versions; only a handful of gene symbols did, for which we updated the gene files. The gene count text files (counts_gene.tsv.gz) now include the gene ids as an extra column, as requested by several users. The file information document now includes the md5sum for the new files (including the transcript files).
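The following sketch shows one way the counts for the original exons could be rebuilt from disjoint exon counts, reusing the toy exons from the example above. The count values are invented for illustration, and the approach assumes that counts are additive across the disjoint pieces of an exon; it is not taken from the recount code base.

```r
library("GenomicRanges")

## Same toy exons as in the example above
exons <- GRanges("seq", IRanges(start = c(1, 1, 13), end = c(5, 8, 15)))
disjoint_exons <- disjoin(exons)

## Hypothetical counts for the disjoint exons [1, 5], [6, 8] and [13, 15]
disjoint_counts <- c(10, 4, 7)

## Find which disjoint pieces make up each original exon
hits <- findOverlaps(exons, disjoint_exons)

## Add up the disjoint counts that belong to each original exon
exon_counts <- as.vector(tapply(
    disjoint_counts[subjectHits(hits)], queryHits(hits), sum
))
exon_counts
```

Exon 2 spans disjoint exons 1 and 2, so its count is the sum of both pieces. With reduced exons this step is not possible because exons 1 and 2 are merged into a single range.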


Code used

The code for analyzing the Rail-RNA output and creating this website is available via GitHub.

Known issues

  • Rail-RNA run on SRA data
    Check the NOTES on the Rail-RNA run on SRA data (version 2), which describes why some samples were discarded. Also check this known issue listing the samples with 0 reads downloaded, which are therefore missing from the RSE files.
  • Single-cell studies with pooled samples
    Some single-cell studies like SRP058046 sequenced more than one single cell but are available only as a pooled sample from the Sequence Read Archive. Currently these studies might not be useful for differential expression analysis via recount2, but they can be used for checking if a given exon-exon junction is present or if a given region is expressed. Reported by Lukas Simon.
  • BED files have the right end off by 1 base pair
    Check the R code we used to read in these BED files here. The coordinates in these files are end of exon1 + 1, start of exon2 - 1 in zero-based coordinates and should have been end of exon1 + 1, start of exon2 in zero-based coordinates to be a proper BED file.
    Found an issue?
    If you found an issue with recount2, please email us or describe the problem at the recount-website issue tracker. Thank you!
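For the BED off-by-one issue listed above, a minimal base R sketch of the adjustment is shown here. The data frame stands in for records read from a jx_bed file; the coordinates, junction names and column names are hypothetical, and the authors' own reader code may handle this differently.

```r
## Hypothetical junction records as read from a jx_bed file
bed <- data.frame(
    chrom = c("chr1", "chr1"),
    chromStart = c(100, 500),
    chromEnd = c(199, 749),
    name = c("jx1", "jx2"),
    stringsAsFactors = FALSE
)

## Shift the right end by 1 base so that it equals the start of exon 2,
## which is what a proper BED record would contain
bed_fixed <- transform(bed, chromEnd = chromEnd + 1)
bed_fixed
```

Only the right end changes; the left end already follows the BED convention according to the description above.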

    FAQ

  • When is the next version going to be released? We are currently (summer 2018) determining what will be the next phase of the recount project. Please stay tuned, thanks!
  • The following R code shows how to use the recount Bioconductor package for downloading data. In this example we will download the data for study SRP009615.

    ## Install recount from Bioconductor
    install.packages("BiocManager")
    BiocManager::install('recount')
    
    ## Browse the vignettes for a quick description of how to use the package
    library('recount')
    browseVignettes('recount')
    
    ## Download the RangedSummarizedExperiment object at the gene level for 
    ## study SRP009615
    url <- download_study('SRP009615')
    
    ## View the url for the file by printing the object url
    url
    
    ## Load the data
    load(file.path('SRP009615', 'rse_gene.Rdata'))
    
    ## Scale counts
    rse <- scale_counts(rse_gene)
    
    ## Then use your favorite differential expression software
    
    ## For more details, check the recount package vignette at
    ## http://bioconductor.org/packages/recount
    
    Not everyone has over 8 terabytes of disk space available to download all the data from the recount2 project. However, thanks to SciServer you can access it locally via a Jupyter Notebook. If you do so and want to share your work, please let the SciServer maintainers know via Twitter at IDIESJHU. The step by step instructions describing how to access recount2 via SciServer are available in the recount vignette.

    Contribute your data to recount2!

    If you are interested in contributing your human RNA-seq data sequenced on the Illumina platform to recount2, please check how to do so at recount-contributions. Thank you!

    Data license

    The data in recount2 is licensed under CC BY 4.0. The legal text can be found here.


    Acknowledgements

    This research was supported by NIH R01 GM105705. LCT was supported by Consejo Nacional de Ciencia y Tecnología México 351535. LCT and AEJ were supported by NIH 1R21MH109956-01. Amazon Web Services experiments were supported by AWS in Education research grants. Storage costs on S3 for TCGA runs were partially covered by a grant from Seven Bridges Genomics for use of the Cancer Genomics Cloud.

    recount2 is hosted on SciServer, a collaborative research environment for large-scale data-driven science. It is being developed at, and administered by, the Institute for Data Intensive Engineering and Science (IDIES) at Johns Hopkins University. SciServer is funded by the National Science Foundation Award ACI-1261715.