recount2 is an online resource consisting of RNA-seq gene and exon counts as well as coverage bigWig files for 2041 different studies. It is the second generation of the ReCount project. The raw sequencing data were processed with Rail-RNA, as described in the recount2 paper and in Nellore et al, Genome Biology, 2016, which created the coverage bigWig files. For ease of statistical analysis, we created count tables at the gene and exon levels for each study and extracted phenotype data, which we provide in their raw formats as well as in RangedSummarizedExperiment R objects (described in the SummarizedExperiment Bioconductor package). We also computed the mean coverage per study and provide it in a bigWig file, which can be used with the derfinder Bioconductor package to perform annotation-agnostic differential expression analysis at the expressed-region level, as described in Collado-Torres et al, Nucleic Acids Research, 2017. The count tables, RangedSummarizedExperiment objects, phenotype tables, sample bigWigs, mean bigWigs, and file information tables are ready to use and freely available here. We also created the recount Bioconductor package, which allows you to search and download the data for a specific study. By taking care of several preprocessing steps and combining many datasets into one easily accessible website, we make finding and analyzing RNA-seq data considerably more straightforward.
If you need help with the recount Bioconductor package please get in touch via the Bioconductor support website (remember to use the recount tag). Please check this post on how to ask for help for Bioconductor packages. For support on reproducing the recount2 project, please get in touch via the recount-website repository.
This tab shows the information for the GTEx project. Due to its size, we also provide ranged summarized experiment objects (RSE) by tissue at the gene and exon levels.
This tab shows the information for the TCGA project. Due to its size, we also provide ranged summarized experiment objects (RSE) by tissue at the gene and exon levels.
All columns of the table below are sortable and searchable. The columns are as follows:
The SRA accession identifier for the study. The link points to SRA for a full description of the study.
The total number of samples available for the given study. Note that in some exceptional cases not all samples for a given study were analyzed with Rail-RNA.
The species of the samples under study.
The abstract describing the study as provided by SRA via the SRAdb Bioconductor package.
The RangedSummarizedExperiment object for the counts summarized at the gene level using the Gencode v25 (GRCh38.p7, CHR) annotation as provided by Gencode. Note that the GRanges object recount::recount_genes includes bp_length as a metadata column which is the sum of the exon widths. If two exons are overlapping, the overlapped bases are only counted once.
A tsv file with the count matrix used to create the RangedSummarizedExperiment object at the gene level. Version 2 files include the gene ids in an extra column.
The RangedSummarizedExperiment object for the counts summarized at the exon level using the Gencode v25 (GRCh38.p7, CHR) annotation as provided by Gencode. This GRangesList object recount::recount_exons has 1 element per gene. For each gene this object contains the reduced exons (version 1) or disjoint exons (version 2) such that they are non-overlapping within a gene. See the version section below for more information.
A tsv file with the count matrix used to create the RangedSummarizedExperiment object at the exon level.
The RangedSummarizedExperiment object for the counts summarized at the junction level. This GRanges object has 1 element per junction. For each junction this object contains the transcript names for the Gencode v24 junctions, transcript names and gene ids matching those used in the RSE objects at the gene or exon levels (based on Gencode v25 CHR regions), junction class, proposed gene ids and symbols. The junction ids match those used in the jx_cov and jx_bed files explained below. This file is present only if the project has at least one junction detected.
A tsv file with the count matrix used to create the RangedSummarizedExperiment object at the junction level. This file is present only if the project has at least one junction detected.
A RangedSummarizedExperiment object with the transcript quantifications as done by Fu et al, bioRxiv, 2018.
The link jx_bed points to the BED file with one entry per junction present in the given project. The name of the junction includes the junction id, the donor, acceptor, and overall (if present in both) transcript names based on Gencode v24. See the known issues for an important detail on these BED files.
The phenotype information (sample metadata) in a tsv file used for both RangedSummarizedExperiment objects. The table includes the SRA study id, the SRA sample id, the SRA experiment id, the SRA run id, the read count as reported by SRA, the number of reads aligned by Rail-RNA, the proportion of reads reported by SRA that aligned, whether the sample was paired-end or not, whether we think SRA misreported the paired-end label, the mapped read count from Rail-RNA, the coverage AUC, the SHARQ prototype tissue, the SHARQ prototype cell type, the biosample submission date, the biosample publication date, the biosample update date, the average read length, the GEO accession id, the sample title as extracted from GEO, the sample characteristics as extracted from GEO, and the name of the coverage bigWig file.
A tsv file with the names of the uploaded files, the md5sum, the size in bytes and the url to download the actual files. Version 2 includes the md5sum for the transcript files.
The Gencode v25 GFF3 file with the comprehensive gene annotation (CHR regions) is available at the Gencode website.
RangedSummarizedExperiment object with the FANTOM-CAT/recount2 annotation counts. See Imada, Sanchez et al, bioRxiv, 2019 for more information on how this annotation was determined. The rowRanges() slot is a GRangesList with one element per gene, each containing the exon coordinates for that gene. For the md5sum and file size information for these files, please check the fc_rc_files_info.tsv file.
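The md5sum values listed in the file information tables can be used to confirm that a download is intact. The following is a minimal base R sketch of that check. The helper name check_md5 is hypothetical, and the temporary file and its "listed" checksum are stand-ins created locally for illustration; in practice the expected value would be read from the file information table.

```r
## Hypothetical helper: compare a file's md5sum against an expected value
check_md5 <- function(path, expected) {
    identical(unname(tools::md5sum(path)), expected)
}

## Stand-in for a downloaded file (made-up content)
downloaded <- tempfile(fileext = ".tsv")
writeLines("sample1\tsample2", downloaded)

## Stand-in for the md5sum listed in the file information table
listed_md5 <- unname(tools::md5sum(downloaded))

## An intact file matches the listed checksum
ok_before <- check_md5(downloaded, listed_md5)

## Simulate a corrupted download by appending to the file
cat("unexpected extra line\n", file = downloaded, append = TRUE)
ok_after <- check_md5(downloaded, listed_md5)

c(intact = ok_before, corrupted = ok_after)
```

If the second check fails for a real download, downloading the file again is usually the simplest remedy.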
Check out the recount Bioconductor package for how to download data from the recount2 project and get started with your own analysis. The quick start vignette is particularly helpful. For more details check the SummarizedExperiment Bioconductor package for an overview on RangedSummarizedExperiment objects. Most RNA-seq differential expression Bioconductor packages use them. In particular, the vignette of the DESeq2 Bioconductor package shows in detail how to perform an analysis with them. You might also want to use the DEFormats Bioconductor package for converting these objects to other formats.
In version 1, exon counts are derived from reduced exons, such that each exonic base is only counted once.
In version 2, exon counts are derived from disjoint exons, which also results in each exonic base being counted just once. However, disjoint exons are more useful than reduced exons because it is possible to reconstruct the counts for the actual exons from them. The following code might be helpful to understand the difference.
Transcript RSE files were re-calculated by Fu et al. For details check the second version of the pre-print.
Raw code:
library("GenomicRanges")
exons <- GRanges("seq", IRanges(start = c(1, 1, 13), end = c(5, 8, 15)))
exons
## Results in 2 reduced exons. Cannot get the counts for exons 1 or 2.
reduce(exons)
## Results in 3 disjoint exons. The sum of disjoint exon 1 and 2 is equal to exon 2.
disjoin(exons)
Output:
exons
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] seq [ 1, 5] *
[2] seq [ 1, 8] *
[3] seq [13, 15] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
## Results in 2 reduced exons. Cannot get the counts for exons 1 or 2.
reduce(exons)
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] seq [ 1, 8] *
[2] seq [13, 15] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
## Results in 3 disjoint exons. The sum of disjoint exon 1 and 2 is equal to exon 2.
disjoin(exons)
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] seq [ 1, 5] *
[2] seq [ 6, 8] *
[3] seq [13, 15] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
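To make the reconstruction claim concrete, here is a small sketch that rebuilds counts for the original exons from counts on the disjoint exons. It reuses the toy exons from the example above; the count values are made up, and the sketch assumes that counts simply add up across non-overlapping pieces. It is an illustration, not the procedure used to build the recount2 files.

```r
library("GenomicRanges")

## Same toy exons as in the example above
exons <- GRanges("seq", IRanges(start = c(1, 1, 13), end = c(5, 8, 15)))
disjoint_exons <- disjoin(exons)

## Made-up counts for the 3 disjoint exons (illustration only)
disjoint_counts <- c(10, 4, 7)

## Find which disjoint pieces make up each original exon
hits <- findOverlaps(exons, disjoint_exons)

## Sum the disjoint counts within each original exon
exon_counts <- tapply(
    disjoint_counts[subjectHits(hits)],
    queryHits(hits),
    sum
)
exon_counts
```

With these made-up values, exon 1 gets 10, exon 2 gets 10 + 4 = 14, and exon 3 gets 7. The same cannot be done with the reduced exons, where exons 1 and 2 are merged into a single range.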
We realized the advantage of disjoint exons during the peer review of the recount workflow paper. For backward compatibility, we provide both versions of the files. The disjoint exon count files are larger than the reduced exon count files. The gene counts did not change between versions; only a handful of gene symbols did, and we updated the gene files accordingly. The gene count text files (counts_gene.tsv.gz) now include the gene ids as an extra column, as requested by several users. The file information document now includes the md5sum for the new files (including the transcript files).
The following R code shows how to use the recount Bioconductor package for downloading data. In this example we will download the data for study SRP009615.
## Install recount from Bioconductor
install.packages("BiocManager")
BiocManager::install('recount')
## Browse the vignettes for a quick description of how to use the package
library('recount')
browseVignettes('recount')
## Download the RangedSummarizedExperiment object at the gene level for
## study SRP009615
url <- download_study('SRP009615')
## View the url for the file by printing the object url
url
## Load the data
load(file.path('SRP009615', 'rse_gene.Rdata'))
## Scale counts
rse <- scale_counts(rse_gene)
## Then use your favorite differential expression software
## For more details, check the recount package vignette at
## http://bioconductor.org/packages/recount
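As a rough picture of why the counts are scaled before analysis, the sketch below rescales a made-up count matrix by each sample's coverage AUC (one of the phenotype columns listed above) so that samples sequenced at different depths become comparable. This is a simplified illustration in base R and not the implementation of scale_counts(); the target size of 40 million and all data values are assumptions chosen for the example, so check the recount documentation for the actual options and defaults.

```r
## Made-up coverage-based counts: 3 genes x 2 samples
counts <- matrix(
    c(2000, 500, 7500, 8000, 2000, 30000),
    nrow = 3,
    dimnames = list(paste0("gene", 1:3), c("sampleA", "sampleB"))
)

## Made-up coverage AUC per sample; sampleB has 4x the total coverage
auc <- c(sampleA = 1e6, sampleB = 4e6)

## Assumed target library size for this illustration
target_size <- 4e7

## Multiply each column by target_size / AUC for that sample, then round
scaled <- round(sweep(counts, 2, target_size / auc, `*`))
scaled
```

In this toy example the two samples have the same relative expression, so their scaled columns end up identical even though the raw values differ by a factor of four.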
The data in recount2 is licensed under CC BY 4.0. The legal text can be found here.
This research was supported by NIH R01 GM105705. LCT was supported by Consejo Nacional de Ciencia y Tecnología México 351535. LCT and AEJ were supported by NIH 1R21MH109956-01. Amazon Web Services experiments were supported by AWS in Education research grants. Storage costs on S3 for TCGA runs were partially covered by a grant from Seven Bridges Genomics for use of the Cancer Genomics Cloud.
recount2 is hosted on SciServer, a collaborative research environment for large-scale data-driven science. It is being developed at, and administered by, the Institute for Data Intensive Engineering and Science (IDIES) at Johns Hopkins University. SciServer is funded by the National Science Foundation Award ACI-1261715.