StereoGene: Rapid Estimation of Genomewide Correlation of Continuous or Interval Feature Data

Elena D. Stavrovskaya [1,2], Alexander V. Favorov* [3,4,5], Tejasvi Niranjan [6], Sarah J. Wheelan [6] and Andrey Mironov [1,2]

[1] Dept. of Bioengineering and Bioinformatics, Moscow State University, Moscow, Russia
[2] Institute for Information Transmission Problems, RAS, Moscow, Russia
[3] Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, The Johns Hopkins University, Baltimore, MD
[4] Laboratory of Systems Biology and Computational Genetics, Vavilov Institute of General Genetics, RAS, Moscow, Russia
[5] Laboratory of Bioinformatics, Research Institute of Genetics and Selection of Industrial Microorganisms, Moscow, Russia
[6] Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, The Johns Hopkins University School of Medicine


Motivation: High-throughput sequencing methods produce massive amounts of data. The most common first step in interpreting these data is to map them to genomic intervals and then overlap the intervals with genome annotations. A major interest in computational genomics is the spatial genome-wide correlation among genomic features (e.g. between transcription and histone modification). The key hypothesis is that features that are similarly distributed along a genome may be functionally related.

Results: Here, we propose a method that rapidly estimates the genomewide correlation of genomic annotations; these annotations can be derived from high-throughput experiments, databases, or other sources. The method goes beyond the commonly used overlap and proximity tests by correlating continuous data directly, so the loss of information that occurs when data are reduced to intervals is avoided. To include nonoverlapping but spatially related features in the analysis, we use kernel correlation. Our implementation performs correlation analysis of two or three profiles across the human genome in a few minutes on a personal computer. Another novel and powerful feature of our approach is the local correlation track output, which enables overlap with other correlations (correlation of correlations). We applied our method to datasets from the Human Epigenome Atlas and FANTOM CAGE. We observed changes in the correlation between epigenomic features across the developmental trajectories of several tissue types, and found an unexpectedly strong spatial correlation of CAGE clusters with splicing donor sites and with poly(A) sites.
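
The idea of kernel correlation and of a local correlation track can be illustrated with a minimal Python sketch. This is not the StereoGene implementation: the Gaussian kernel, the choice to smooth only one of the two profiles, the fixed non-overlapping windows, and all function and parameter names (`gaussian_kernel`, `kernel_correlation`, `local_correlation_track`, `sigma`, `window`) are assumptions made for illustration only.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized Gaussian kernel truncated at 4 sigma (illustrative choice)."""
    radius = int(4 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def pearson(a, b):
    """Pearson correlation of two equal-length profiles; 0.0 if either is constant."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def kernel_correlation(f, g, sigma):
    """Correlate f with a kernel-smoothed copy of g, so that features
    that are close but do not overlap still contribute."""
    g_smooth = np.convolve(g, gaussian_kernel(sigma), mode="same")
    return pearson(f, g_smooth)

def local_correlation_track(f, g, sigma, window):
    """Kernel correlation in consecutive non-overlapping windows.
    The result is itself a profile (one value per window)."""
    g_smooth = np.convolve(g, gaussian_kernel(sigma), mode="same")
    n_windows = len(f) // window
    return np.array([
        pearson(f[i * window:(i + 1) * window],
                g_smooth[i * window:(i + 1) * window])
        for i in range(n_windows)
    ])

# Toy profiles: every feature in g lies 5 positions downstream of a feature in f,
# so the two profiles never overlap exactly.
positions = np.arange(100, 2000, 200)
f = np.zeros(2000)
g = np.zeros(2000)
f[positions] = 1.0
g[positions + 5] = 1.0

plain = pearson(f, g)                          # ignores proximity
smoothed = kernel_correlation(f, g, sigma=5)   # accounts for proximity
track = local_correlation_track(f, g, sigma=5, window=200)
```

In this toy case the plain position-by-position correlation does not register the relationship because the features never coincide, whereas the kernel-smoothed version does. The per-window track is a profile in its own right and could be correlated with a further track in the same way.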

Availability: The StereoGene C++ source code, program documentation, Galaxy integration scripts and examples are available at the project homepage at