Deep Learning Frameworks for Regulatory Genomics

  • Anshul Kundaje of Stanford University
  • A Genomics@JHU Seminar
  • When: September 29, 2015, 10:00am
  • Where: The Barber Conference Room at Charles Commons
    10 E 33rd Street
    Baltimore, MD 21218
  • Light refreshments served at 9:30am

Abstract

Deep neural network approaches such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs) have resulted in dramatic performance improvements for several learning tasks in Natural Language Processing, Speech Processing and Computer Vision. We investigate the power of deep learning methods in the context of regulatory genomics and develop novel learning frameworks for integrating key functional genomic data types. Our primary objective is to decipher the relationships between regulatory sequence, transcription factor (TF) binding, nucleosome positioning, chromatin accessibility and histone modifications. First, using extensive simulations of regulatory DNA sequence, we evaluate the ability of deep CNNs and CNN-RNNs trained on raw sequence to learn different properties of transcription factor binding sites, including probabilistic affinity to sequence motifs, positional and density distributions of motifs, and combinatorial sequence grammars involving co-factor sequence preferences with spacing and order constraints. We leverage these architectures in a multi-task setting to learn predictive models of in vivo TF binding from ChIP-seq data for a large compendium of TFs across multiple cell types and tissues. Our results demonstrate that deep learning methods, especially CNN-RNNs, generalize significantly better than state-of-the-art approaches for modeling TF binding within and across cell types. We further develop novel methods for model exploration, visualization and feature selection to dissect the heterogeneity of the sequence code underlying direct and indirect TF binding.

Next, we investigate the relationship between chromatin accessibility, nucleosome positioning and chromatin state (histone marks). We train multi-task, multi-modal CNNs on a novel two-dimensional representation of ATAC-seq data that leverages subtle patterns in insert-size distributions to simultaneously predict multiple histone modifications, combinatorial chromatin state and CTCF binding sites with high accuracy. Models trained on a combination of DNase-seq and MNase-seq data achieve even higher accuracy, supporting a fundamental predictive mapping between local chromatin architecture and chromatin state. We use novel feature extraction and visualization methods to peer into the deep neural networks and identify predictive patterns reminiscent of nucleosomal asymmetry and TF footprints. Finally, we will discuss general strategies and easy-to-use software packages for rapid prototyping and learning of optimal deep architectures from functional genomic data.

Speaker Biography

Anshul Kundaje is an Assistant Professor of Genetics and Computer Science at Stanford University and a 2014 Alfred P. Sloan Fellow. His primary research interest is computational regulatory genomics. His lab develops statistical and machine learning methods for large-scale integrative analysis of diverse functional genomic data to decipher the heterogeneity of regulatory elements, uncover their long-range interactions in the context of 3D genome organization, learn transcriptional regulatory network models across cell types and understand the system-level regulatory impact of non-coding genetic variation. Anshul previously led the computational analysis efforts of the Encyclopedia of DNA Elements (ENCODE) Project and the Roadmap Epigenomics Project.
