Recipients of the IDIES Seed Funding Awards

The IDIES Seed Funding Program Awards are competitive awards of $25,000. The Seed Funding initiative provided funding to the following data-intensive computing projects because they (a) involve areas relevant to IDIES and JHU institutional research priorities; (b) are multidisciplinary; and (c) build ideas and teams with good prospects for successful proposals to attract external research support by leveraging IDIES intellectual and physical infrastructure.

Spring 2017

Variational Bayes Gene Activity in Pattern Sets (VB-GAPS) bioinformatics algorithm for efficient precision medicine in oncology

Elana J. Fertig (Department of Oncology, School of Medicine), Raman Arora (Department of Computer Science, Whiting School of Engineering)

Scientists currently have unprecedented access to a wide variety of high-quality datasets collected from independent studies. However, standardized annotations are essential for meta-analyses, and this presents a problem because standards are often not used. Accurately combining records from diverse studies requires tedious and error-prone human curation, posing a significant time and cost barrier.

We propose a novel natural language processing (NLP) algorithm, Synthesize, that merges data annotations automatically. It is part of an open-source web application, Synthesizer, that allows the user to interact easily with the merged data visually. The Synthesize algorithm was used to merge varied cancer datasets as well as ecological datasets, and demonstrated high accuracy (on the order of 85–100%) when compared with manually merged data.
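
As a minimal sketch of the merging step (our illustration, not the published Synthesize code): map free-text annotations onto a canonical vocabulary by string similarity, deferring low-confidence cases to human review. The vocabulary, threshold, and names below are hypothetical.

```python
# Hypothetical sketch of annotation merging via string similarity,
# in the spirit of (but not identical to) the Synthesize algorithm.
from difflib import SequenceMatcher

CANONICAL = ["lung adenocarcinoma", "breast carcinoma", "glioblastoma"]

def best_match(annotation, vocabulary=CANONICAL, threshold=0.6):
    """Map a free-text annotation to its closest canonical term."""
    norm = annotation.strip().lower()
    scored = [(SequenceMatcher(None, norm, v).ratio(), v) for v in vocabulary]
    score, term = max(scored)
    return term if score >= threshold else None  # None -> needs human review

# Example: two studies that annotated the same disease differently.
print(best_match("Lung Adenocarcinoma (LUAD)"))  # -> lung adenocarcinoma
print(best_match("breast_carcinoma"))            # -> breast carcinoma
```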

EchoSIM: Multiscale Acoustic Simulations Integrated with Free-Flight Experiments for Echo Scene Analysis of an Echolocating Bat

Rajat Mittal (Department of Mechanical Engineering), Jung Hee Seo (Department of Mechanical Engineering), Cynthia F. Moss (Psychological and Brain Sciences), Susanne J. Sterbing-D’Angelo (Psychological and Brain Sciences)

Animals that rely on active sensing provide a powerful system to investigate the neural underpinnings of natural scene representation, as they produce the very signals that inform motor actions. Echolocating bats, for example, transmit sonar signals and process auditory information carried by returning echoes to guide behavioral decisions for spatial orientation. Bats compute the direction of objects from differences in echo intensity, spectrum, and timing at the two ears, while an object’s distance is measured from the time delay between sonar emission and echo return. Together, this acoustic information gives rise to a 3D representation of the world through sound, and measurements of sonar calls and echoes provide explicit data on the signals available to the bat for orienting in space.
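
Rendered as arithmetic (a minimal sketch with invented numbers, separate from EchoSIM): target range from the round-trip echo delay, and an interaural level difference as the intensity cue for direction.

```python
# Minimal sketch of the two acoustic cues described above, computed
# from synthetic echo measurements. All numbers are invented.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at ~20 C

def target_range(echo_delay_s):
    """Distance from the round-trip delay between call emission and echo."""
    return SPEED_OF_SOUND * echo_delay_s / 2.0

def interaural_level_difference(p_left, p_right):
    """Intensity cue for direction: level difference (dB) at the two ears."""
    return 20.0 * np.log10(p_right / p_left)

print(f"range: {target_range(0.01):.2f} m")  # 10 ms delay -> ~1.72 m
print(f"ILD:   {interaural_level_difference(0.8, 1.0):.2f} dB")
```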

In the present seed funding program, we propose to develop a first-of-its-kind, simulation-enabled method for echo scene analysis of an echolocating bat, based on computational acoustic simulations (we refer to this method as “EchoSIM”). The proposed method integrates tightly with free-flight laboratory assays of bats and takes as input variables such as the bat’s flight path, head and ear anatomy, position and orientation, and the sonar call waveform. The simulation results (3D echo scene and echo signal), together with the experimental measurements, will provide a unique and powerful integrated dataset that enables unprecedented analysis of active sensing and adaptive flight behavior of bats in complex environments.

An Iterative Approach to Integrating Environmental Genomics into Biogeochemical Models

Sarah Preheim (Department of Environmental Health and Engineering), Anand Gnanadesikan ( Department of Earth and Planetary Sciences)

Environmental policy is increasingly based on results from computer simulations, but more integration between models and observations is needed to make sound decisions. For example, the Environmental Protection Agency (EPA) regularly uses models to set the total maximum daily load (TMDL) limits for nutrients entering watersheds, such as the Chesapeake Bay, with the goal of making all waterways in the US fishable and swimmable under the Clean Water Act. Predictions used for policy decisions are typically informed by a series of models, refined by observations, and represent input from a variety of scientists.

We propose to optimize the integration of sequence-based approaches into biogeochemical models, with specific application to ChesROMs, a model of the Chesapeake Bay dead zone. Run-off from agricultural and urban areas pollutes the Bay’s surface waters with nitrogen and phosphorus. This pollution drives harmful algal blooms that have devastating consequences for ecosystems and threaten public health. One major consequence of pollution is the development of oxygen-free (anoxic) or reduced-oxygen (hypoxic) dead zones that degrade the habitat of many aquatic animals. An interdisciplinary approach to this problem is essential because the physical environment and microbial processes are inextricably linked. Physical stratification within the water column, set by salinity and temperature gradients, determines the extent of vertical mixing between the upper and lower water bodies. Microbial processes are sensitive to this mixing, which adjusts not only growth but also the specific metabolic pathways used. Denitrification and dissimilatory nitrate reduction to ammonia are two processes that can be very sensitive to the physical environment, yet they determine the fate of the nitrogen that fuels algal growth. Integrating an understanding of the physical environment and microbial processes is vital for improved predictions.
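
To make the mixing-microbe coupling concrete, here is a deliberately simplified two-box sketch (our illustration, not ChesROMs): bottom-water oxygen is set by vertical mixing against a fixed microbial respiration rate. All rates and units are invented.

```python
# Toy two-box model of the coupling described above (not ChesROMs):
# bottom-water oxygen balances vertical mixing and microbial respiration.
import numpy as np

def bottom_oxygen(mixing_rate, respiration=0.5, o2_surface=8.0,
                  o2_init=8.0, dt=0.1, days=60):
    """Integrate dO2/dt = mixing*(surface - O2) - respiration (mg/L/day)."""
    o2 = o2_init
    series = []
    for _ in range(int(days / dt)):
        o2 += dt * (mixing_rate * (o2_surface - o2) - respiration)
        o2 = max(o2, 0.0)  # dissolved oxygen cannot go negative
        series.append(o2)
    return np.array(series)

# Strong stratification (weak mixing) drives the bottom box hypoxic.
print(bottom_oxygen(mixing_rate=0.02)[-1])  # approaches anoxia
print(bottom_oxygen(mixing_rate=0.50)[-1])  # stays well oxygenated
```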

New Tools for an Old Problem: Building a Global and Historical Data Set of Social Unrest

Beverly J. Silver (Professor and Chair, Sociology Department; Director, Arrighi Center for Global Studies), Sahan Savas Karatasli (Sociology and Arrighi Center for Global Studies), Christopher Nealon (Professor and Chair, English Department)

The purpose of the seed proposal is to develop methods to semi-automate the collection of data on protest and other events from newspapers and similar sources, with the goal of both reducing the time and increasing the accuracy of coding event information (e.g., location, actors, actions, demands). Most existing social science research in this area automates the data collection process, but does so at the cost of including an unacceptable level of false positives and failing to take advantage of the rich, detailed information in the newspaper articles themselves. Our current NSF-funded research on Global Social Protest uses search strings to extract relevant articles from digitized newspaper archives and relies on a custom-built website for data coding and analysis; however, to avoid the above-mentioned pitfalls it relies on human coding of articles (which is time-consuming). The seed project seeks to develop natural language processing tools that allow for a middle path between full automation and manual coding. In addition to English-language newspapers, we will run pilots on French, Japanese, Korean, and Spanish newspapers. The extension of the project to other languages allows us to widen and deepen ongoing international research collaborations.
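
As an illustration of that middle path (our sketch of the design, not the project’s code): machine pre-extraction of candidate event fields that a human coder then confirms. This assumes spaCy and its small English model are installed (pip install spacy; python -m spacy download en_core_web_sm).

```python
# Hedged sketch: NLP suggests event fields, a human coder verifies them.
import spacy

nlp = spacy.load("en_core_web_sm")

def pre_code(article_text):
    """Suggest location/actor candidates for human verification."""
    doc = nlp(article_text)
    return {
        "locations": [e.text for e in doc.ents if e.label_ in ("GPE", "LOC")],
        "actors": [e.text for e in doc.ents if e.label_ in ("ORG", "NORP")],
    }

print(pre_code("Dockworkers in Marseille struck on Tuesday, and the CGT "
               "union demanded higher wages."))
```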

Spring 2016

Towards the Johns Hopkins Ocean Circulation DataBase: Method Development and Prototype

Thomas Haine (Earth and Planetary Sciences), Gerard Lemson (Physics & Astronomy)

This seed grant project will pave the way to the implementation of an online benchmark ocean circulation solution. In the seed grant we will develop methods and protocols and implement a prototype solution with a much smaller data size. The target analytics services are listed below; a minimal sketch of one of them follows the list:

  • Extraction of sub-spaces of the solution state vector.
  • Computation of statistics on the extracted sub-spaces, like time series of heat content in a control volume.
  • Computation of oceanographic diagnostics like fluxes of volume, heat, and momentum.
  • Computation of conditional statistics, like the temperature on a surface conditioned on strong volume flux.
  • Computation of Lagrangian particle trajectories starting from arbitrary initial locations.
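
As a minimal sketch of the second service (heat content in a control volume), assuming the prototype exposes the solution as gridded NumPy arrays; array names, shapes, and constants below are illustrative.

```python
# Sketch: heat content time series over a rectangular control volume.
import numpy as np

RHO = 1025.0  # seawater density, kg/m^3
CP = 3990.0   # specific heat capacity, J/(kg K)

def heat_content_series(temp, cell_volume, subspace):
    """temp: (time, z, y, x) in degrees C; cell_volume: (z, y, x) in m^3.
    subspace: tuple of slices selecting the control volume."""
    t_sub = temp[(slice(None),) + subspace]   # extract sub-space of the state
    v_sub = cell_volume[subspace]
    return RHO * CP * (t_sub * v_sub).sum(axis=(1, 2, 3))  # J per snapshot

temp = np.random.uniform(2, 20, size=(10, 4, 8, 8))  # stand-in solution
vol = np.full((4, 8, 8), 1e9)                        # 1 km^3 cells
print(heat_content_series(temp, vol, (slice(0, 2), slice(2, 6), slice(2, 6))))
```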

Development of an iOS App and AWS Backend for New Data on Metabolic Syndrome

Jeanne Clark (SOM – General Internal Medicine), Thomas Woolf (Physiology & Computer Science), Yanif Ahmad (Computer Science)

Dr. Clark’s team is building ‘Metabolic Compass,’ a mobile health stack for investigating circadian rhythms and how our temporal decisions influence near- and long-term health. By tracking when people eat, when they sleep, and when they exercise through Apple’s HealthKit, they will collect a rich, open dataset for studying time-restricted feeding and intermittent fasting. Their data will allow users to ask and answer personalized health questions, such as “How much time should I leave between exercising and eating?” or “How early should I eat dinner before going to bed?”. Users will consent through Apple’s ResearchKit, enter data through activity trackers (e.g., FitBit, Jawbone) and third-party apps (e.g., MyFitnessPal, Argus), and compare their health against the study population through the project’s AWS cloud services. In addition to deploying on iOS, Dr. Clark’s team will explore an Android app to expand the user base during this proposal.
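
For illustration, one of those personalized questions answered from timestamped events of the kind HealthKit exports; the event encoding and data below are hypothetical.

```python
# Sketch: hours between finishing exercise and the next recorded meal.
from datetime import datetime

events = [  # (timestamp, kind), already sorted by time; invented data
    (datetime(2016, 5, 1, 7, 0),   "exercise_end"),
    (datetime(2016, 5, 1, 8, 15),  "meal"),
    (datetime(2016, 5, 1, 18, 30), "exercise_end"),
    (datetime(2016, 5, 1, 20, 0),  "meal"),
]

def exercise_to_meal_gaps(events):
    """Hours between each exercise end and the next recorded meal."""
    gaps, last_exercise = [], None
    for ts, kind in events:
        if kind == "exercise_end":
            last_exercise = ts
        elif kind == "meal" and last_exercise is not None:
            gaps.append((ts - last_exercise).total_seconds() / 3600)
            last_exercise = None
    return gaps

print(exercise_to_meal_gaps(events))  # [1.25, 1.5]
```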

Fusion Transcripts Bridge Chromatin Loops to Create Novel Proteins

Sarah J. Wheelan, MD, PhD (Institute of Genetic Medicine) and Michael C. Schatz, PhD (Department of Computer Science)

The non-contiguous nature of eukaryotic coding sequences generates immense protein and RNA diversity from one gene, and poses a challenge for scientists investigating gene function. Short-read sequencing captures tiny snapshots of the immense combinatorial problem; thus, we have likely identified only a small fraction of the functional transcripts in any cell. A novel mechanism is possible: chromatin structure places genes in physical proximity and creates opportunities for RNA-level rearrangements, without corresponding DNA rearrangements. These have been reported anecdotally and would be a mechanism for creating immense transcript diversity. Such transcripts may be detectable only in large, validated datasets, using fast and sensitive algorithms. Longer-read technology, well known to our group, may also be employed.
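
A deliberately simplified sketch of the detection idea: flag split reads whose two segments align to different genes and that lack a matching DNA-level rearrangement. The alignment tuples and gene names below are invented for illustration.

```python
# Sketch: candidate RNA-level fusions = split reads joining two genes
# with no supporting DNA rearrangement.
def candidate_fusions(split_reads, dna_breakpoints):
    """split_reads: list of (read_id, gene_of_segment1, gene_of_segment2).
    dna_breakpoints: set of gene pairs with known DNA-level rearrangement."""
    candidates = []
    for read_id, gene_a, gene_b in split_reads:
        if gene_a != gene_b and frozenset((gene_a, gene_b)) not in dna_breakpoints:
            candidates.append((read_id, gene_a, gene_b))  # RNA-level only
    return candidates

reads = [("r1", "TP53", "TP53"), ("r2", "KANSL1", "ARL17A"),
         ("r3", "BCR", "ABL1")]
dna = {frozenset(("BCR", "ABL1"))}  # DNA rearrangement -> excluded
print(candidate_fusions(reads, dna))  # [('r2', 'KANSL1', 'ARL17A')]
```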

Data Analytics of Enormous Graphs: From Theory to Practice

Vladimir Braverman, PhD (Department of Computer Science) and Carey Priebe, Professor (Department of Applied Mathematics and Statistics)

This research aims to deliver new streaming tools for statistical inference on massive graphs, as well as to address basic questions in statistics such as hypothesis testing. According to Dr. Braverman, preliminary results indicate that this direction is promising; in particular, the approach can distinguish between Erdős–Rényi and kidney-and-egg random graphs. This novel approach is based on efficient computation of the largest eigenvalues of streaming graphs. Dr. Braverman states, “We use a combination of measure of concentration tools with streaming algorithms for linear algebra, and we plan to extend these results to more general distributions and submit a white paper in August.”
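
A hedged, offline illustration of the idea (not the streaming algorithm itself): a planted dense “egg” inflates the largest adjacency eigenvalue relative to a homogeneous Erdős–Rényi graph. Graph sizes and densities below are invented.

```python
# Sketch: separate Erdos-Renyi from kidney-and-egg graphs by the
# largest eigenvalue of the adjacency matrix (offline version).
import numpy as np
import networkx as nx

def largest_eigenvalue(g):
    return np.linalg.eigvalsh(nx.to_numpy_array(g)).max()

def kidney_egg(n=500, p=0.05, egg_size=100, q=0.5, seed=0):
    """Erdos-Renyi "kidney" with a denser "egg" on the first egg_size nodes."""
    g = nx.erdos_renyi_graph(n, p, seed=seed)
    egg = nx.erdos_renyi_graph(egg_size, q, seed=seed + 1)
    g.add_edges_from(egg.edges())  # overlay the dense subgraph
    return g

print(largest_eigenvalue(nx.erdos_renyi_graph(500, 0.05, seed=0)))  # ~ n*p = 25
print(largest_eigenvalue(kidney_egg()))  # noticeably larger (~ egg_size*q)
```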

Spring 2015

Genome-wide Prediction of DNase I Hypersensitivity and Transcription Factor Binding Sites Based on Gene Expression

Hong Kai Ji (Biostatistics), Ted Dawson (Neurology (SOM), Neurology (JHH), Neuroscience (SOM)), Valina Dawson (Neurology (SOM), Neuroscience (SOM), Physiology (SOM))

In this project the investigators will develop a data science approach for studying global gene regulation. They will utilize massive amounts of publicly available functional genomic data to build computational models to predict genome-wide cis-regulatory element activities based on gene expression data. The investigators will develop new high-dimensional regression and prediction methods for big data and test the feasibility of predicting cis-regulatory element activities in samples where the available material is insufficient for conventional ChIP-seq and DNase-seq experiments.
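
As a minimal sketch of the prediction setup (our illustration, not the investigators’ method): a penalized regression from gene expression to the activity of a single regulatory element, trained on simulated stand-ins for samples that have both data types.

```python
# Sketch: predict one regulatory element's activity from expression.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_genes = 200, 1000
expression = rng.normal(size=(n_samples, n_genes))  # RNA-seq features
weights = np.zeros(n_genes)
weights[:20] = rng.normal(size=20)                  # a few informative genes
dnase = expression @ weights + 0.1 * rng.normal(size=n_samples)

x_tr, x_te, y_tr, y_te = train_test_split(expression, dnase, random_state=0)
model = Ridge(alpha=10.0).fit(x_tr, y_tr)
print(f"held-out R^2: {model.score(x_te, y_te):.2f}")
# A sample lacking DNase-seq can then be scored from expression alone:
print(model.predict(x_te[:1]))
```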

Cost-Sensitive Prediction: Applications in Healthcare

Daniel Robinson (Dept. of Applied Mathematics & Statistics), Suchi Saria (Dept. of Computer Science)

Advances in model prediction for problems that have a non-trivial cost structure are needed. In healthcare, the financial, nurse-time, and wait-time costs share a complicated dependency with the clinical measurements needed and the medical tests performed. In 2014, healthcare spending in the United States came to 17% of GDP, a total annual expenditure of $3.1 trillion. It is estimated that between one-fourth and one-third of this amount was unnecessary, with most attributed to avoidable testing and diagnostic costs. Therefore, the design of new cost-sensitive models that faithfully reflect the preferences of a user is paramount. We will develop such models, and new optimization algorithms to solve them, that give better predictions at lower costs, incorporate a patient’s preferences, and assist in personalized healthcare.
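
One toy instance of such a cost structure (invented costs and gains, not the proposed models): choose the tests that buy the most predictive value per dollar under a budget.

```python
# Sketch: greedy selection of medical tests under a cost budget.
def greedy_test_selection(tests, budget):
    """tests: {name: (cost_usd, expected_accuracy_gain)}."""
    chosen, spent = [], 0.0
    remaining = dict(tests)
    while remaining:
        # Most accuracy gain per dollar among the still-affordable tests.
        affordable = {k: v for k, v in remaining.items()
                      if spent + v[0] <= budget}
        if not affordable:
            break
        name = max(affordable, key=lambda k: affordable[k][1] / affordable[k][0])
        spent += remaining.pop(name)[0]
        chosen.append(name)
    return chosen, spent

tests = {"blood_panel": (50, 0.10), "mri": (1200, 0.15),
         "biopsy": (800, 0.12), "ecg": (150, 0.08)}
print(greedy_test_selection(tests, budget=1000))
# -> (['blood_panel', 'ecg', 'biopsy'], 1000.0)
```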

Statistical Methods for Real-Time Monitoring of Physical Disability in Multiple Sclerosis

Vadim Zipunnikov (Biostatistics), Kathleen Zackowski (Motion Analysis Lab)

The lack of sensitive outcomes capable of detecting progression of Multiple Sclerosis (MS) is a primary limitation to the development of newer therapies. Wearables provide real-time objective measurement of physical activity of MS patients in a real-world context. We put forward a novel statistical framework that simultaneously characterizes multiple features of physical activity profiles over the course of a day as well as their day-to-day dynamics. The proposed framework will allow MS researchers to identify physical activity signatures that will distinguish between individuals with different MS types and will help to understand physical activity differences in disability progression.
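
As a toy illustration of the framework’s two layers (within-day profiles and day-to-day dynamics), a minimal sketch on simulated minute-level counts; all names and numbers are invented.

```python
# Sketch: within-day activity profile plus a between-day dynamics summary.
import numpy as np

rng = np.random.default_rng(1)
minutes_per_day, n_days = 1440, 14
counts = rng.poisson(lam=5.0, size=(n_days, minutes_per_day)).astype(float)

mean_profile = counts.mean(axis=0)   # within-day: average daily curve
daily_totals = counts.sum(axis=1)    # per-day activity summary
day_to_day_cv = daily_totals.std() / daily_totals.mean()  # day-to-day dynamics

print(f"peak activity minute: {mean_profile.argmax()}")
print(f"day-to-day coefficient of variation: {day_to_day_cv:.3f}")
```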

Fall 2014

Urban Planning in Baltimore City

Tamas Budavari (Dept. of Applied Mathematics & Statistics), Kathryn Edin (Dept. of Sociology), and Michael Braverman (Dept. of Housing & Community Development, Housing Authority of Baltimore City)

Our Vacant Housing Dynamics in Baltimore City Project aims to improve the quality of city life by integrating data-driven science with redevelopment policy and administration. Working with City officials, our goal is to better understand the dynamics of vacant housing in Baltimore City, measure the impact of current interventions, and hone decision- and policy-making with statistical analyses of available data. Addressing the vacancy crisis is essential to attracting and retaining people in Baltimore, a key goal formalized in the Grow Baltimore program.

Towards a Global, Streaming Data Exploration Testbed in Astrophysics

Brice Menard (Dept. of Physics & Astronomy), Yanif Ahmad (Dept. of Computer Science), and Raman Arora (Dept. of Computer Science)

The astronomical data space has grown dramatically over the past fifteen years, thanks to detector technology and space-based observations opening up new wavelength channels. Surprisingly, attempts to characterize and represent the data globally have been rather limited. With this project, we propose to: (i) identify a standard set of operations for looking globally at datasets; (ii) explore the potential of various techniques used in statistics and machine learning; (iii) define and build efficient tools for conducting global data exploration on one dataset or a combination of them. The goal of this project is to develop a preliminary package allowing a user to perform global data exploration and gain knowledge of the content of the data space.
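
One hedged example of such a “global” operation (an assumption about what the package might include, not its actual design): project an entire homogenized catalog onto a few principal components to see its overall structure. The input array is a simulated stand-in.

```python
# Sketch: a global low-dimensional view of a dataset via PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
catalog = rng.normal(size=(10000, 50))  # objects x measured features

pca = PCA(n_components=3).fit(catalog)
embedding = pca.transform(catalog)      # (10000, 3) global view
print(pca.explained_variance_ratio_)    # structure captured by 3 axes
```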

A Modeling Enabled Database for Aneurysm Hemodynamics and Risk Stratification

Jung Hee Seo (Dept. of Mechanical Engineering), Rajat Mittal (Dept. of Mechanical Engineering), Rafael Tamargo (Dept. of Neurosurgery & Otolaryngology), and Justin Caplan (Dept. of Neurosurgery)

Prompt and accurate stratification of rupture risk is the “holy grail” in treating intracranial aneurysms. Physics-based computational models of aneurysm biomechanics, including simulation of the blood flow field and its effect on the vascular structures, hold great promise in this context, but large sample sizes are essential for developing insights and reliable statistical correlations/metrics for rupture risk. In this project, we will develop computational modeling approaches designed from the ground up to process the large samples of patient data that are essential for developing a computer-aided risk stratification method.
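
For illustration, two conventional hemodynamic metrics that such a pipeline could extract per patient; the formulas below are the standard definitions of time-averaged wall shear stress (TAWSS) and the oscillatory shear index (OSI), not a claim about this project’s exact metrics.

```python
# Sketch: TAWSS and OSI from simulated wall shear stress time series.
import numpy as np

def tawss_and_osi(wss):
    """wss: (time, points, 3) wall shear stress vectors over one cycle."""
    mean_vec_mag = np.linalg.norm(wss.mean(axis=0), axis=-1)  # |mean(tau)|
    mean_mag = np.linalg.norm(wss, axis=-1).mean(axis=0)      # mean(|tau|)
    tawss = mean_mag
    osi = 0.5 * (1.0 - mean_vec_mag / mean_mag)               # ranges 0..0.5
    return tawss, osi

wss = np.random.default_rng(3).normal(size=(100, 5, 3))  # fake simulation
tawss, osi = tawss_and_osi(wss)
print(tawss.round(2), osi.round(2))
```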

Optimized Empirical-statistical Downscaling of Global Climate Model Ensembles for Climate Change Impacts Analysis

Benjamin Zaitchik (Dept. of Earth & Planetary Sciences), Seth Guikema (Dept. of Geography & Engineering), and Sharon Gourdji (International Center for Tropical Agriculture (CIAT), Cali, Colombia)

One of the greatest challenges in climate science today is the call to provide actionable information for adaptation to climate change. This is a particularly difficult problem because Global Climate Models (GCMs) are poorly suited to predicting climate impacts of interest at local scale. This means that GCM projections must be “downscaled” to the local environment, often through statistical methods. This seed grant is motivated by the recognition that existing statistical downscaling systems suffer from subjective and incomplete selection of predictor fields. To address this limitation we are implementing an automated statistical downscaling system that combines optimization with predictive modeling driven by statistical learning theory. This system will generate predictive models informed by multiple modeling approaches and a diverse and expandable library of gridded predictor fields.
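
A minimal sketch of the empirical-statistical core, assuming predictors and observations arrive as plain arrays: fit a regression from coarse GCM predictor fields to a local station series, then apply it to projected fields. All data below are simulated stand-ins.

```python
# Sketch: statistical downscaling as regression from GCM predictors
# to a local station variable.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n_months = 360
gcm_predictors = rng.normal(size=(n_months, 12))  # e.g., regional T, RH, winds
station_temp = gcm_predictors[:, 0] * 2.0 + 15.0 + rng.normal(0, 0.5, n_months)

model = LinearRegression().fit(gcm_predictors, station_temp)
future_gcm = rng.normal(size=(120, 12))     # projected predictor fields
local_projection = model.predict(future_gcm)  # downscaled local series
print(local_projection[:3].round(2))
```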

Spring 2014

SIRENIC: Stream Infrastructure for the Real-time Analysis of Intensive Care Unit Sensor Data

Yanif Ahmad (Dept. of Computer Science), Raimond Winslow (Dept. of Biomedical Engineering), and Yair Amir (Dept. of Computer Science)

We are designing Sirenic as an open-source data streaming infrastructure for the real-time analysis of patient physiological data in intensive care units. Sirenic exploits the systems specialization and scaling capabilities enabled by our K3 declarative systems compilation framework to realize orders-of-magnitude gains in data throughput over current-generation stream and database systems. Our proposal aims to deliver a proof-of-concept data collection and analysis pipeline to support exploratory research activities in ICU healthcare, with the explicit capability to operate on live data and to empower alarms research and event detection in a real-time setting.
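
A plain-Python stand-in (not K3 code) for the kind of streaming operator such a pipeline would host: an online detector that flags sustained excursions in a physiological signal. Thresholds and data are invented.

```python
# Sketch: online detection of sustained excursions in a vitals stream.
def sustained_excursions(samples, threshold, min_run):
    """Yield (start_index, run_length) for runs above threshold, as they end."""
    start = None
    for i, value in enumerate(samples):
        if value > threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_run:
                yield (start, i - start)  # candidate alarm event
            start = None

heart_rate = [72, 75, 130, 132, 135, 131, 80, 78, 128, 79]
print(list(sustained_excursions(heart_rate, threshold=120, min_run=3)))
# -> [(2, 4)]; the single spike at index 8 is too short to alarm
```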

Alignment to The Cancer Genome Atlas Project Raw Sequencing Reads (8948 Samples and Counting)

Sarah Wheelan (Dept. of Oncology) and Srinivasan Yegnasubramanian (Dept. of Oncology)

With skyrocketing amounts of whole-genome sequence and phenotype data available from individuals’ germline and diseased cells, we need a new framework for understanding genomics data. Using the Data-Scope (a data-intensive supercomputer, funded by the NSF), we aim to detect sets of nucleotide-level variations that best classify given phenotypes. Next, we can find covarying or spatially correlated genomic variations across the entire dataset or within phenotypes. Our final goal, and the most powerful application of these data and algorithms, is to use unsupervised methods to delineate genomic variants that discriminate subsets of the data, without regard to phenotypes.
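
A hedged sketch of that final, unsupervised step on a simulated binary variant matrix; the clustering method and planted signal are illustrative choices, not the project’s algorithm.

```python
# Sketch: cluster samples on variants alone, without phenotype labels.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
n_samples, n_variants = 300, 2000
variants = (rng.random((n_samples, n_variants)) < 0.01).astype(float)
variants[:150, :40] += (rng.random((150, 40)) < 0.4)  # planted subtype signal
variants = np.clip(variants, 0, 1)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(variants)
print(np.bincount(labels))  # two data-driven groups, found label-free
```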

The Elusive Onset of Turbulence And The Laminar-Turbulence Interface

Tamer A. Zaki (Dept. of Mechanical Engineering) and Gregory Eyink (Dept. of Applied Mathematics & Statistics)

The onset of chaotic fluid motion from an initially laminar, organized state is an intriguing phenomenon referred to as laminar-to-turbulence transition. Early stages involve the amplification of seemingly innocuous small-amplitude perturbations. Once these disturbances reach appreciable amplitudes, they become host to sporadic bursts of turbulence — a chaotic state whose complexity is only tractable by high-fidelity large-scale simulations. By performing direct numerical simulations that resolve the dynamics of laminar-to-turbulence transition in space and time, and storing the full history of the flow evolution, we capture the rare high-amplitude events that give way to turbulence and unravel key characteristics of the laminar-turbulence interface.

Highly Scalable Software for Analyzing Large Collections of RNA Sequencing Data

Ben Langmead, PhD (Dept. of Computer Science) and Jeffrey Leek, PhD (Dept. of Biostatistics)

We are developing a radically scalable software tool, Rail-RNA, for analysis of large RNA sequencing datasets. Rail-RNA will make it easy for researchers to re-analyze published RNA-seq datasets. It will be designed to analyze many datasets at once, applying an identical analysis method to each so that results are comparable. This enables researchers to perform several critical scientific tasks that are currently difficult, including (a) reproducing results from previous large RNA-seq studies, (b) comparing datasets while avoiding bioinformatic variability, and (c) studying systematic biases and other effects (e.g., lab and batch effects) that can confound conclusions when disparate datasets are combined.

FragData: High-fidelity Data on Dynamic Fragmentation of Brittle Materials

Nitin Daphalapurkar (Dept. of Mechanical Engineering) and Lori Graham-Brady (Dept. of Civil Engineering)

Professors Daphalapurkar and Graham-Brady of the Hopkins Extreme Materials Institute are constructing a massive dynamic-fragmentation database (FragData) for materials undergoing failure in critical applications. They envisage that FragData will help expand understanding of the mechanics of failure processes associated with, for example, the disruption of asteroids, the fragmentation of protection materials under impact, and debris formation in construction materials under catastrophic loading. The idea is to make the database openly accessible, provide tools for in situ analysis, and have the database serve as a central platform for other researchers to interpret the massive data produced by state-of-the-art particle-based and finite-element simulation techniques.