Recipients of the IDIES Seed Funding Awards

The IDIES Seed Funding Program Awards are competitive awards of $25,000. The Seed Funding initiative provided funding to the following data-intensive computing projects because they (a) involve areas relevant to IDIES and JHU institutional research priorities; (b) are multidisciplinary; and (c) build ideas and teams with good prospects for attracting external research support by leveraging IDIES intellectual and physical infrastructure.

Spring 2016

Towards the Johns Hopkins Ocean Circulation DataBase: Method Development and Prototype

Thomas Haine (Earth and Planetary Sciences), Gerard Lemson (Physics & Astronomy)

This seed grant project will pave the way to the implementation of an online benchmark ocean circulation solution. During the seed grant we will develop methods and protocols and implement a prototype with a much smaller data size. The target analytics services are:

  • Extraction of sub-spaces of the solution state vector.
  • Computation of statistics on the extracted sub-spaces, like time series of heat content in a control volume.
  • Computation of oceanographic diagnostics like fluxes of volume, heat, and momentum.
  • Computation of conditional statistics, like the temperature on a surface conditioned on strong volume flux.
  • Computation of Lagrangian particle trajectories starting from arbitrary initial locations.
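As a toy illustration of the second service above, a time series of heat content in a control volume is a volume-weighted sum over a gridded temperature field. The array shapes, function name, and nominal constants below are illustrative assumptions, not the project's actual data schema:

```python
import numpy as np

# Nominal seawater constants (illustrative values)
RHO = 1025.0   # density, kg/m^3
CP = 3990.0    # specific heat capacity, J/(kg K)

def heat_content_series(theta, cell_volume, mask):
    """Heat content (J) per time step over a masked control volume.

    theta       : array (nt, nz, ny, nx), temperature in K
    cell_volume : array (nz, ny, nx), grid-cell volumes in m^3
    mask        : boolean array (nz, ny, nx), True inside the control volume
    """
    weights = np.where(mask, cell_volume, 0.0)
    # Sum rho * cp * T * V over the control volume at each time step.
    return RHO * CP * np.einsum('tzyx,zyx->t', theta, weights)
```

The same weighted-reduction pattern extends to the other statistics listed, with the mask standing in for the extracted sub-space of the state vector.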

Development of an iOS App and AWS Backend for New Data on Metabolic Syndrome

Jeanne Clark (SOM – General Internal Medicine), Thomas Woolf (Physiology & Computer Science), Yanif Ahmad (Computer Science)

Dr. Clark’s team is building ‘Metabolic Compass,’ a mobile health stack for investigating circadian rhythms and how our temporal decisions influence near- and long-term health. By tracking when people eat, sleep, and exercise through Apple’s HealthKit, the team will collect a rich, open dataset for studying time-restricted feeding and intermittent fasting. The data will allow users to ask and answer personalized health questions, such as “How much time should I leave between exercising and eating?” or “How early should I eat dinner before going to bed?” Users will consent through Apple’s ResearchKit, enter data through activity trackers (e.g., FitBit, Jawbone) and third-party apps (e.g., MyFitnessPal, Argus), and compare their health against the study population through the team’s AWS cloud services. In addition to deploying on iOS, Dr. Clark’s team will explore an Android app to expand the user base during this proposal.

Fusion Transcripts Bridge Chromatin Loops to Create Novel Proteins

Sarah J. Wheelan, MD, PhD (Institute of Genetic Medicine) and Michael C. Schatz, PhD (Department of Computer Science)

The non-contiguous nature of eukaryotic coding sequences generates immense protein and RNA diversity from one gene, and poses a challenge for scientists investigating gene function. Short-read sequencing captures tiny snapshots of the immense combinatorial problem; thus, we have likely identified only a small fraction of the functional transcripts in any cell. A novel mechanism is possible: chromatin structure places genes in physical proximity and creates opportunities for RNA-level rearrangements, without corresponding DNA rearrangements. These have been reported anecdotally and would be a mechanism for creating immense transcript diversity. Such transcripts may be detectable only in large and validated datasets, by fast and sensitive algorithms. Longer-read technology, well known to our group, may also be employed.

Data Analytics of Enormous Graphs: From Theory to Practice

Vladimir Braverman, PhD (Department of Computer Science) and Carey Priebe, Professor (Department of Applied Mathematics and Statistics)

This research aims to deliver new streaming tools for statistical inference on massive graphs, as well as to address basic questions in statistics such as hypothesis testing. According to Dr. Braverman, preliminary results indicate that this direction is promising; in particular, the approach can distinguish between Erdős–Rényi and kidney-and-egg random graphs. This novel approach is based on efficient computation of the largest eigenvalues of streaming graphs. Dr. Braverman states: “We use a combination of measure-of-concentration tools with streaming algorithms for linear algebra, and we plan to extend these results to more general distributions and submit a white paper in August.”
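The eigenvalue-based separation described here can be illustrated with a small, non-streaming toy example: in an Erdős–Rényi graph the largest adjacency eigenvalue tracks the average degree, while the dense planted subgraph (the “egg”) of a kidney-and-egg graph inflates it. The function names and parameters below are an illustrative sketch under assumed graph sizes, not the authors' streaming algorithm:

```python
import random
import numpy as np

def largest_eigenvalue(n, edges, iters=200):
    """Largest adjacency eigenvalue via power iteration on an edge list."""
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    x = np.ones(n) / np.sqrt(n)
    for _ in range(iters):
        y = A @ x
        nrm = np.linalg.norm(y)
        if nrm == 0:
            return 0.0
        x = y / nrm
    return float(x @ A @ x)  # Rayleigh quotient at the converged vector

def er_edges(n, p, rng):
    """Erdos-Renyi G(n, p): each pair connected independently with prob p."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

def kidney_egg_edges(n, p, egg, q, rng):
    """Kidney-and-egg: G(n, p) background plus a dense G(egg, q) on the
    first `egg` vertices, with q >> p."""
    edges = er_edges(n, p, rng)
    edges += [(i, j) for i in range(egg) for j in range(i + 1, egg)
              if rng.random() < q]
    return edges
```

For G(n, p) the largest eigenvalue concentrates near np, while the egg contributes roughly egg * q, so choosing egg * q well above np makes the two models separable from this single statistic.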

Spring 2015

Genome-wide Prediction of DNase I Hypersensitivity and Transcription Factor Binding Sites Based on Gene Expression

Hongkai Ji (Biostatistics), Ted Dawson (Neurology (SOM), Neurology (JHH), Neuroscience (SOM)), Valina Dawson (Neurology (SOM), Neuroscience (SOM), Physiology (SOM))

In this project the investigators will develop a data science approach for studying global gene regulation. They will utilize massive amounts of publicly available functional genomic data to build computational models to predict genome-wide cis-regulatory element activities based on gene expression data. The investigators will develop new high-dimensional regression and prediction methods for big data and test the feasibility of predicting cis-regulatory element activities in samples where the available material is insufficient for conventional ChIP-seq and DNase-seq experiments.

Cost-Sensitive Prediction: Applications in Healthcare

Daniel Robinson (Dept. of Applied Mathematics & Statistics), Suchi Saria (Dept. of Computer Science)

Advances in model prediction are needed for problems that have a non-trivial cost structure. In healthcare, financial, nurse-time, and wait-time costs share a complicated dependency with the clinical measurements needed and the medical tests performed. In 2014, healthcare spending in the United States came to 17% of GDP, a total annual expenditure of $3.1 trillion. It is estimated that between one-fourth and one-third of this amount was unnecessary, with most attributed to avoidable testing and diagnostic costs. Therefore, the design of new cost-sensitive models that faithfully reflect the preferences of a user is paramount. We will develop such models, along with new optimization algorithms to solve them, that give better predictions at lower cost, incorporate a patient’s preferences, and assist in personalized healthcare.
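One strand of the cost-sensitive idea can be sketched with asymmetric misclassification costs: if a missed diagnosis (false negative) is more expensive than an unnecessary test (false positive), the cost-minimizing decision threshold on a predicted risk score falls below 0.5. This is a generic textbook illustration under assumed cost values, not the awardees' model:

```python
import numpy as np

def expected_cost(threshold, scores, labels, c_fp, c_fn):
    """Average cost of thresholding risk scores, with asymmetric costs
    c_fp (false positive) and c_fn (false negative)."""
    pred = scores >= threshold
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    return (c_fp * fp + c_fn * fn) / len(labels)

def best_threshold(scores, labels, c_fp, c_fn):
    """Grid search for the threshold minimizing expected cost."""
    grid = np.linspace(0.0, 1.0, 101)
    costs = [expected_cost(t, scores, labels, c_fp, c_fn) for t in grid]
    return float(grid[int(np.argmin(costs))])
```

For calibrated risk scores, the optimal threshold is c_fp / (c_fp + c_fn); a heavy false-negative penalty pushes it low, i.e. toward ordering more follow-up tests.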

Statistical Methods for Real-Time Monitoring of Physical Disability in Multiple Sclerosis

Vadim Zipunnikov (Biostatistics), Kathleen Zackowski (Motion Analysis Lab)

The lack of sensitive outcomes capable of detecting progression of Multiple Sclerosis (MS) is a primary limitation to the development of newer therapies. Wearables provide real-time objective measurement of physical activity of MS patients in a real-world context. We put forward a novel statistical framework that simultaneously characterizes multiple features of physical activity profiles over the course of a day as well as their day-to-day dynamics. The proposed framework will allow MS researchers to identify physical activity signatures that will distinguish between individuals with different MS types and will help to understand physical activity differences in disability progression.
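As a hedged illustration of what day-level features of an activity profile can look like (the feature choices and names below are assumptions for the sketch, not the proposed framework), consider minute-level activity counts from a wearable:

```python
import numpy as np

def daily_features(counts):
    """Per-day summaries of minute-level wearable activity counts.

    counts : array (n_days, 1440), one row of minute counts per day.
    """
    active = counts > 0
    return {
        "total_volume": counts.sum(axis=1),    # overall daily activity
        "peak_minute": counts.argmax(axis=1),  # timing of the daily peak
        "active_minutes": active.sum(axis=1),  # time spent active
        # fragmentation: number of active-to-rest transitions in the day
        "active_to_rest": np.sum(active[:, :-1] & ~active[:, 1:], axis=1),
    }
```

Stacking such per-day summaries across days gives the kind of day-to-day dynamics the framework is designed to model.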

Fall 2014

Urban Planning in Baltimore City

Tamas Budavari (Dept. of Applied Mathematics & Statistics), Kathryn Edin (Dept. of Sociology), and Michael Braverman (Dept. of Housing & Community Development, Housing Authority of Baltimore City)

Our Vacant Housing Dynamics in Baltimore City Project aims to improve the quality of city life by integrating data-driven science with redevelopment policy and administration. Working with City officials, we aim to better understand the dynamics of vacant housing in Baltimore City, measure the impact of current interventions, and hone decision- and policy-making with statistical analyses of available data. Addressing the vacancy crisis is essential to attracting and retaining people in Baltimore, a key goal formalized in the Grow Baltimore program.

Towards a Global, Streaming Data Exploration Testbed in Astrophysics

Brice Menard (Dept. of Physics & Astronomy), Yanif Ahmad (Dept. of Computer Science), and Raman Arora (Dept. of Computer Science)

The astronomical data space has grown dramatically over the past fifteen years, thanks to detector technology and space-based observations opening up new wavelength channels. Surprisingly, attempts to characterize and represent the data globally have been rather limited. With this project, we propose to: (i) identify a standard set of operations for looking globally at datasets; (ii) explore the potential of various techniques used in statistics and machine learning; (iii) define and build efficient tools for conducting global data exploration given one dataset or a combination of them. The goal of this project is to develop a preliminary package allowing a user to perform global data exploration and gain knowledge of the content of the data space.

A Modeling Enabled Database for Aneurysm Hemodynamics and Risk Stratification

Jung Hee Seo (Dept. of Mechanical Engineering), Rajat Mittal (Dept. of Mechanical Engineering), Rafael Tamargo (Dept. of Neurosurgery & Otolaryngology), and Justin Caplan (Dept. of Neurosurgery)

Prompt and accurate stratification of rupture risk is the “holy grail” in treating intracranial aneurysms. Physics-based computational models of aneurysm biomechanics, including simulation of the blood flow field and its effect on the vascular structures, hold great promise in this context, but large sample sizes are essential for developing insights and reliable statistical correlations/metrics for rupture risk. In this project, we will develop computational modeling approaches designed from the ground up to process large samples of patient data, which are essential to developing a computer-aided risk stratification method.

Optimized Empirical-statistical Downscaling of Global Climate Model Ensembles for Climate Change Impacts Analysis

Benjamin Zaitchik (Dept. of Earth & Planetary Sciences), Seth Guikema (Dept. of Geography & Environmental Engineering), and Dr. Sharon Gourdji (International Center for Tropical Agriculture (CIAT), Cali, Colombia)

One of the greatest challenges in climate science today is the call to provide actionable information for adaptation to climate change. This is a particularly difficult problem because Global Climate Models (GCMs) are poorly suited for predicting climate impacts of interest at local scale. This means that GCM projections must be “downscaled” to the local environment, often through statistical methods. This seed grant is motivated by the recognition that existing statistical downscaling systems suffer from subjective and incomplete selection of predictor fields. To address this limitation we are implementing an automated statistical downscaling system that employs a combination of optimization and statistical learning theory driven predictive modeling. This system will generate predictive models informed by multiple modeling approaches and a diverse and expandable library of gridded predictor fields.

Spring 2014

SIRENIC: Stream Infrastructure for the Real-time Analysis of Intensive Care Unit Sensor Data

Yanif Ahmad (Dept. of Computer Science), Raimond Winslow (Dept. of Biomedical Engineering), and Yair Amir (Dept. of Computer Science)

We are designing Sirenic as open-source data streaming infrastructure for the real-time analysis of patient physiological data in intensive care units. Sirenic exploits systems specialization and scaling capabilities enabled by our K3 declarative systems compilation framework to realize orders of magnitude data throughput gains over current generation stream and database systems. Our proposal aims at delivering a proof-of-concept data collection and analysis pipeline to support exploratory research activities in ICU healthcare, with the explicit capability to operate on live data and to empower alarms research and event detection in the real-time setting.

Alignment to The Cancer Genome Atlas Project Raw Sequencing Reads (8948 Samples and Counting)

Sarah Wheelan (Dept. of Oncology) and Srinivasan Yegnasubramanian (Dept. of Oncology)

With skyrocketing numbers of whole genome sequence and phenotype data available from individuals’ germline and diseased cells, we need a new framework for understanding genomics data. Using the Data-Scope (a data-intensive supercomputer, funded by the NSF), we aim to detect sets of nucleotide-level variations that best classify given phenotypes. Next, we can find covarying or spatially correlated genomic variations across the entire dataset or within phenotypes. Our final goal, and the most powerful application of these data and algorithms, is to use unsupervised methods to delineate genomic variants that discriminate subsets of the data, without regard to phenotypes.

The Elusive Onset of Turbulence And The Laminar-Turbulence Interface

Tamer A. Zaki (Dept. of Mechanical Engineering) and Gregory Eyink (Applied Math and Statistics)

The onset of chaotic fluid motion from an initially laminar, organized state is an intriguing phenomenon referred to as laminar-to-turbulence transition. Early stages involve the amplification of seemingly innocuous small-amplitude perturbations. Once these disturbances reach appreciable amplitudes, they become host to sporadic bursts of turbulence, a chaotic state whose complexity is only tractable by high-fidelity large-scale simulations. By performing direct numerical simulations that resolve the dynamics of laminar-to-turbulence transition in space and time, and storing the full history of the flow evolution, we capture the rare high-amplitude events that give way to turbulence and unravel key characteristics of the laminar-turbulence interface.

Highly Scalable Software for Analyzing Large Collections of RNA Sequencing Data

Ben Langmead, PhD (Dept. of Computer Science) and Jeffrey Leek, PhD (Dept. of Biostatistics)

We are developing a radically scalable software tool, Rail-RNA, for the analysis of large RNA sequencing datasets. Rail-RNA will make it easy for researchers to re-analyze published RNA-seq datasets. It will be designed to analyze many datasets at once, applying an identical analysis method to each so that results are comparable. This enables researchers to perform several critical scientific tasks that are currently difficult, including (a) reproducing results from previous large RNA-seq studies, (b) comparing datasets while avoiding bioinformatic variability, and (c) studying systematic biases and other effects (e.g., lab and batch effects) that can confound conclusions when disparate datasets are combined.

FragData—High-fidelity Data on Dynamic Fragmentation of Brittle Materials

Nitin Daphalapurkar (Dept. of Mechanical Engineering), and Lori Graham-Brady (Dept. of Civil Engineering)

Professors Daphalapurkar and Graham-Brady of the Hopkins Extreme Materials Institute are constructing a massive dynamic-fragmentation database (FragData) for materials undergoing failure in critical applications. They envisage that FragData will help expand understanding of the mechanics of failure processes associated with, for example, disruption of asteroids, fragmentation of protection materials under impact, and debris formation of construction materials under catastrophic loading. The idea is to have the database openly accessible, have tools to carry out in situ analysis, and have the database serve as a central platform for other researchers to interpret the massive data produced by state-of-the-art particle-based and finite-element-based simulation techniques.