The following projects are run through IDIES.

recount2 provides genome coverage, gene counts, and exon counts obtained by Rail-RNA across almost 50,000 human RNA-seq samples publicly available on the Sequence Read Archive. These outputs are easily incorporated into downstream analyses using the R/Bioconductor package recount2.

Visit recount2
SciServer brings the analysis to the data. SciServer is a modular system of independent components that work together to create a full-featured research environment for data-intensive and computationally intensive science. Some modules were inherited from proven systems and optimized for SciServer, while others were developed specifically for SciServer. New modules are continually in development, including Tools and Libraries for IPython (Jupyter) Notebooks, Docker and more.

Visit SciServer
The Data-Scope is a computing instrument designed to enable data analysis tasks that were simply not possible before. The instrument’s unprecedented capabilities combine approximately five Petabytes of storage with a sequential IO bandwidth close to 500 GBytes/sec, and 600 Teraflops of GPU computing. Data-Scope provides extreme data analysis for Petabyte-scale datasets.
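
To put those figures in perspective, the time for one full sequential pass over the stored data follows from dividing capacity by bandwidth. The sketch below is only a back-of-the-envelope estimate: it assumes decimal units (1 Petabyte = 10^15 bytes, 1 GByte = 10^9 bytes) and that the quoted peak bandwidth is sustained for the whole scan, neither of which is stated in the text. The function name is illustrative.

```python
# Rough estimate of a full sequential scan of the Data-Scope storage.
# Assumptions (not from the source): decimal units and sustained peak bandwidth.

PETABYTE = 10**15  # bytes, decimal convention (assumed)
GIGABYTE = 10**9   # bytes, decimal convention (assumed)


def scan_time_seconds(storage_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Return the time needed to read storage_bytes once at the given bandwidth."""
    if bandwidth_bytes_per_sec <= 0:
        raise ValueError("bandwidth must be positive")
    return storage_bytes / bandwidth_bytes_per_sec


storage = 5 * PETABYTE      # "approximately five Petabytes of storage"
bandwidth = 500 * GIGABYTE  # "close to 500 GBytes/sec"

seconds = scan_time_seconds(storage, bandwidth)
print(f"Full scan: {seconds:.0f} s (about {seconds / 3600:.1f} hours)")
```

Under these assumptions a complete pass over five Petabytes takes on the order of a few hours, which is what makes whole-dataset analyses practical at this scale.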

MRI: Development of Data-Scope – A Multi-Petabyte Generic Data Analysis Environment for Science.

Apply to use the Data-Scope
View Award at NSF
Today's data projects are helping scientists make amazing discoveries at an unprecedented rate. But in a world where we can't watch home movies from 20 years ago, how will tomorrow's scientists take advantage of these rich datasets?

Using the lessons we have learned from developing the Sloan Digital Sky Survey's SkyServer website, IDIES researchers are developing a flexible, reusable system for ensuring access to scientific data for decades to come. We will formalize the components of our new system through clean, usable APIs.

The main building blocks of the new system are:

  • Unified schema design and metadata management system for scientific datasets
  • Parallel transform-and-load environment for large databases
  • Object-oriented extensions as User-Defined Functions in a relational database
  • Extended spatial searches in two and three dimensions
  • A flexible web services framework
  • Collaborative user space next to large databases
  • Data management tools for the “long tail” of science

DIBBs (Data Infrastructure Building Blocks): Long Term Access to Large Scientific Data Sets–The SkyServer and Beyond

Visit SkyServer
View Award at NSF
IDIES researchers have used a publicly accessible 50 Terabyte database to resolve a fundamental paradox seen in magnetic fields. Their new discovery can help us understand the nature of solar flares and other types of "space weather" that pose a threat to communications satellites and electrical systems here on Earth.

The database, hosted by IDIES, contains the results of a detailed simulation of magnetohydrodynamic (MHD) turbulence. The paper announcing these results appeared in Nature. The lead author is IDIES member Gregory Eyink of the Department of Applied Mathematics and Statistics.

The team's analysis of the simulations showed that flux freezing is violated when the fluid is turbulent. That means that the breaking and reconnecting of magnetic field lines is possible on much shorter, more realistic time scales than would be possible under the classical flux freezing hypothesis. This discovery solves the "scale problem" by explaining how microscale mechanisms of flux freezing violations can be accelerated by turbulence, thus allowing them to reconnect enormous structures at scales orders of magnitude larger. The 50 Terabyte database that IDIES provided was critical to carrying out the study, since the time-irreversible nature of resistive MHD requires tracking magnetic field lines backward in time. The new discovery not only solves an important scientific problem but also shows the unique capabilities of the data-intensive research that IDIES enables.

Co-authors of the Nature article were Cristian Lalescu and Hussein Aluie, both from Applied Mathematics and Statistics (Aluie is also affiliated with Los Alamos National Lab); Kalin Kanov and Randal Burns, from the Department of Computer Science; Charles Meneveau, from Mechanical Engineering; and Alexander Szalay, from Physics and Astronomy. The co-authors from other institutions were Ethan Vishniac, from the University of Saskatchewan, Canada, and Kai Bürger, from the Technische Universität München, Munich, Germany.

Figure 2 of the paper shows a representative snapshot of the Ohmic electric field in the simulation. Animation courtesy of Dr. Kai Bürger, Technische Universität München, Germany.

Funding for the research was provided by National Science Foundation grant CDI-II: CMMI 0941530, and the database infrastructure was funded by NSF grant OCI-108849 and by the Johns Hopkins Institute for Data Intensive Engineering and Science. Support was also provided by Microsoft Research.


View News Release
The goal of the 100 Gig Connectivity project is to establish a high-speed data overlay research network across Johns Hopkins. The network will connect six locations across the University with multiple 10G connections, aggregated into a single 100G outgoing line to the Mid-Atlantic Crossroads, and beyond to the TeraGrid and Internet2.

Datasets throughout the research community are already pushing the limits of a 10G connection, and researchers need a more efficient network to transfer these datasets to observation instruments such as the Data-Scope. Demonstrating the ability to move Petabytes of data and analyze them in a timely fashion would encourage others to follow and would change the way we approach large data problems in all areas of science. This high-speed connectivity will enable JHU and its partners to move Petabyte-scale data sets, and will enable the wider scientific community to tackle cutting-edge problems ranging from traditional HPC and CFD to studies of the connectivity of the Internet.
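
The gap between a 10G and a 100G link for Petabyte-scale transfers can be estimated with simple arithmetic. The sketch below is an idealized calculation, not a measurement of the JHU network: it assumes decimal units (1 Petabyte = 10^15 bytes, 10G = 10^10 bits per second) and lets a hypothetical `efficiency` parameter stand in for protocol overhead and contention, which the text does not quantify.

```python
# Idealized transfer time for a dataset over a network link.
# Assumptions (not from the source): decimal units; `efficiency` is a
# hypothetical knob for the fraction of line rate actually achieved.

BITS_PER_BYTE = 8
PETABYTE = 10**15  # bytes, decimal convention (assumed)


def transfer_time_seconds(size_bytes: float, link_gbps: float,
                          efficiency: float = 1.0) -> float:
    """Return seconds needed to move size_bytes over a link_gbps link."""
    if link_gbps <= 0:
        raise ValueError("link speed must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    usable_bits_per_sec = link_gbps * 10**9 * efficiency
    return size_bytes * BITS_PER_BYTE / usable_bits_per_sec


for gbps in (10, 100):
    seconds = transfer_time_seconds(PETABYTE, gbps)
    print(f"1 PB over {gbps}G: {seconds / 86400:.1f} days")
```

At full line rate, one Petabyte needs more than nine days on a single 10G connection and a little under one day on a 100G line; real transfers would take longer once overhead is included.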

Developing novel theoretical methods and algorithms for clustering massive datasets with applications to astronomy, neuroscience and natural language processing.

Go to Project Website

Offering a postdoctoral fellowship program funded by the Moore and Keck Foundations for six postdocs working on innovative projects related to data-intensive computing.