Datasets - The Institute for Data-Intensive Engineering and Science

The Johns Hopkins Institute for Data-Intensive Engineering and Science (IDIES) hosts many petabytes of rich scientific data that can be used to answer questions in a variety of scientific domains. We are proud to offer simple, free online access to these datasets, so that students, instructors, and citizen scientists can work with the same data as cutting-edge researchers, using many of the same tools.

This list describes some of the datasets that IDIES provides, sorted by science domain.

Astronomy

Sloan Digital Sky Survey

The Sloan Digital Sky Survey (SDSS) is an ongoing project to make a map of the Universe. Using telescopes in the United States and Chile, the SDSS has taken images of more than 900 million sky objects and spectra of more than five million. The SDSS publishes a data release approximately every 18 to 24 months; the most recent is Data Release 16 (DR16), released in December 2019.

IDIES is the primary repository for SDSS data. We host catalog data (i.e., parameters measured from images and spectra or calculated from other catalog parameters) for all SDSS data releases up to and including DR16, as well as images and spectra for all data releases up to and including DR9, stored as FITS files at various levels of processing. We also provide two additional datasets: Stripe82 contains all photometric data for the repeat observations of the SDSS supernova survey, while RunsDB contains all photometric data for all SDSS observations, including overlap areas.

Project website: www.sdss.org

Gaia

Gaia is a satellite operated by the European Space Agency that is measuring the distances and stellar properties of more than a billion stars across the Milky Way. IDIES hosts the full catalog dataset for Gaia Data Release 2, available as a searchable database context through CasJobs.

Project website: https://sci.esa.int/web/gaia/

Fluid Dynamics

Johns Hopkins Turbulence Databases

IDIES hosts the Johns Hopkins Turbulence Databases, a set of direct numerical simulations of hydrodynamic turbulence in a variety of settings. The output of these simulations (more than 700 TB) can be queried online or accessed through a set of web-service-based scripts, which let you include the data in your own analyses as easily as you could with a local dataset.

Project website: http://turbulence.pha.jhu.edu/

Oceanography

Johns Hopkins Ocean General Circulation Models

IDIES hosts the results of a set of high-resolution ocean General Circulation Models (GCMs) that allow researchers to investigate the dynamics of ocean circulation at many scales in space and time. The full output of all the models is stored as a Data Volume in SciServer Compute, where it can be analyzed with our team’s OceanSpy Python package.

Project website: https://poseidon.idies.jhu.edu/

Genomics

Recount2: Analysis-ready RNA-sequencing gene and exon counts

The Recount2 datasets contain high-level data for more than 70,000 published human RNA samples, allowing researchers to study gene expression on an unprecedented scale. The datasets include genome coverage, gene counts, and exon counts. All the results included in Recount2 are available as a Data Volume in SciServer Compute, and can be analyzed inside an associated computing environment (image) preinstalled with the Bioconductor R package.

Project website: https://jhubiostatistics.shinyapps.io/recount/