2021 Annual Symposium - The Institute for Data Intensive Engineering and Science

Schedule

Thursday October 21st

1:00 PM ET: Opening remarks (Alex Szalay)

Alex Szalay
Director
Institute for Data-Intensive Engineering and Science (IDIES)
Johns Hopkins University

At IDIES, faculty and students work together to solve amazing data-intensive problems, from genes to galaxies, including new projects in materials science, and urban planning. Over the last few years, our members have successfully collaborated on many proposals related to Big Data, and we have hired new faculty members, all working on different aspects of data-driven discoveries. Together, we are successful in building a collection of unique, large data sets, giving JHU a strategic advantage in making new discoveries.

COVID has left a mark on our ability to work together, but we quickly adjusted to Zoom meetings. IDIES had a significant role in helping the COVID-related genomic projects by standing up a GPU infrastructure in a matter of a few hours.

Our collaborations across the many schools of Johns Hopkins have intensified and now include the School of Advanced International Studies, School of Education, the Mathematical Institute for Data Science (MINDS), the multi-institutional Paradim project in materials science (HEMI), the 21st Century Cities Initiative (21CC), cancer immunotherapy projects through the Bloomberg Kimmel Institute, and various large-scale studies in genomics and in medicine. Our technology provides the analytics engine at the heart of PMAP. More and more classes are using our SciServer as an interactive tool to support homework assignments in Python.

Over the past two years, IDIES has looked beyond JHU to establish new partnerships with outside organizations. We have active collaborations with the Lieber Institute for Brain Development and the Kennedy Krieger Institute. We are involved with various space science projects: collaborating with the Space Telescope Science Institute on the WFIRST Space Telescope Data Archive; providing the data system for a new European X-ray satellite, eRosita; and working with NASA’s Goddard Space Flight Center on hosting High Energy Astrophysics. Most recently, we started a new project with the National Institute of Standards and Technology (NIST) to build an aggregator for data from Puerto Rico about the effects of Hurricane Maria, and lately we have also been helping the National Additive Manufacturing Initiative. We are starting a new collaboration with Brookhaven National Laboratory.

We welcome the new members of our Executive Board: Paulette Clancy (WSE), Alex Baras (JHMI), Brian Caffo and Jeff Leek (BSPH). Postdocs and graduate students are working with IDIES faculty on AI-related projects, from materials science to astronomy and cancer biology. Machine learning, in particular Deep Learning, has revolutionized how industry handles Big Data. IDIES and MINDS have decided to work together towards these very important emerging goals, joining our expertise to lead to greater new discoveries. JHU is moving towards creating a major effort in Artificial Intelligence, the AI-X project. IDIES is working very closely with AI-X on its infrastructural aspects.

The NSF has funded a new multi-institutional effort in Turbulence. JHU is a co-lead in a new NSF-funded project to create the prototype for the National Science Data Fabric. The Open Storage Network project is gaining nationwide acceptance.

IDIES aims to accelerate, grow and become more relevant across the University by providing more intensive help in launching and sustaining data intensive projects in all disciplines. We seek new ideas and new directions but cannot do this alone: we need your help and initiative. Please send us your ideas, big or small, on how we can improve our engagement with your research community.

1:15 PM ET: Keynote Address (Judith Mitrani-Reiser)

Judith Mitrani-Reiser
Associate Chief, Materials and Structural Systems Division
National Institute of Standards and Technology (NIST)

Disaster Programs at the National Institute of Standards and Technology Strengthen National Resilience

Extreme events, such as tornados and fires, test buildings and infrastructure in ways and on a scale that cannot be easily replicated in a laboratory. Therefore, actual disasters and failure events provide important opportunities for scientists and engineers at the National Institute of Standards and Technology (NIST) to study these events, and improve the safety of buildings, their occupants, and emergency responders.

NIST has studied and investigated more than 50 earthquakes, hurricanes, building and construction failures, tornadoes, and fires since 1969 under several authorities. Current ongoing technical investigations include Hurricane Maria’s impacts on Puerto Rico and the Champlain Towers South building collapse in Surfside, Florida. The talk will provide an overview of the disaster research conducted at NIST in the Materials and Structural Systems Division informed by recommendations and national strategic plans developed by national disaster statutory programs: Disaster and Failure Studies (DFS) Program, National Earthquake Hazard Reduction Program (NEHRP), and National Windstorm Impact Reduction Program (NWIRP).

The talk will highlight a collaboration with Hopkins’ Institute for Data Intensive Engineering and Science (IDIES) scientists on the use of cyberinfrastructures that combine traditional data management and access services with computing resources. This collaboration leverages services provided by SciServer (such as storing and accessing large disaster data sets and server-side analysis using Jupyter notebooks) to make NIST disaster data discoverable, inform critical disaster response activities, and enable necessary collaborations across stakeholders. The talk will also provide an overview of the NIST Professional Research Experience Program (PREP) and how it enables the collaboration between NIST disaster scientists and IDIES data scientists.

About Dr. Mitrani-Reiser

Dr. Judith Mitrani-Reiser leads the NCST technical investigation of the collapse of the Champlain Towers South in Surfside, Florida, and the mortality project of the NCST investigation of Hurricane Maria’s impact on Puerto Rico. Judy’s responsibilities at NIST also extend to managing and providing oversight to two other disaster statutory programs, the National Windstorm Impact Reduction Program and the National Earthquake Hazard Reduction Program, which focus on interagency coordination to reduce losses in the U.S. from disasters and failures.

Judy is Vice President of the Earthquake Engineering Research Institute (EERI), serves on the Executive Committee of the U.S. Collaborative Reporting for Safer Structures (CROSS-US), co-founded the American Society of Civil Engineers’ (ASCE) Multi-Hazard Risk Mitigation Committee, and was elected to the Academy of Distinguished Alumni of UC Berkeley’s Civil and Environmental Engineering Department. Judy earned her B.S. from the University of Florida, M.S. from the University of California at Berkeley, and Ph.D. from Caltech.

1:55 PM ET: IDIES Seed Award Update (Nadia Zakamska)

Nadia Zakamska
Professor
Department of Physics and Astronomy
Johns Hopkins University

The Search for Elusive Progenitors of Type Ia Supernovae

Arguably one of the most enduring mysteries of modern astrophysics is that of the origin of type Ia supernovae, the cosmological standard candles that were used in the Nobel-prize-winning discovery of the accelerated expansion of the universe and are important in the evolution of chemical abundances of galaxies. The most likely scenario is that type Ia supernovae arise as a result of a merger of two white dwarfs — compact remnants of evolution of stars like our Sun.

Although the supernovae themselves are routinely observed as bright astronomical transients out to great distances, to date no binary white dwarfs on track to become type Ia supernovae have been identified. This is due in part to the extreme difficulty of finding such objects. A convincing discovery would require high-quality, time-series spectroscopy and excellent photometry and distance measurements — all for intrinsically faint stars — in order to prove that the binary of white dwarfs exceeds the necessary critical mass for the type Ia explosion and that it would merge in less than the lifetime of the universe.

The field is now on the verge of a major breakthrough. The 20-year-old Sloan Digital Sky Survey (SDSS), in which JHU has long been a major partner, has just entered its fifth phase, which will last about five years and will acquire high-quality spectra of 4-5 million stars across the Milky Way. In particular, the survey will obtain spectra of 200,000-300,000 white dwarfs. Furthermore, the European satellite Gaia, operating since 2013, has allowed unprecedented measurements of distances to millions of astronomical sources. The combination of data now emerging from SDSS-V and Gaia may enable the long-sought discovery of type Ia progenitors, but no existing analysis tools can quickly identify the subtle features of white dwarf binaries in this large volume of data.

Our group is exploring a variety of methods to identify white dwarf binaries over a wide range of masses and periods. One of the most promising emerging avenues is to detect rapid radial velocity variations. SDSS spectra are obtained as a series of 15-minute exposures which could be on the same night or on different nights.

If a white dwarf is in a binary system with a period of a few hours or less, its orbital velocity may noticeably change from one 15-minute exposure to another. The variations can be very subtle and especially difficult to identify because of the highly pressure-broadened absorption lines in white dwarf spectra.
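To give a sense of the size of the effect, the following is a minimal sketch (not the group's analysis code) that uses the textbook expression for the radial-velocity semi-amplitude of a circular binary and the largest velocity change that can occur between two exposures. The component masses of 0.6 solar masses and the edge-on inclination are illustrative assumptions; the 99-minute period is the one quoted later in this abstract.

```python
import math

G = 6.674e-11      # gravitational constant, m^3 kg^-1 s^-2
M_SUN = 1.989e30   # solar mass, kg


def rv_semi_amplitude(m1_msun, m2_msun, period_s, inclination_deg=90.0):
    """Radial-velocity semi-amplitude (km/s) of star 1 in a circular binary."""
    m1 = m1_msun * M_SUN
    m2 = m2_msun * M_SUN
    sin_i = math.sin(math.radians(inclination_deg))
    k1 = (2.0 * math.pi * G / period_s) ** (1.0 / 3.0) * m2 * sin_i
    k1 /= (m1 + m2) ** (2.0 / 3.0)
    return k1 / 1000.0  # m/s -> km/s


def max_rv_change(k_kms, period_s, dt_s):
    """Largest possible change of K*sin(2*pi*t/P) over a time step dt."""
    return 2.0 * k_kms * abs(math.sin(math.pi * dt_s / period_s))


if __name__ == "__main__":
    period = 99 * 60    # 99-minute orbit, in seconds
    exposure = 15 * 60  # spacing of consecutive SDSS exposures, in seconds
    k = rv_semi_amplitude(0.6, 0.6, period)  # assumed masses, edge-on orbit
    print(f"semi-amplitude: {k:.0f} km/s")
    print(f"max change between exposures: {max_rv_change(k, period, exposure):.0f} km/s")
```

For longer periods or lower inclinations the shift between consecutive exposures shrinks quickly, which is consistent with the abstract's point that the variations can be subtle.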

We are developing a variety of modern statistical techniques to identify these subtle clues and therefore the most promising candidate white dwarf binaries. This program is already bearing fruit, and our paper on the discovery of a 99-minute white dwarf binary (Chandra et al. 2021) recently became the first scientific result of SDSS-V. With the unprecedented amount of white dwarf data from SDSS-V — an order of magnitude more than from the previous surveys — we are aiming to conduct a complete census of binary white dwarfs in the Solar neighborhood and to elucidate the nature of type Ia progenitors.

Reference

Chandra et al. 2021, Astrophysical Journal, in press, arXiv:2108.11968

2:15 PM ET: SciServer Update (Gerard Lemson)

Gerard Lemson
Associate Director for Science
Institute for Data-Intensive Engineering and Science
Johns Hopkins University

SciServer

SciServer (www.sciserver.org) is a collaborative science platform developed at IDIES that provides online storage and computational capabilities to scientists from a range of disciplines. SciServer supports traditional fields of astronomy, cosmology and fluid dynamics to disseminate results from large scale observational surveys as well as simulations, and smaller groups have used SciServer to support their efforts with dedicated storage and computational environments shared among their collaborators.

In this past year we explored how SciServer might interact with commercial cloud providers. Using AWS credits we investigated the use of cloud compute resources for analyzing data obtained from SciServer. Two example projects involved GPU nodes for machine learning on SDSS data and an Elastic Map Reduce node to replicate analysis on time series data from the Zwicky Transient Facility (ZTF). Further investigations will be performed using Azure and a grant from Microsoft.

Our collaboration with NIST is extended for another year. Using SciServer, we have started development of a predictor for damage caused by hurricanes prior to landfall.

A second NIST project aims to prepare SciServer for publishing the data from the AMBench 2022 project. This project will produce data sets from a large variety of detailed measurements of objects created using various methods of Additive Manufacturing. The purpose of these data sets is to provide benchmarks for modelers trying to create simulations of these processes. Whereas testing is performed on the NIST SciServer instance, the public release of the data is planned to use the SciServer at IDIES.

The JHU School of Medicine, Carnegie Institution, and Weill Cornell University used SciServer in their annual Practical Genomics Workshop this year. The four-day workshop consisted of virtual classes with hands-on sessions where the participants learned how to analyze single-cell RNA sequencing data using publicly available tools, such as R and Bioconductor. SciServer provided a homogeneous online computing environment for all participants, which removed many of the usual problems related to individual participants’ computing environments.

Separate instances of SciServer are now deployed at 4 locations outside of IDIES: Precision Medicine Analytics Platform (JHMI), MPE (Munich, Germany), NIST in the AWS cloud, and NAOJ (Japan). The SciServer platform at NAOJ supports the HSC-PFS (Hyper-Suprime Camera – Prime Focus Spectrograph) project for the SuMIRe (Subaru Measurements of Images and Redshifts) collaboration. This is a custom deployment of SciServer whereby a subset of the full complement of the SciServer tools is deployed using Kubernetes for container orchestration and automated deployment. The SciServer Login Portal, a customized Dashboard with traditional analysis tools developed at NAOJ, and Compute are the currently deployed components at NAOJ and supported by IDIES.

If anyone is interested in publishing their data sets through the SciServer platform, or in using it in the classroom, please feel free to contact Gerard Lemson (glemson1@jhu.edu), Science Director of IDIES and Project Manager of SciServer.

2:25 PM ET: IDIES Seed Awardee Update (Jonathan Ling)

Jonathan Ling
Assistant Professor of Pathology
Johns Hopkins University

Comprehensive Analysis of Public Sequencing Archives to Uncover Novel Mechanisms of Pathogenesis in Amyotrophic Lateral Sclerosis

Amyotrophic Lateral Sclerosis (ALS) is a fatal adult onset motor neuron disease characterized by progressive loss of upper and lower motor neurons. Over 5,000 Americans die from ALS each year and with few approved treatments, the average life expectancy for patients is only two to five years after diagnosis. There is an urgent need to identify the genetic and environmental factors that underlie the onset and progression of ALS.

Work over the past decade has revealed that the RNA-binding protein TDP-43 is central to the pathogenesis of ALS. In postmortem brain tissue from ALS patients, TDP-43 forms pathological aggregates outside of the nucleus, where the protein normally resides. In 2015, we discovered that mislocalization of TDP-43 leads to the incorporation of deleterious, cryptic exons that disrupt protein synthesis. Recent studies have further confirmed that cryptic splicing can induce the motor neuron loss observed in ALS.

However, the mechanisms that initiate TDP-43 aggregation and loss-of-function are poorly understood. Various genetic and environmental factors have been proposed to explain how TDP-43 mislocalization and aggregation can occur in ALS, but these studies remain largely inconclusive.

With support from the IDIES Seed Funding Initiative, we have begun to leverage the vast publicly available RNA sequencing archives to uncover datasets that exhibit TDP-43 cryptic splicing. Our goal is to identify datasets that have no prior connection to ALS, with the hope of revealing experimental manipulations and environmental effects that can induce TDP-43 loss-of-function. Such findings would indicate novel mechanistic insights into ALS pathogenesis.

Preliminary analyses have successfully identified several hundred RNA-Seq samples that exhibit TDP-43 cryptic exons, filtered from the over 300,000 samples available in recount3. Many of these samples are related to experimental manipulations of TDP-43 or samples from ALS, which helps to validate our approach. Of the samples that have no obvious relation to TDP-43 or ALS, we have begun to replicate these computational results using in vitro model systems. Interestingly, we also find that RNA-Seq datasets from a variety of cancers appear to exhibit cryptic exons when certain cellular pathways are disrupted. Further study is required to determine the relevance of cryptic exons in other human diseases beyond neurodegeneration.

Our cross-disciplinary effort to bridge neuropathology with big data science aims to answer questions that would otherwise remain inaccessible and to gain insights that may reveal novel therapeutic targets for preventing neurodegeneration.

2:45 PM ET: Break

2:55 PM ET: Keynote Address (K. T. Ramesh)

K. T. Ramesh
Alonzo G. Decker Jr. Professor of Science and Engineering
Johns Hopkins University

AI-X: Bringing Johns Hopkins Together through AI

We will discuss the broad outlines of a university-wide initiative in the AI space, considering processes, prospects, and priorities. The intent is to begin a conversation and encourage collaboration, while seeking input on opportunities and challenges.

About Dr. Ramesh

Dr. Ramesh is known for research in impact physics and the failure of materials under extreme conditions. Ramesh also is a professor in the Department of Mechanical Engineering, and holds joint appointments in the Department of Earth and Planetary Sciences and the Department of Materials Science and Engineering. He is the founding director of the Hopkins Extreme Materials Institute (HEMI), which addresses the ways in which people, structures and the planet interact with and respond to extreme environments. Ramesh’s current research focuses on the design of materials for extreme conditions, the massive failure of rocks and ceramics, impact processes in planetary science, and impact biomechanics. In one project, his lab is developing a detailed digital model of the human brain to help address how brain injury results from head impacts. Other current projects include the use of laser shock experiments to study the deformation and failure of protection materials for the U.S. Army, the use of data science approaches in materials design, the development of a hypervelocity facility for defense and space applications, and modeling the disruption of asteroids that could hit the Earth. He has written over 250 archival journal publications, and is the author of the book “Nanomaterials: Mechanics and Mechanisms.”

3:35 PM ET: Mark O. Robbins Prize (Paulette Clancy)

Mark Robbins Prize in High Performance Computing

In recognition of a cherished friend and contributor to ARCH, IDIES and JHU, the Robbins Prize was established in 2020 to recognize outstandingly talented PhD students who reflect Dr. Robbins’ contributions to computational science and engineering. The Robbins Prize is made possible thanks to generous donations from the Department of Chemical and Biomolecular Engineering, Hopkins Extreme Materials Institute (HEMI), the Institute for Data-Intensive Engineering and Science (IDIES), Department of Mechanical Engineering, and the Department of Physics and Astronomy.

Mark Robbins received his BA and MA degrees from Harvard University. He was a Churchill Fellow at Cambridge University, U.K., and received his PhD from the University of California, Berkeley. Dr. Robbins was a professor in Physics and Astronomy at JHU from 1986 until his untimely death in 2020. He was a renowned condensed matter and statistical physicist who played a key role in supporting the development of computational facilities at JHU, through his leadership for the Maryland Advanced Research Computing Center and the Institute for Data-Intensive Engineering and Science.

The 2021 Robbins Prize awardees are: Dr. Sai Pooja Mahajan (Future Faculty Award), Dr. Karthik Menon (PhD Award), and Dr. Andrew Ruttinger (PhD Award).

3:45 PM ET: Robbins Future Faculty Award (Sai Pooja Mahajan)

Sai Pooja Mahajan
Postdoctoral Researcher
Whiting School of Engineering
Johns Hopkins University

Towards Deep Learning Models for Target-Specific Antibody Design

Recent advances in machine learning, especially in deep learning (DL), have opened up new avenues in many areas of protein modeling. In the most recent Critical Assessment of Structure Prediction, a biennial community experiment to determine the state-of-the-art in protein structure prediction, DL-based methods accomplished unprecedented accuracy in the “difficult” targets category. Protein design is the inverse problem of protein structure prediction, i.e., the prediction of sequence given structure. Antibody design against an antigen of interest is a particularly challenging problem since it involves the design of the highly variable CDR loop regions to bind an antigen with reasonable affinity and specificity. In my talk, I will present some emerging DL-based methods to design proteins applied to the design of antibody CDR regions, and some of the early successes and important outstanding challenges in the use of current DL frameworks for protein, antibody and interface design.

4:15 PM ET: Robbins PhD Award (Karthik Menon)

Karthik Menon
Graduate Student
Johns Hopkins University

Computational and Data-Driven Analysis of Aeroelastic Flutter

The interaction of fluid flows with flexible and moving surfaces is a problem of wide applicability and exhibits highly non-linear responses of the fluid as well as the immersed surface. A particular source of complexity in these flows is the generation of several vortices, their interactions, and the non-linear forces they induce on immersed surfaces. In this talk, I will discuss our efforts in dissecting the flow physics of aeroelastically pitching wings using computational modeling and data-driven methods. I will demonstrate a novel energy-based tool to analyze, predict and control the often non-intuitive oscillation response of such systems. I will also describe data-driven techniques we have developed to analyze the vortex dynamics that drive the physics of such problems.

Posters

4:35 PM: Poster Madness (2-minute introduction to each poster)

4:45 PM: Poster Session

Data Science and Analytics for Esports (Agrawal)

Arjun Agrawal
Peddie School

The use of analytics in professional sports is widespread and rapidly increasing. Similarly, there is a need for analytics in the emerging area of esports, or professional video gaming. Counter-Strike: Global Offensive, also known as CS: GO, is one of the most popular esports with over forty million copies sold, yet it lacks analytics. This impedes simple and efficient evaluation of competitive CS: GO matches, player performance, and team performance, which is critical to teams, bettors, media, and fans. To this end, we introduce an analytics package consisting of (1) generalized functions to allow for the efficient filtering and aggregation of data; and (2) specialized functions to allow for the efficient calculation of CS: GO match statistics.

Optimized Visualization of Large Scale Ocean Circulation Simulations (Connolly)

Thomas Connolly
Carnegie Mellon University

The purpose of this project was to apply image interpolation techniques to develop a dynamic tile-server visualization for a large oceanographic dataset. Poseidon will be a large ocean circulation simulation that will model a number of ocean properties at 1.5 km resolution over the whole globe. The final dataset will be about 2.5 petabytes, so rapid visualizations that do not require preprocessing of data could be invaluable in effective analysis of the simulation data. The method of rendering images directly from the dataset relies on building interpolator objects that track which points in the dataset are relevant to the interpolation of pixel values at each resolution. Interpolators are built by transforming data points into the visualization’s Web Mercator projection and resampling at each resolution to calculate pixel interpolation weights. There are three stages to the optimization of this technique. First, data points not used in the interpolation of ocean pixel values, e.g. interior land data points, are masked out of the interpolator objects. Second, interpolators are broken into tiles at each resolution to allow single image tiles to be dynamically computed at each resolution. Third, the data point indices in each interpolator are reindexed into the coordinates of the original dataset to allow data points to be accessed directly without intervening coordinate transformations. Applying these interpolators to a chunked zarr data structure allows for very efficient dynamic image rendering, especially for low level tiles. The visualization rendering could be further improved by optimizing the zarr chunk size of the dataset to match the minimum interpolator size, although this could slow down other large-scale operations on the data.
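The projection step described above can be illustrated with the standard Web Mercator tiling formulas. This is a minimal sketch of that general scheme, not the poster's implementation; the function names are hypothetical, and the choice of 256-pixel tiles is an assumption commonly used by web map viewers.

```python
import math


def lonlat_to_tile_xy(lon_deg, lat_deg, zoom):
    """Map longitude/latitude (degrees) to fractional Web Mercator tile coordinates.

    At zoom level z the world is covered by 2**z by 2**z tiles; x grows
    eastward from the antimeridian and y grows southward from the top edge.
    """
    n = 2 ** zoom
    lat = math.radians(lat_deg)
    x = (lon_deg + 180.0) / 360.0 * n
    y = (1.0 - math.asinh(math.tan(lat)) / math.pi) / 2.0 * n
    return x, y


def tile_and_pixel(lon_deg, lat_deg, zoom, tile_size=256):
    """Return (tile_x, tile_y, pixel_x, pixel_y) for a data point."""
    x, y = lonlat_to_tile_xy(lon_deg, lat_deg, zoom)
    tx, ty = int(math.floor(x)), int(math.floor(y))
    px = int((x - tx) * tile_size)
    py = int((y - ty) * tile_size)
    return tx, ty, px, py


if __name__ == "__main__":
    # A point in the North Atlantic at zoom level 3.
    print(tile_and_pixel(-40.0, 35.0, 3))
```

Grouping data points by the tile they fall in at each zoom level is one way to decide which points a per-tile interpolator needs to track.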

OpenMSI: A Materials Semantic Infrastructure for Streaming Data Integration in Accelerated Materials Development (Elbert)

David Elbert
JHU Krieger School of Arts and Sciences

The OpenMSI project establishes a new paradigm for the materials discovery and design loop centered on flow of data rather than individual modeling or experimental tasks. Integrated data flow accelerates the iterative materials research process and provides a framework for Materials Genome Initiative science. OpenMSI centers on an open semantic infrastructure and streaming data platform based on Apache Kafka to integrate the processing, experimental, and modeling components of materials design. The openmsipython package (https://github.com/openmsi/openmsipython/) extends Apache Kafka for serialization/deserialization and chunking of large files and efficient streaming. Asynchronous data producers and consumers run as services or daemons on laboratory control nodes providing a loosely coupled architecture that is efficient to deploy, maintain, and extend.
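The idea of chunking a large file into messages that a consumer can reassemble can be sketched with the standard library alone. The example below is a hypothetical illustration and does not use or reproduce the openmsipython API; the `Chunk` fields, the function names, and the use of SHA-256 checksums are all assumptions made for the sketch.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One message-sized piece of a file (hypothetical message layout)."""
    file_name: str
    index: int
    total: int
    payload: bytes
    sha256: str


def split_into_chunks(file_name, data, chunk_size=1024):
    """Split a byte string into ordered, checksummed chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = max(1, -(-len(data) // chunk_size))  # ceiling division
    chunks = []
    for i in range(total):
        payload = data[i * chunk_size:(i + 1) * chunk_size]
        digest = hashlib.sha256(payload).hexdigest()
        chunks.append(Chunk(file_name, i, total, payload, digest))
    return chunks


def reassemble(chunks):
    """Rebuild the original bytes from chunks received in any order."""
    ordered = sorted(chunks, key=lambda c: c.index)
    if not ordered:
        raise ValueError("no chunks received")
    if [c.index for c in ordered] != list(range(ordered[0].total)):
        raise ValueError("missing or duplicated chunks")
    for c in ordered:
        if hashlib.sha256(c.payload).hexdigest() != c.sha256:
            raise ValueError(f"checksum mismatch in chunk {c.index}")
    return b"".join(c.payload for c in ordered)


if __name__ == "__main__":
    original = bytes(range(256)) * 10
    pieces = split_into_chunks("trace.bin", original, chunk_size=1000)
    assert reassemble(list(reversed(pieces))) == original
```

Carrying the index, total count, and checksum in every message is what lets producers and consumers stay loosely coupled: a consumer can verify and rebuild a file without any other coordination with the producer.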

Infrastructure development is embedded in a science program focused on creating aluminum alloys resistant to spall failure in high-energy environments. Such alloys have high value in aircraft and spacecraft, while understanding the underlying mechanism of failure has broad scientific importance in understanding ultimate material strength. Experimental characterization utilizes laser-induced shock waves to drive foil micro-flyers to impact samples and effect spall failure. Micro-flyer velocities are captured with photon Doppler velocimetry (PDV) in one-dimensional FFT traces of interferometry and include our first data streamed through OpenMSI services. PDV data streams are asynchronously consumed to automate analysis and provide rapid feedback during repeated shock experiments.

Exploring the Outskirts of Nearby Dwarf Galaxies with Blue Stars (Filion)

Carrie Filion
JHU Krieger School of Arts and Sciences

The identification and study of stars far from the center of their host galaxy enables exploration of that galaxy’s total mass and current dynamical state. It is difficult to identify candidate member stars in a wide field-of-view around a given galaxy, as the density contrast between the galaxy’s member stars and non-member Milky Way stars decreases as a function of distance from the center of the galaxy. We have developed a technique to find candidate member stars of nearby dwarf galaxies, and we present a demonstration of this technique on the Bootes I ultra-faint dwarf galaxy, a satellite of the Milky Way Galaxy.

Optimal Area Monitoring: Line-of-Sight Viewsheds in Parallel (Gu)

Peter Gu
JHU Applied Physics Laboratory

Fixed towers with sensors can be important assets for monitoring a large land area, but obstructions and terrain can make it hard to calculate and optimize the combined effectiveness of a set of towers before construction. Presented here is a method that uses the line of sight and probability of detection of each tower and yields one number to objectively measure their performance: number of distinct viable paths from a start line to an end line. This number can then be used as a function for optimization of tower placements.
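As a small illustration of the line-of-sight ingredient, the sketch below checks whether terrain blocks the straight sight line between a tower and a target along a sampled elevation profile. It is a simplified stand-in, not the method presented in the poster: the evenly spaced profile, the function name, and the neglect of Earth curvature and detection probability are all assumptions.

```python
def has_line_of_sight(profile, observer_height=0.0, target_height=0.0):
    """Return True if no terrain sample rises above the sight line.

    profile: terrain elevations sampled at equal spacing, starting at the
             observer's location and ending at the target's location.
    observer_height, target_height: heights above the local terrain.
    """
    n = len(profile)
    if n < 2:
        return True
    start = profile[0] + observer_height
    end = profile[-1] + target_height
    for i in range(1, n - 1):
        # Height of the straight sight line above sample i.
        sight = start + (end - start) * i / (n - 1)
        if profile[i] > sight:
            return False
    return True


if __name__ == "__main__":
    flat = [0.0, 0.0, 0.0, 0.0]
    ridge = [0.0, 2.0, 30.0, 0.0]
    print(has_line_of_sight(flat, observer_height=10.0))   # open terrain
    print(has_line_of_sight(ridge, observer_height=10.0))  # ridge in the way
```

Running such a check from each tower to every terrain cell yields a viewshed, and each cell's check is independent of the others, which is what makes the computation easy to parallelize.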

Access to Credit in Majority-Black Neighborhoods and Welfare Implications (McComas)

Mac McComas
JHU 21st Century Cities Initiative

Achieving the American dream – the freedom of opportunity and the ability to improve one’s economic wellbeing by investing in education, real estate, and entrepreneurship – requires capital. But, in the United States, access to capital for individuals and business owners is uneven based on race. In 2019, the median net worth of a typical white household, $188,200, was 7.8 times greater than that of a typical Black household, $24,100 (Bhutta et al., 2020). Most houses are bought with a mortgage and most businesses rely on credit to fund their expansion. And place matters more than ever as the overall geographic mobility of Americans is declining (Molloy et al., 2017). This working paper examines access to credit in majority-Black communities and examines several private and public sector levers that could improve access. We examine the role that detailed microgeographic data could have on increasing the scrutiny of financial institutions under the 1977 Community Reinvestment Act, the role of the public and private sector in investing in Minority Depository Institutions, and the role of increasing access to and capacity of Community Development Financial Institutions.

Take Me Out to Big Data: A Free Online Database of Baseball Events (Raddick)

M. Jordan Raddick
JHU Institute for Data-Intensive Engineering and Science

Physics-Informed Deep Learning in Astrophysics (Wei)

Viska Wei
JHU Krieger School of Arts and Sciences

Modern machine learning is becoming increasingly important in science. Machine learning, in particular Deep Learning, is emerging as a promising way to overcome this barrier. In science we need to understand and estimate the statistical significance of our derived results, and there is skepticism towards ‘black box’ techniques. For data with large dimensions, the networks can get quite large, making training slow and cumbersome. As a result, serious attention is being given to Physics-Informed Machine Learning: how we can use prior knowledge about underlying symmetries, geometric and physical properties of the data to simplify network design.

Interactive Visualization of Large Ocean Circulation Simulations (Wen)

Brian Wen
JHU Krieger School of Arts and Sciences

The Poseidon project will generate a 1.5 km resolution, 1-year simulation of ocean circulation for the whole Earth, generating 2.5 petabytes of data. We need to visualize this data, consisting of many quantities (salinity, temperature, velocities, vorticity) at different depths and resolutions, and preferably create animated time-lapse views on the fly. The simulations are done in a complex, warped coordinate frame, the LLC4320 grid. Precomputing web-formatted images of the whole data requires essentially doubling the storage. The only way to accomplish the visualization without that overhead is to create a dynamic visualization, where the simulation data is rendered on the fly, as needed. We have developed a pyramid of precomputed interpolator objects which can take image content from the different layers and scalars. We use barycentric interpolation, a common technique used to compute scalar field values over an irregular geometric grid. Using this as our basis, we have implemented various levels of optimizations, such as interpolation masks, in order to speed up the visualization of large-scale oceanographic data. Our visualization interface uses precomputed images for the first few zoom levels, then renders images dynamically as the zoom level increases. The resulting rendering time increases quadratically with respect to the size of the underlying raw data tile, but as these tiles get smaller as we zoom in, rendering gets faster and faster. We are currently exploring the optimal tradeoff in the crossover point between precomputed and dynamic images.
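Barycentric interpolation itself is compact enough to show directly. The following is a minimal sketch of the general technique for a single triangle in the plane, not the project's interpolator objects; the function names and the sample values are hypothetical.

```python
def barycentric_weights(p, a, b, c):
    """Barycentric weights of point p with respect to triangle (a, b, c).

    All points are (x, y) pairs. The three weights sum to 1, and all are
    non-negative exactly when p lies inside or on the triangle.
    """
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if det == 0:
        raise ValueError("degenerate triangle")
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    wc = 1.0 - wa - wb
    return wa, wb, wc


def interpolate(p, vertices, values):
    """Interpolate a scalar field at p from values at the triangle's vertices."""
    wa, wb, wc = barycentric_weights(p, *vertices)
    return wa * values[0] + wb * values[1] + wc * values[2]


if __name__ == "__main__":
    triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    temperatures = [3.0, 6.0, 9.0]  # made-up values at the three grid points
    print(interpolate((0.25, 0.25), triangle, temperatures))
```

Because the weights depend only on the geometry of the grid and the pixel position, they can be computed once and reused for every scalar field and time step, which is the property that makes precomputed interpolators attractive.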

Friday October 22nd

1:00 PM ET: Opening remarks (Alex Baras and Paul Nagy)

Alex Baras
Director, Pathology Informatics
School of Medicine
Johns Hopkins University

Paul Nagy
Deputy Director
Johns Hopkins Medicine Technology Innovation Center

1:15 PM ET: Keynote Address (Rebecca Lindsey)

Rebecca Lindsey
Staff Scientist
Materials Science Division
Lawrence Livermore National Laboratory

Data Science and Machine Learning for Materials Under Extreme Conditions

For the past several decades, physics-driven advances in experimental and simulation capabilities have served as the primary force enabling improved understanding of material evolution under extreme conditions. The pace of advancement has largely been limited by the complicated, highly dynamic, and inherently multiscaled nature of this phenomenon. However, data-driven approaches are providing a path forward. From high-throughput experiments to improved physical and reduced-order models, this new paradigm has had a transformative effect on research in the physical sciences. In this presentation, I will discuss two broad machine learning efforts that aim to improve our understanding of the microscopic phenomena governing material evolution under extreme conditions via atomistic simulations and enable prediction of age-related changes in material performance from experimentally derived characterization data.

About Dr. Lindsey

Dr. Lindsey is a Staff Scientist in the Materials Science Division at Lawrence Livermore National Laboratory. Her research focuses on understanding how materials evolve under extreme conditions, and how aging impacts their performance. She leverages machine learning and data science to generate next-generation reactive interatomic models enabling quantum-accurate simulation of phenomena including shockwave-driven nanocarbon formation, and to build diagnostic models for complex systems capable of predicting device performance from materials characterization data. Her efforts are underpinned by a strong interest in developing tools enabling work in previously inaccessible problem spaces. Dr. Lindsey’s work, which has had implications for nanomaterial fabrication, civil engineering, defense applications, and possible origins for life, was recently recognized through an LLNL Physical and Life Sciences Directorate Research Award.

1:55 PM ET: IDIES Seed Awardee update (Thomas Lippincott)

Thomas Lippincott

Assistant Research Professor
Whiting School of Engineering
Johns Hopkins University

Developing Datasets and Infrastructure to Facilitate Translating Humanistic Data and Hypotheses into Computational Inquiry

The initial phase of our research has focused on assembling suitable datasets for the specific subdomains of interest in English literature and the Ancient Near East, working with faculty and graduate students to refine the humanistic questions to be asked, and focusing our list of engineering goals to maximize impact for this and future collaborations.

With the Department of English, we have assembled a corpus of approximately 60,000 documents from the 16th through 19th centuries published in England, Scotland, and Ireland. Our goal is to consider how extracted linguistic patterns reflect the evolving English attitudes towards Ireland, and how this evolution aligns with major events (e.g. acts of Parliament, insurrections) and across time/space. We are starting with simple statistical tests targeting specific terminology and contexts annotated by graduate students.

With the Department of Near East Studies, we are testing, and potentially refining or expanding, a hypothesis regarding the scribal hands responsible for the El Amarna Letters. This cache of tablets, written in the Cuneiform script, contains correspondence between the Egyptian Pharaoh Akhenaten and client kingdoms in the years preceding the Bronze Age Collapse. By encoding the existing hypothesis, linking it to images and transliterations, and gathering expert annotation from graduate students focused on this area, we set the stage for bringing techniques from natural language processing and computer vision to bear on questions typically answered via close manual scrutiny.

In response to the common needs of these applied studies, we are focusing engineering efforts on extending the Turkle annotation framework to perform annotation of images, combined temporal and geographic visualization of arbitrary features, and a flexible web editor to guide humanists in specifying and validating descriptions of their domains. These tasks are all against the backdrop of adopting JSON-LD as the canonical underlying format for linked data, and deploying a production-quality server under the JHU domain that will facilitate access and consolidate the public-facing aspects of our research.
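
To make the JSON-LD format mentioned above more concrete, here is a minimal sketch of what a linked-data record for one annotated tablet could look like, built and serialized with Python's standard library. Only the `@context`, `@id`, and `@type` keywords come from the JSON-LD specification. Every URL, term, and value below is a hypothetical placeholder, not the project's actual schema.

```python
import json

# Hypothetical record: all identifiers and vocabulary terms are invented for illustration.
annotation = {
    "@context": {
        "schema": "http://schema.org/",
        "name": "schema:name",
        # Declaring "@type": "@id" marks these values as links to other resources.
        "image": {"@id": "schema:image", "@type": "@id"},
        "scribe": {"@id": "http://example.org/vocab#scribe", "@type": "@id"},
    },
    "@id": "http://example.org/tablets/tablet-001",
    "@type": "schema:CreativeWork",
    "name": "Hypothetical tablet record",
    "image": "http://example.org/images/tablet-001.jpg",
    "scribe": "http://example.org/scribes/scribe-A",
}

# A JSON-LD document is plain JSON, so ordinary tooling can store and reload it.
serialized = json.dumps(annotation, indent=2, sort_keys=True)
restored = json.loads(serialized)
```

Because the record stays ordinary JSON, an annotation tool and a web editor could exchange it without special libraries, while the context still lets linked-data tools resolve each term to a full identifier.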

2:15 PM ET: ARCH Update (Jaime Combariza)

Jaime Combariza

Director
Advanced Research Computing at Hopkins (ARCH)

Shared CyberInfrastructure for Advanced Computing at Hopkins

Advanced computing, integrating traditional High Performance Computing (HPC), Data Intensive (DI) computing, and Artificial Intelligence (AI, machine and deep learning) to enable fast-growing, challenging projects in science and engineering founded on data-driven research, has become a priority for funding agencies. In the last two years, the National Science Foundation (NSF) has invested over $100M to deploy several powerful Advanced Computing systems that provide breakthrough cyberinfrastructure capabilities. In order to remain competitive, academic institutions must plan for these constant new developments in advancing computational research.

JHU is implementing new models to provide, maintain and sustain a state-of-the-art facility. For an essential core facility of this magnitude, no single entity’s financial infrastructure can be relied upon to support refresh cycles every three to five years. Thus arose the need for a communal support system.

A three-pronged plan (as described in Figure 1) relying on the contributions of separate yet interdependent support groups is being implemented with great success. The HPC core facility is supported by the University through each of its schools, by research groups that join forces and apply for large grants (MRI, MURI, DURIP), and by individual researchers who add small condos to better serve their computational needs. The results are community-shared resources, guaranteed sustainability and the creation of a core facility with enough capacity to enable desired research. It is important to stress the “shared-resource” model, which allows any one group to use additional resources above their individual contribution. Additionally, JHU deanery are contributing to the annual operation of the facility, minimizing costs for JHU faculty in exchange for ‘sharing’ resources.

Currently the new cluster, “Rockfish”, is growing fast: doubling the compute capacity and number of cores in a single year. Following the diagram described in Figure 1, the large grant contribution is composed of an NSF MRI grant that provided 388 compute nodes and 10 PB of storage. This grant was also used to provide the shared infrastructure that will house other condos. Another large grant from DoD (DURIP) added 74 compute nodes, for a total of 462 nodes. The second contribution (first quarter 2022), provided by the JHU deans, will add 120 compute nodes plus 4 PB of storage. The final circle of contributions consists of over 150 compute nodes from 26 research groups and is expected to continue to expand over the next year. Rockfish will have over 730 compute nodes, 3.4 PFLOPS theoretical peak and about 2.2 PFLOPS sustained peak. This increased computational power makes resources at Hopkins comparable to those of peer institutions across the nation.
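
The node counts quoted above can be checked with a short tally. This is only a sketch of the arithmetic: the dictionary keys are labels chosen for this example, and the condo figure is treated as a lower bound because the text says "over 150".

```python
# Compute-node contributions as quoted in the text (labels are illustrative).
node_contributions = {
    "NSF MRI grant": 388,
    "DoD DURIP grant": 74,
    "JHU deans (first quarter 2022)": 120,
    "research group condos (lower bound)": 150,
}

# The two large grants together: 388 + 74 = 462 nodes.
large_grant_nodes = (
    node_contributions["NSF MRI grant"] + node_contributions["DoD DURIP grant"]
)

# All three prongs together: 462 + 120 + 150 = 732, consistent with "over 730".
total_nodes_lower_bound = sum(node_contributions.values())

# Ratio of the quoted sustained peak to the quoted theoretical peak (both in PFLOPS).
sustained_fraction = 2.2 / 3.4
```

The last ratio suggests that the quoted sustained figure is roughly 65% of the theoretical peak.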

The new shared cluster, Rockfish, will have three sets of compute nodes: 680 regular memory compute nodes with 48 cores per node, 192 GB RAM and a local NVMe SSD with 1 TB capacity; a set of 27 nodes with 48 cores and 1.5 TB of memory; and, finally, a set of 19 GPU nodes featuring the newest Nvidia technology (Ampere A100). GPU nodes will have either 2 or 4 A100 40 GB GPUs. All nodes and storage are connected via 100 Gbps InfiniBand, allowing fast I/O and internode communication for parallel jobs. Rockfish has a parallel file system with 16 PB of storage and was placed in production in March 2021. Rockfish also supports two other important groups: Morgan State University, as a partner on the MRI, and the national community through the distribution of 20% of the computing resources via XSEDE.

The success of this three-pronged approach continues to ensure the future of HPC at Hopkins. We would like to invite all current university research groups, regardless of their current research computing model, to contribute to this endeavor by procuring funding to add their own condos.

2:25 PM ET: Break

2:35 PM ET: Keynote Address (Ryan Abernathey)

Ryan Abernathey
Associate Professor of Earth and Environmental Sciences
Data Science Institute
Columbia University

Pangeo: A Model for Cloud Native Scientific Research

As a result of advances in remote sensing and computer simulation, geoscientists are now regularly confronted with massive datasets (many TB to PB). While such datasets have great potential to move science forward, they require a new approach to data sharing and computing infrastructure. The Pangeo Project aims to empower geoscientists to work painlessly with such datasets using open source software and infrastructure. In this talk, I will describe the architectures and best practices that have emerged from this project which form a foundation for future “cloud-native” science. These include the use of object storage for building analysis-ready, cloud-native data repositories, data-proximate computing with Jupyter, and on-demand scale-out distributed computing with Dask. I will demonstrate these tools in action with real science workflows from oceanography and climate science. I’ll also discuss some technical and social challenges our project is facing as we try to transition from promising prototypes to sustainable infrastructure for our field.

About Dr. Abernathey

Dr. Abernathey is a physical oceanographer who studies large-scale ocean circulation and its relationship with Earth’s climate. He received his Ph.D. from MIT in 2012 and did a postdoc at Scripps Institution of Oceanography. He has received an Alfred P. Sloan Research Fellowship in Ocean Sciences, an NSF CAREER award, The Oceanography Society Early Career Award, and the AGU Falkenberg Award. He is a member of the NASA Surface Water and Ocean Topography (SWOT) science team and Director of Data and Computing for a new NSF Science and Technology Center called Learning the Earth with Artificial Intelligence and Physics (LEAP). Prof. Abernathey is an active participant in and advocate for open source software, open data, and reproducible science.

3:15 PM ET: Invited Talk (Janis Taube)

Janis Taube
Director
Division of Dermatopathology
School of Medicine
Johns Hopkins University

Astropath: Mapping Cancer as if it were the Universe

About Dr. Taube

Dr. Janis Taube is a professor of dermatology and pathology at the Johns Hopkins University School of Medicine and a member of the Johns Hopkins Kimmel Cancer Center. Her area of clinical expertise is dermatopathology. Dr. Taube serves as the Director of the Division of Dermatopathology and as the Assistant Director of the Dermatoimmunology Laboratory at the School of Medicine.

Dr. Taube received her undergraduate degree in engineering from Duke University. She earned her M.D. from Tulane University and her M.Sc. in molecular medicine from University College London. She completed her residency in pathology at Johns Hopkins where she also served as the chief resident, before undertaking a dermatopathology fellowship at Stanford University. In 2009, Dr. Taube returned to Johns Hopkins for her certification in the Melanoma Clinic.

She is one of the lead scientific researchers in the Department of Dermatology at Johns Hopkins. Her research is related to the study of the B7-H1 molecule. Dr. Taube and her team are seeking to identify the signaling mechanisms behind B7-H1 expression.

She is a member of the College of American Pathologists, the United States and Canadian Academy of Pathology, the American Society of Dermatopathology and the Dermatology Foundation.

3:45 PM ET: IDIES Student Fellow Awards

The IDIES Summer Student Fellowship program invites JHU undergraduate students to submit a proposal for a 10-week summer research project with a focus on data science and guidance from an IDIES faculty mentor. These projects are meant to provide an opportunity for students to participate in a full-time project focused on data science, and to encourage further interest in research while rounding out their undergraduate experience. In addition to a data-intensive computing focus, student projects must: relate to the IDIES mission, encompass the potential to advance knowledge, challenge and seek to shift current research/practice by utilizing novel concepts, approaches or methodologies, and benefit society and contribute to the achievement of specific, desired societal outcomes.

3:50 PM ET: IDIES Student Fellow Update (Jaxon Wu)

Jaxon Wu
Johns Hopkins University

Humanizing Our Data: Proposal on Integrating Social and Behavioral Determinants of Health into Population Health Analytics

The aims of my project can be defined as follows: first, I sought to explore the prevalence of social determinants of health (SDoH) needs in administrative claims, electronic health record (EHR), and combined claims-EHR data of a Medicaid population and extract the most widely documented ICD-10 Z-codes; second, I assessed the impact of ICD-10 social needs on healthcare utilization (narrowly defined as emergency department (ED) visits and inpatient hospitalizations) and healthcare expenditures (measured as total healthcare, pharmacy, and medical costs) of our Medicaid population. For each regression, we ran five different models, each with a different combination of independent variables. The base model contained only our covariates. Models 1 and 2 added social needs markers, and models 3 and 4 added social needs domains; within each pair, one model used ACG count and the other used ACG scores.
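
One way to read the five model specifications described above is as a base model plus every pairing of a social needs representation with an ACG measure. The sketch below enumerates them under that assumption. The variable names are placeholders invented for this example, and the assignment of ACG count to the odd-numbered models is a guess, not something the abstract states.

```python
from itertools import product

# Placeholder names: the real covariates and variable labels are not given in the abstract.
BASE_COVARIATES = ["covariates"]
SOCIAL_NEEDS = ["social_needs_markers", "social_needs_domains"]
ACG_MEASURES = ["acg_count", "acg_score"]


def build_model_specs():
    """Return a mapping from model name to its list of independent variables."""
    specs = {"base": list(BASE_COVARIATES)}
    pairings = product(SOCIAL_NEEDS, ACG_MEASURES)
    for number, (social, acg) in enumerate(pairings, start=1):
        # Each numbered model extends the base covariates with one pairing.
        specs[f"model_{number}"] = BASE_COVARIATES + [social, acg]
    return specs


model_specs = build_model_specs()
```

Each outcome (ED visits, inpatient hospitalizations, and the cost measures) would then be regressed once per entry in `model_specs`.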

4:00 PM ET: IDIES Student Fellow Update (Shengwei Zhang)

Shengwei Zhang
Johns Hopkins University

Using Machine Learning to Predict Surgical Case Duration in Operating Room Scheduling Optimization

Operating rooms (ORs) are the most expensive and financially productive resource in a hospital, and any disruption in their workflow can have a detrimental effect on the rest of the hospital operations. The goal of this project is to develop various machine learning models that can effectively predict surgical case duration and to compare the predictive power of these models to surgeons’ empirical estimates. For the all-inclusive model, comparing the results of the three modeling algorithms shows that Random Forest and XGBoost have better predictive capability than Linear Regression, and that XGBoost runs much faster than Random Forest. For the service-specific model, the same comparison shows similar prediction accuracy across many OR services. The service-specific model performs better than the all-inclusive model, but it also has some limitations. Since it only trains and tests on instances of a specific OR service, the data size is a huge confounder. In the dataset, 16 services have fewer than 100 instances, which makes Random Forest and XGBoost unsuitable for these services. In the all-inclusive model, by contrast, services are not treated separately, so the model performance is consistent for each service, regardless of the number of instances in the dataset.
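
The comparison between model predictions and surgeons' estimates needs a common error measure. The abstract does not say which one was used, so the sketch below assumes mean absolute error in minutes. The durations are invented toy numbers chosen only to exercise the function; they are not results from the study.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted durations."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy case durations in minutes (invented for illustration only).
actual_minutes = [60, 90, 120, 45]
model_minutes = [65, 85, 110, 50]
surgeon_minutes = [45, 120, 90, 60]

# The same metric is applied to both sets of estimates so they can be compared directly.
model_mae = mean_absolute_error(actual_minutes, model_minutes)
surgeon_mae = mean_absolute_error(actual_minutes, surgeon_minutes)
```

Computing the same metric per OR service, rather than over the whole dataset, is one way to see where a service-specific model helps and where small sample sizes make the comparison unreliable.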

4:10 PM ET: Closing Remarks (Ani Thakar)

Ani Thakar

Principal Research Scientist
Institute for Data-Intensive Engineering and Science (IDIES)
Johns Hopkins University

Closing Remarks

About Dr. Thakar

Ani Thakar is a Principal Research Scientist in the Physics & Astronomy Department, and Associate Director of Operations for IDIES. He is also Catalog Archive Scientist for the Sloan Digital Sky Survey and JHU Lead Scientist for the fifth phase of the Sloan Digital Sky Survey (SDSS-V).