The Data-Scope

Technology and scientific methodologies for working with data have advanced greatly over the last decade. Scientists have been simultaneously fortunate and unfortunate: these advances provide larger and more intricate datasets than ever, but problems arise in evaluating a dataset both as a whole and in its countless details.

Processing large datasets requires significant forethought. Traditional supercomputing models are not an ideal match for big data, so compromises or significantly over-complicated solutions are often necessary to achieve science goals. The Data-Scope endeavors to overcome the problems big data poses on traditional HPC systems by doing the following:

Storing the data local to the compute.

Large datasets are traditionally stored on communal “head” or “storage” nodes, which are shared across hundreds or thousands of nodes. Putting aside IO subsystem obstacles, the data also has to travel across high-speed InfiniBand or Ethernet networks to computational nodes that typically have less than a TB of slow local storage. Thus both the network and the slow disk become bottlenecks.

One-to-One mapping of users to nodes.

Providing users with their own nodes eliminates the problem of sharing large head node spaces with users who require different access patterns and data layouts. Providing exclusive access to a single project (or complementary projects) allows the data to be laid out in a fashion that is conducive to computation.

Leveraging GPUs for computation.

GPUs are a cost-effective and very fast solution to big data problems. A Fermi-generation NVIDIA GPU is capable of half a teraflop (double precision) and provides a much higher flop-per-dollar ratio.

Eliminating Bottlenecks.

The Data-Scope removes the network from the equation during computation on the data. The computational nodes are equipped with twenty-four 1 TB hard drives that are mapped one-to-one on the backplane, as well as four MLC SSDs. This means the design does not bottleneck on SAS expanders; sequential IO throughput scales with the aggregate bandwidth of the drives, up to the maximum capability of the host bus adapter.
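The scaling behavior described above can be sketched as a simple model: throughput grows linearly with the number of directly attached drives until it reaches the host bus adapter's ceiling. This is only an illustrative sketch. The function name and both bandwidth figures are hypothetical assumptions and are not taken from the Data-Scope specification.

```python
def aggregate_sequential_throughput(n_drives, per_drive_mb_s, hba_limit_mb_s):
    """Estimate sequential throughput (MB/s) for directly attached drives.

    With one-to-one drive mapping (no SAS expander in the path), the
    drives add up linearly until the host bus adapter (HBA) saturates.
    """
    if n_drives < 0:
        raise ValueError("n_drives must be non-negative")
    return min(n_drives * per_drive_mb_s, hba_limit_mb_s)


# Hypothetical figures for illustration only; neither value is a
# measured or published Data-Scope number.
PER_DRIVE_MB_S = 100.0
HBA_LIMIT_MB_S = 2000.0

if __name__ == "__main__":
    for n in (1, 8, 16, 24):
        mb_s = aggregate_sequential_throughput(n, PER_DRIVE_MB_S, HBA_LIMIT_MB_S)
        print(f"{n:2d} drives: {mb_s:.0f} MB/s")
```

Under these assumed numbers the model scales linearly up to 20 drives and is capped by the adapter beyond that; real figures depend on the drives and adapter actually installed.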

Application Process

The Data-Scope Allocation Committee will review proposals on demand as they are submitted, and the overall usage of the machine will be evaluated and reported quarterly. Please use the form on the right to contact Data-Scope administrators.

Usage Policies

The Data-Scope is intended to provide a data-intensive analysis capability for Big Data problems. As such, the majority of users will run projects of finite duration, typically 3 to 6 months, and leverage Data-Scope’s unique properties: fast I/O with SSDs or high computing density with GPUs. Proposals that use Data-Scope as a compute facility alone will be redirected to other JHU resources, such as the Homewood High-Performance Computing Cluster (HHPC) or the GPU Laboratory.

Additional Notes

Resident Services
It is expected that a small fraction of the machine will be used to run long-standing services. The Institute for Data-Intensive Science and Engineering (IDIES) runs many such services, including SDSS.org, The Turbulence Project, and the Open Connectome Project. Proposals to host such services will be considered, but this will always be a secondary usage of the machine.

Long-Term Storage
Permanent and backed-up storage may be available for projects that are long-term or generate data products that the Investigators cannot easily retrieve. Projects that wish to use long-term storage will be assessed a one-time chargeback that will cover the acquisition and deployment of the storage, increasing the capacity of Data-Scope commensurately. Ask us about backup charges. These rates will be determined at the time of proposal, but we expect them not to exceed $100 per terabyte.
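For budgeting purposes, the upper bound quoted above can be turned into a quick estimate. The sketch below assumes that the $100 per terabyte ceiling applies to the one-time storage chargeback, which the text does not state explicitly; the function name is hypothetical, and the actual rate is set at the time of proposal.

```python
# Ceiling quoted in the policy text; assumed here to apply to the
# one-time long-term storage chargeback. Actual rates may be lower.
MAX_RATE_USD_PER_TB = 100


def max_storage_chargeback(terabytes, rate_usd_per_tb=MAX_RATE_USD_PER_TB):
    """Return an upper-bound estimate (USD) for long-term storage."""
    if terabytes < 0:
        raise ValueError("terabytes must be non-negative")
    return terabytes * rate_usd_per_tb


if __name__ == "__main__":
    requested_tb = 50  # hypothetical project requirement
    print(f"Budget at most ${max_storage_chargeback(requested_tb):,} "
          f"for {requested_tb} TB")
```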

The Data-Scope project was funded by a grant from the National Science Foundation. Intel provided CPUs valued at $294K. NVIDIA donated 60 Tesla C2070 cards.

Application Instructions

Researchers interested in using the Data-Scope instrument should submit a short (1-2 page) PDF document that addresses the following points:

Describe the scientific importance of the computation.

What computation/analysis will be performed?

What are the size and format of the input and output data?
Describe the code/software to be executed. Does it need to be customized for the Data-Scope?

How many and what types of Data-Scope resources do you require?

Do you need Windows or Unix?
Do you need GPUs? How many per node (0-2)?
Do you need SSDs? How many per node (max 12)?

What are your storage requirements?

How much scratch storage will the computation use?
How much long-term storage will be needed, and for how long?

Data Handling

How will you ingest data into the system? Over the network, Internet2, or via Sneakernet (by shipping disk drives)?
How will you retrieve results and to what ultimate destination?
Have you optimized your data layout? If yes, please describe how the data are arranged.

Provide a timeline for use of the machine

Initial deployment (small scale to develop and test codes in the Data-Scope environment)
Full-scale deployment (to perform analysis)
Destaging period (to remove results from the machine and deallocate resources)

Please contact Tara Engel (tengel@jhu.edu) if you experience technical difficulties completing your submission or are in need of further clarification regarding the requirements.