K. Wu, S. Byna, A. Shoshani, in collaboration with LBNL Vis group

Key Ideas

  • Provide uniform array interface for scientific data in commonly used file formats, e.g., HDF5, ADIOS, and NetCDF
  • Provide efficient searching functionality on top of existing user analysis frameworks while expanding data handling capability and improving user productivity

Results

  • Public release of FastQuery software
  • Indexed and queried a trillion particle dataset for studying magnetic reconnection (or “space weather”)
    • Built index in 10 minutes
    • Located highly energetic particles in seconds
  • “This is the first time anyone has ever queried and visualized 3D particle datasets of this size.” -- Homa Karimabadi, Physicist from UCSD
  • Particle distribution of highly energetic particles around the region of magnetic reconnection. The off-centered and oblong distribution confirms the existence of a previously speculated property known as agyrotropy.

Figures: FastQuery API; Particle Distribution


 

Notes:

HDF5, ADIOS, and NetCDF are three of the most frequently used file formats among NERSC users, according to NERSC usage statistics.

FastQuery enables users to index and query their data files without putting the data into a database management system or loading it into another system.  Such systems typically produce a second copy of the data files.  Because scientific data sets are usually very large, avoiding the second copy can significantly reduce the cost of analysis operations.  FastQuery can be introduced incrementally into an existing data analysis workflow and seamlessly increase its analysis capability.  In many cases, as demonstrated in this use case involving magnetic reconnection simulation data, FastQuery provides a new way to access large data files.
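
To make the idea of indexing data in place more concrete, here is a minimal sketch of a binned bitmap index answering a range query such as "energy > threshold".  This is a toy illustration only, not FastQuery's or FastBit's actual API or on-disk format: the function names, the bin edges, and the sample values are all hypothetical, and a plain Python list stands in for an array read from a data file.

```python
from bisect import bisect_right


def build_bitmap_index(values, bin_edges):
    """Build one integer bitmap per bin; bit i is set when values[i] is in that bin.

    Hypothetical helper for illustration. Bin 0 holds v < bin_edges[0],
    bin k holds bin_edges[k-1] <= v < bin_edges[k], and the last bin
    holds v >= bin_edges[-1].
    """
    bitmaps = [0] * (len(bin_edges) + 1)
    for i, v in enumerate(values):
        bitmaps[bisect_right(bin_edges, v)] |= 1 << i
    return bitmaps


def query_greater_than(values, bin_edges, bitmaps, threshold):
    """Return the positions i with values[i] > threshold, using the index."""
    edge_bin = bisect_right(bin_edges, threshold)  # bin containing the threshold

    # Every bin strictly above edge_bin lies entirely above the threshold,
    # so its rows are hits without looking at the raw values.
    sure_hits = 0
    for b in range(edge_bin + 1, len(bitmaps)):
        sure_hits |= bitmaps[b]

    # Rows in the boundary bin are only candidates; check their raw values.
    candidates = bitmaps[edge_bin]
    return [
        i
        for i in range(len(values))
        if (sure_hits >> i) & 1
        or ((candidates >> i) & 1 and values[i] > threshold)
    ]


# Made-up sample data standing in for a particle energy array.
energies = [0.2, 1.7, 0.9, 2.4, 3.1, 1.1]
edges = [1.0, 2.0, 3.0]
index = build_bitmap_index(energies, edges)
hits = query_greater_than(energies, edges, index, 1.5)
print(hits)
```

Only the rows in the bin that straddles the threshold need their raw values read; all other rows are accepted or rejected from the bitmaps alone.  The index sits alongside the original array, so no second copy of the data is required.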

FastQuery developers are working directly with HDF5 developers to improve access to HDF5 files (through the ExaHDF5 project).  The particular example given on this slide uses HDF5.  We have modified the VPIC code to output HDF5 files, which significantly simplifies the files it produces.  A number of commonly used simulation codes, including S3D and Chombo, produce HDF5 output.

Magnetic reconnection, also known under other names such as X-point, magnetic portal, or electron diffusion region, is a key mechanism for creating the aurora in the polar regions.  It can also cause severe disruptions of satellite communication systems.  Studying magnetic reconnection is a critical part of understanding “Space Weather.”

Agyrotropy is postulated as an important measurable feature of magnetic reconnection.  A satellite mission designed to measure agyrotropy has been planned for 2014.  However, due to a lack of tools for handling the large amounts of particle data produced by simulations, this feature had not been observed in simulation data.  Without FastQuery, Homa Karimabadi's team was using summary statistics for their analyses, not the particle data itself.  The set of tools we have developed (together with the ExaHDF5 project) enabled the team to work directly with the particle data.  We are the first to provide direct evidence from simulations.

We are able to produce a FastBit index for more than 1 trillion particles in about 10 minutes.  This is an impressive speed because in 10 minutes some database systems may only index a few million data records.  We are able to locate the high-energy particles used for producing the agyrotropy plot in about 3 seconds.  Without proper indexes on the data, the data management system would have to read data at more than ten terabytes per second in order to complete the same task in the same amount of time.  The file system we used has a peak theoretical I/O rate of about 35 GB/s.  Clearly, indexing is essential in reducing the data access time.
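
The comparison above can be worked through as a back-of-the-envelope calculation.  The sketch below uses only the rounded figures quoted in these notes (3 seconds, ten terabytes per second, 35 GB/s) and assumes decimal units (1 TB = 1000 GB).  The resulting data volume is therefore a lower bound implied by those figures, not a measured dataset size.

```python
# Figures quoted in the notes (all approximate).
QUERY_SECONDS = 3          # time to locate the high-energy particles
SCAN_RATE_TB_PER_S = 10    # read rate a full scan would need ("more than ten")
PEAK_IO_GB_PER_S = 35      # peak theoretical I/O rate of the file system

GB_PER_TB = 1000           # assumption: decimal units

# Data volume a full scan would have to read (lower bound).
implied_data_tb = QUERY_SECONDS * SCAN_RATE_TB_PER_S

# Time to read that much data at the file system's peak rate.
full_scan_seconds = implied_data_tb * GB_PER_TB / PEAK_IO_GB_PER_S

# How much longer a best-case full scan takes than the indexed query.
slowdown = full_scan_seconds / QUERY_SECONDS

print(f"implied data volume: at least {implied_data_tb} TB")
print(f"full scan at peak I/O: about {full_scan_seconds / 60:.1f} minutes")
print(f"full scan vs. indexed query: about {slowdown:.0f}x slower")
```

Under these assumptions, even a scan running at the file system's peak theoretical rate would take on the order of a quarter of an hour, compared with about 3 seconds for the indexed query.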