Parallel Index and Query for Large Scale Data Analysis

Jerry Chou, Kesheng Wu, Oliver Ruebel, Mark Howison, Ji Qiang, Prabhat, Brian Austin, E. Wes Bethel, Rob D. Ryne, Arie Shoshani


Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies are critical for facilitating interactive exploration of large datasets, but numerous challenges remain in terms of designing a system for process- ing general scientific datasets. The system needs to be able to run on distributed multi-core platforms, efficiently utilize underlying I/O infrastructure, and scale to massive datasets.
We present FastQuery, a novel software framework that address these challenges. FastQuery utilizes a state-of-the-art index and query technology (FastBit) and is designed to process mas- sive datasets on modern supercomputing platforms. We apply FastQuery to processing of a massive 50TB dataset generated by a large scale accelerator modeling code. We demonstrate the scalability of the tool to 11,520 cores. Motivated by the scientific need to search for interesting particles in this dataset, we use our framework to reduce search time from hours to tens of seconds.

full text of LBNL-5317E (PDF)

NERSC blog
FastQuery multi-thread version at Cluster 2011
FastQuery MPI version at SSDBM 2011

More research work by John Wu
Bitmap Index
Connected Component Labeling
Eigenvalue Computation
Inforamtion available elsewhere on the web
Google Scholar
Contact us

John Wu