Modern scientific datasets present numerous data management and analysis
challenges. State-of-the-art index and query technologies such as
FastBit can significantly improve access to these datasets by
augmenting the user data with indexes and other secondary
information. However, these indexes assume the relational data
model, whereas scientific data generally follow the array data
model. To match the two data models, we design a generic
mapping mechanism and implement an efficient input and output
interface for reading and writing the data and their corresponding
indexes. To take advantage of emerging many-core architectures, we
also develop a thread-based parallel indexing strategy. This approach
complements our ongoing MPI-based parallelization efforts.
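
As an illustration of what mapping between the array and relational data models can look like, the following minimal sketch treats each element of a multidimensional array as one row of a single-column table and converts between row numbers and array coordinates. It assumes row-major storage; the variable name `energy`, the array shape, and the helper functions are hypothetical and are not taken from the software described here.

```python
def coords_to_row(coords, shape):
    """Map array coordinates to a row number, assuming row-major order."""
    row_id = 0
    for c, extent in zip(coords, shape):
        row_id = row_id * extent + c
    return row_id


def row_to_coords(row_id, shape):
    """Map a row number back to array coordinates (inverse of coords_to_row)."""
    coords = []
    for extent in reversed(shape):
        row_id, c = divmod(row_id, extent)
        coords.append(c)
    return tuple(reversed(coords))


# A hypothetical 2 x 3 array variable, stored flat in row-major order.
shape = (2, 3)
energy = [0.5, 1.7, 2.9,
          3.1, 0.2, 4.4]

# Relational view: one column ("energy"), one row per array element.
# A range condition selects row numbers, as a relational index would.
hits = [row for row, value in enumerate(energy) if value > 2.0]

# Array view: translate the selected rows back into array coordinates.
hit_coords = [row_to_coords(row, shape) for row in hits]
print(hit_coords)
```

A query in this sketch is answered in row-number space and only the final result is translated back into coordinates, so the index itself never needs to know the array shape.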
We demonstrate the flexibility of our software by applying it to two of
the most commonly used scientific data formats, HDF5 and NetCDF. We
present two case studies using data from a particle accelerator model
and a global climate model. We also conduct a detailed performance
study using these datasets. The results show that our software,
FastQuery, speeds up queries by a factor of 2.5 to 50 and reduces
indexing time by a factor of 16 on 24 cores.
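
The thread-based indexing strategy mentioned above can be pictured as splitting a variable into chunks and building one index per chunk concurrently. The sketch below shows only that structure and rests on several assumptions: the chunk size, bin boundaries, and function names are hypothetical, and plain lists of row numbers stand in for the compressed bitmaps that FastBit uses. Python threads are also constrained by the interpreter lock, so this code shows the decomposition rather than the reported speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def build_chunk_index(values, offset, bin_edges):
    """Build a small binned index for one chunk: bin number -> global row numbers."""
    index = {b: [] for b in range(len(bin_edges) + 1)}
    for local_row, value in enumerate(values):
        # The bin number is the count of bin edges at or below the value.
        b = sum(value >= edge for edge in bin_edges)
        index[b].append(offset + local_row)
    return index


def build_index_parallel(column, chunk_size, bin_edges, workers=4):
    """Index each chunk in its own task; results are returned in chunk order."""
    offsets = range(0, len(column), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(build_chunk_index, column[o:o + chunk_size], o, bin_edges)
            for o in offsets
        ]
        return [f.result() for f in futures]


# Hypothetical data: 8 values, indexed in 2 chunks of 4 with bins
# (-inf, 1.0), [1.0, 3.0), and [3.0, +inf).
column = [0.5, 1.7, 2.9, 3.1, 0.2, 4.4, 1.1, 3.9]
chunk_indexes = build_index_parallel(column, chunk_size=4, bin_edges=[1.0, 3.0])
print(chunk_indexes)
```

Because each chunk is indexed independently, the tasks share no mutable state and need no locking; the same decomposition could in principle be distributed across processes instead of threads.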