[View PDF]

 

John Wu, Arie Shoshani (LBNL)

Problem

  • Quickly find records satisfying user-specified conditions from a large, complex data set
  • Example: High-energy physics data – from billions of events find collision events with a given energy level and having a specified number of tracks

Solution

  • Developed new indexing techniques and a new compression method for the indexes, achieved 10-100 fold speedup compared with existing methods
  • Efficient software implementation: available open source from http://sdm.lbl.gov/fastbit/ (>10,000 downloads), received a R&D 100 Award

Impact

  • Gene Context Analysis in IMG used to time-out when comparing 5 or more organisms; with FastBit technology, the hardest version of this problem requires no more than 10 seconds
  • Searched through trillion-particle data set from an astronomy application in seconds: “This is the first time anyone has ever queried and visualized 3D particle datasets of this size.” -- Homa Karimabadi, Physicist from UCSD
  • Testimonial “FastBit is at least 10x, in many situations 100x, faster than current commercial database technologies” -- Senior Software Engineer, Yahoo! Inc

Wu.LBNL.DM.Indexing-fig1

Notes:

Algorithms developed:

  • new compression method: Word-Aligned Hybrid (WAH) compression, (1) WAH compressed equality index is 14X faster than a similar indexing method in ORACLE database system, (2) analysis shows that WAH compressed index is theoretically optimal, which places it among the best of database indexing methods
  • new bitmap indexing methods: multi-level bitmap indexes, particularly, the two-level interval-equality index and the range-equality index are not much larger than the commonly used equality index, but can answer queries as much as 10X faster (both in theory and in practice)
  • new binning techniques for bitmap indexes: binning the data to a lower resolution, augment bitmaps with order-preserving bin-based clustering (OrBiC).  Since most of the user queries use constants of relatively lower precision, commonly one- or two-digit precision, such as 1,000,000 (=106, which has 1-digit precision) and 25,000 (=2.5x104, which has 2-digit precision), the indexes with one- or two-digit precision can answer these queries accurately.  Even when the binned index is not able to give a precise answer, the OrBiC data structure can significantly reduce the amount of data needed to be accessed in order to answer a query

Applications:

  1. Gene Context Analysis: This is a unique analysis tool for determining functions of genes based on the context (formalized as gene cassettes), it is the 3rd strategy of determining gene functions after chemical experiment and similarity search.  A target organism is typically compared against a number of reference organisms.  The time required for GCA grows quickly with the number of reference organisms.  Previously, IMG performs this operations using ORACLE database system.  The GUI front-end frequently times out when more than 5 organisms are used as references.  The new FastBit based technique can answer the most complex version of the query in less than 10 seconds on a workstation.  The result has been published at SSDBM 2013 <http://dx.doi.org/10.1145/2484838.2484856>.  This use case could be described as “solving a unsolvable problem in 10 seconds.”
  2. Space weather (astronomy application): FastBit was used to index and query more than 1 trillion particles <http://dl.acm.org/citation.cfm?id=2389077>
  3. Laser Wakefield Particle Accelerator: Our work with this group is more recent, FastBit is used as the back-end of a visual analytics engine for identifying and tracking particles.  This capability is used by the application scientists for identifying the coherent groups of accelerated particles (which are the desired output from such an accelerator) and track their formation process.  Details of the work can be found in Proc. SciDAC 2009 and SC08.
  4. Combustion data analysis: in this application, the objects of interest are ignition kernels, i.e., the regions where burning started.  The application scientists first explore different ways of defining these ignition kernels.  In this process, it is critical to be able to find ignition kernels quickly and provide preliminary statistics about them.  FastBit’s efficient search capability is critical in this process.  Additionally, we worked on grouping the points in space efficiently into connected regions.
  5. FastBit has been used in a number of commercial applications.  For example, Jochen Schlosser and Matthias Rarey developed a new virtual drug screening tool named TrixX-BMI, and showed that it can screen libraries of ligands 140--250 times faster than the state of art screening tools (see page 18 of their presentation at ACS Fall 2007 meeting, full article in J. Chem. Inf. Model., Article ASAP, DOI: 10.1021/ci9000212).  TrixX-BMI is to be marketed by BioSolveIT of Germany.  Yahoo! Inc. is also using FastBit to handle some advertisement related data.  The quote in the last bullet is from a Senior Software Engineer working on the project.

Abbreviated time line of FastBit development. Here are some explanation of the items listed

  1. 1999: start research work
  2. 2001: 1st publication of WAH compression shown it significantly improve the speed of answering queries
  3. 2004: WAH patent granted by US patent office
  4. 2006: Query Driven Visualization began to take shape with FastBit as the backbone.  This builds upon a number of earlier collaborations between the Scientific Data Management group and the Visualization group of LBNL.  This year we also published a major theoretical analysis of the compression algorithm (WAH) used in FastBit.
  5. 2007: FastBit officially released to the public in August of 2007.  A Germany company BioSolveIT started to explore using it for a new technique for molecular docking, a virtual screening tool used by many pharmaceutical companies for select useful molecule for developing new drugs.
  6. 2008: Won R&D 100 Award.  Yahoo! Research started exploring using FastBit for management of advertisements.  The quote was from a engineer on this project.