FastBit
  FastBit Front Page Research Publications Software Documentation Software Download Software License  

Organization: LBNL » CRD » SDM » FastBit » Documentation » Indexing Options

Options for building bitmap indexes

The process of building a bitmap index is conceptually divided into three steps, binning, encoding, and compressing. The options for each of these steps are controlled separately. The options string has the following format

index=<binning ... /> <encoding ... /> <compressing ... />

NOTE: all keywords, such as binning, encoding and compressing, should be written lower case.

NOTE: do not leave any space between keywords and the equal sign (=) following them.

Binning Options

The basic idea of bitmap index is to generate one bitmap to represent the presence of each distinct value. This strategy is generally regarded as efficient for attributes with low cardinalities. For attributes with high cardinalities, one way to reduce the number of bitmaps is to bin the values and produce one bitmap for each bin. The value of <binning /> option can be one of the following where the separator can be either space or comma and no space is allowed around equal signs
• default
The default binning option is to generate bins with 2-digit precision bin boundaries, which is equivalent to <binning precision=2 />.
NOTE: leaving binning option blank (<binning />) is equivalent to this default binning option.
• none
Don't use bins, produce one bitmap for each distinct attribute value.
• [nbins=xxx, ]binFile="xxx"
Read the bin boundaries from the named file. If nbins is not specified, all the values are read. If nbins is specified, read nbins of the values unless the end-of-file is reached.
NOTE: The file name should be quoted. Unquoted file name can not have any special character such as '/', '>', ',', ' ' and ';'.
• nbins=xxx, equal-weight
Generate equal-weight bins, i.e., each bin would have the same number of records (or as close to equal as possible). The number of bins is specified by nbins. This options will cause the data to be scanned twice once for generating the histogram and once to actually build the index. To allow the software to sample the data instead of building the full histogram, omit 'equal-weight'.
• precision=dd
Generate bins corresponding to the reduced precision floating-point numbers. The bin boundaries will have dd significant digits. Any queries involving constants with no more than dd significant digits will be answered precisely with this index. Queries involving constants with more than dd significant digits will not be full answered with the index. FastBit will have to scan the base data to fully resolve them.
• start=xxx, end=xxx, nbins=xxx, scale=linear|log|simple
Dividing the range [start, end] into nbins either using log scale or linear scale. If scale is not specified, it is assumed to be linear, unless start and end are specified and are both positive and start < end/nbins. In which case, scale is assumed to be logarithm. This assumption is reasonable since the user has gone through the trouble of specifying a small start value which otherwise can more conveniently be expressed as zero.
If either start or end is missing, we scan the data once to determine the actual minimum and maximum value. If end is not specified, it will be the maximum value, if the start is not specified, it will be the minimum value. For integer variables, if (end-start) < nbins, each integer value would be placed in its own bin. Two additional bins are also generated, one for values less than start and the other for values larger than end.
Following C/C++ convention, the range [start, end] includes start but excludes end. Each bin also has the same convention, i.e., it includes the left boundary but excludes the right boundary. The only exception is the left most bin which is defined to be everything less than start. All fields in the format are optional. If nbins is not specified, it is assumed to be 10,000.
If the 'scale' option is not specified, the current software samples about 100 data points per bin and generates a set of equal-weight bins based on the sampled data. To get the basic equal-width bins, "scale=linear" must be specified.
The values of scale field is parsed as follows. If the first letter after the equal sign is 'l' or 'L', the value is assumed to either 'linear' or 'log'. If the second letter is 'o' or 'O', it will be treated as 'log', otherwise, it is assumed to be 'linear'. If 'scale=' is present, but the letter following the equal sign is not 'l', then the option is assumed to be 'scale=simple'. With 'linear' and 'log' options, the current software attempts to generate bin boundaries that have short decimal point representation. The option 'simple' tell the software to generate a simpler equal-width bin boundaries without attempting to reduce the decimal representation of the bin boundaries.
Note: NO space is allowed around the equal sign. The keywords 'start', 'end', 'nbins' and 'scale' must be spelled as shown here. The parser for dealing with bin specifications is very limited!
• (start, end, nbins, scale)(start, end, nbins, scale)...
This form is a generalization of the above one. For each range [start, end), it generates nbins if possible. The values start and end must be specified. The default value for nbins in this case is one (1) and the default value for scale is linear.
It is possible to neglect the keywords in this form, i.e., there is not need to say (start=x1, end=x2, nbins=n1, scale=linear), but simply (x1, x2, n1, linear). The restriction is that the four fields must appear in the above order.

When binning is used, it is possible to also produce a set of file containing reordered values (according to the bin number) so as to reduce the time required for candidate check. Add 'reorder' to the binning specification to invoke this option. At this point, reorder only works with 1-component equality encoding and 1-component range encoding.

Bitmap Encoding Options

After the values are divided into bins, the next step is to encode values as bitmaps. The simplest encoding scheme is to have one bitmap for each bin. In this case, a bit value of one (1) indicates an entry is in a specified bin. The current bitmap indexing software supports a number of different encoding schemes.
• equality
The equality encoding. The basic bitmap indexing scheme proposed is a bitmap index with no binning and equality encoding.
• range
This encoding scheme uses the same number of bitmaps as the equality encoding, but is designed to make range queries more efficiently at the cost of more disk storage. With the range encoding most of the bitmaps have enough 1s in them to make them unlikely to be compressed. On large cardinality columns, a range-encoded index could take up a lot more space than the base data or the equality-encoded index. This encoding method is only recommended for columns with low column cardinalities.
• interval
This encoding scheme uses about half as many bitmaps as equality encoding, and is also efficient for range queries. An interval-encoded index takes about half as much space as the range encoded index, however, it may still be many times larger than the corresponding equality-encoded index. This encoding method is only recommended for columns with low column cardinalities.
• ncomp=ddd
The above three encoding schemes can take ncomp as the secondary option to indicate the number of components in the encoding. When this option is not specified, the number of components is taken to be one. The maximum number of components is the logarithm of the number of bins, which leads to binary encoding. However, the special implementation of the binary encoding has further optimization to take advantage of the fact that every component has basis size of two.
• equality-equality
• interval-equality
• range-equality
Two-level bitmap encodings. Use the equality encoding at the fine level to control the index size; use a coarse level to reduce the number of bitmaps accessed to answer a query.

Compression Options

Only WAH compression is supported in the public release of FastBit. All bitmaps are compressed at construction time. Options are provided to uncompress some or all of them. Supported options are
• uncompressDenseBitmaps [density > ddd]
Uncompress those bitmaps with bit density greater than the specified value. If a density is not specified, it is assumed to be 1/8, i.e., if more than one in eight bits are one, then the bitmap is uncompressed. In some cases, decompressing dense bitmaps may improve overall query processing speed.
• uncompressLargeBitmaps [compration > ddd]
Uncompress those bitmaps with compression ratio greater than the specified value. If the compression value is not specified, it is assumed to be 0.75, i.e., if the compressed bitmap takes more than three quarters of the space of the uncompressed one, uncompress it.
• uncompressAll
Uncompress all bitmaps. Used only for debugging purposes.
It is usually a good idea to leave out the compression option and let the software compress all the bitmaps.

Additional options

There is a set of options in the form of 'index=xxx', most of which should be considered deprecated. However, some of them are very useful. The following is a short list.
• index=none
DO NOT use any index.
• index=basic
Equivalent to index=<binning none/>.
• index=binary
The binary encoding scheme. In this case, a bin number is represented in a binary form, and each bit of the binary value is placed in a bitmap. This encoding scheme uses the minimal number of bitmaps.
• index=bit-slice
This is a special version of the binary encoding for unsigned integers. This version directly uses the binary digits of the base data. This version closely resembles the bit-slice index proposed by Patrick O'Neil.
• index=bak2 [precision=p]
• index=bak [precision=p]
Generate bins where each bin contains the values that rounds the same value. For example, if the precision is 2 (the default value if the precision is not specified), then the values in each bin will have the same 2-digit decimal value after rounding. These bins can be generated with one-pass through the data and are useful in data with a wide range of distributions.
The difference between 'bak' and 'bak2' is that 'bak2' splits each bin produced by 'bak' in two so that the 2-digit decimal value itself is a bin boundary. Normally, one should use 'bak2' rather than 'bak'.
This option is superseded by <binning precision=d/>.
• index=bin/bin
• index=bin/range
• index=range/bin
• index=range/range
These four options are left over from an attempt to generate multi-level bitmap indexes. They are two-level schemes that uses different encoding schemes with each level.
Use <encoding equality-equality/>, <encoding interval-equality/> or <encoding range-equality/> instead.

Options for building a keyword index

A keyword index may be built for a string column (not for categorical values). Currently, this keyword index is built from a .tdlist file that contains term-document matrix in the form of
term: docid, docid, docid
Since the document identifiers (docid) are provided as values only, FastBit needs to know the column name of the document ids to correctly translate the tdlist into bitmaps. The recommended way to specify the name of document identifier is to use an index entry in file -part.txt, such as
index=keywords, docidname=mid
Note: the document identifiers must be 4-byte integers. If the docidname field is not specified, the integer values in the term-document list are taken to be the row numbers (starting from 0). This alternative option can be potentially faster because it requires less internal lookups to map the ids to bitmaps, but it is less flexible. For example, if you reorder the data or delete some rows, the term-document list will be wrong.

Alternatively, the docidname may be specified in an RC file (say, ibis.rc) in the form of

table-name.column-name.docIDName=mid

Note about unbinned indexes: If the explicit minimum and maximum values are known, FastBit IBIS implementation may be able to avoid storing the distinct values. This feature is implemented through ibis::direkte and is invoked through ibis::index::create.

Some Suggestions

  1. The first option to try might be to just leave it blank or not specified any indexing options at all. This invokes the default option in FastBit, which has been tuned to some extend and can work reasonably well. Of course, as you get to know you data and queries better, you might want to specify something a little more intelligent.

  2. The key parameter to consider when deciding what indexing option to use is the column cardinality, which we will simply call cardinality in the following.

  3. If the cardinality of a column is no more than 100, the binning option should be none. The following two might be reasonable choices.
    <binning none/><encoding interval/>
    <binning none/><encoding equality/>
    
    The main trade-off between them is that the interval encoding is better suited for range conditions, while the equality encoding is better suited for equality conditions. If you know your query conditions, then you can pick the one or the other. However, if you don't what type of query conditions are going to be used, use interval encoding is a reasonable choice. On the other hand, if you are always going to deal with one-sided range conditions, such as "A < 5" or "A >= 10", then it is best to use the range encoding
    <binning none/><encoding range/>
    
    In short, if you have a situation where the cardinality is very low, you can choose an encoding based on your query workload.
  4. If the cardinality is less than say 1 million (more precisely, N/10, where N is the number of rows), our recommendation is
    <binning none/><encoding interval-equality/>
    
    This option has been proven to be quite effective in tests and analyses.
  5. If the cardinality is higher than 1 million, we recommend
    <binning none><encoding binary/>
    
    This produces what is known as the bit-sliced index. Alternatively, you may bin the values, a situation we will considered next.
  6. For floating-point values, it is most likely you don't know exactly what is the column cardinality. In this case, it is natural to assume the cardinality is very high, and bin the values. Next, we discuss two reasonable approaches to binning, one is to specify the number of bins, and the other to specify the precision of the bins. Of course it is also possible to tell FastBit exactly where to put the bin boundaries as indicated above.

    If you specify the number of bins, e.g.,

    <binning nbins=2000/><encoding interval-equality/>
    
    then FastBit builds a set of approximate equal-weight bins, i.e., each bin will contain approximate the same number of rows. This minimize the worst case behavior during query answering. This is appropriate if you don't know what the query conditions are going to be like. However, if you do know something about the query conditions, for example, the precision of the constants used in the query expressions, then the next option is better.

    If the constants in the query conditions always have a relatively small number of significant digits, for example, the following query conditions use constants with 2 significant digits, "A > 1.5e6" or "A < 9.6E-3", in which case, the following binning specification will guarantee all such query conditions be answered using index only (therefore fast).

    <binning precision=2/><encoding interval-equality/>
    

 
If none of the above apply to your situation, please look through the archieves of FastBit mailing list or post your question to the mailing list.

Further Reading

A description of the binning, encoding and compression techniques used in FastBit is available in LBNL-59952. Additional research publications can be found in FastBit publication list. Technical details of the FastBit index implementation is fully documented in index.h.