|FastBit Front Page||Research Publications||Software Documentation||Software Download||Software License|
Organization: LBNL » CRD » SDM » FastBit » Documentation » Indexing Options
The process of building a bitmap index is conceptually divided into three steps, binning, encoding, and compressing. The options for each of these steps are controlled separately. The options string has the following format
index=<binning ... /> <encoding ... /> <compressing ... />
NOTE: all keywords, such as binning, encoding and compressing, should be written lower case.
NOTE: do not leave any space between keywords and the equal sign (=) following them.
<binning />option can be one of the following where the separator can be either space or comma and no space is allowed around equal signs
<binning precision=2 />.
<binning />) is equivalent to this default binning option.
nbins. This options will cause the data to be scanned twice once for generating the histogram and once to actually build the index. To allow the software to sample the data instead of building the full histogram, omit '
ddsignificant digits. Any queries involving constants with no more than
ddsignificant digits will be answered precisely with this index. Queries involving constants with more than
ddsignificant digits will not be full answered with the index. FastBit will have to scan the base data to fully resolve them.
start=xxx, end=xxx, nbins=xxx, scale=linear|log|simple
nbinseither using log scale or linear scale. If scale is not specified, it is assumed to be linear, unless start and end are specified and are both positive and
start < end/nbins. In which case, scale is assumed to be logarithm. This assumption is reasonable since the user has gone through the trouble of specifying a small start value which otherwise can more conveniently be expressed as zero.
endis missing, we scan the data once to determine the actual minimum and maximum value. If end is not specified, it will be the maximum value, if the start is not specified, it will be the minimum value. For integer variables, if
(end-start) < nbins, each integer value would be placed in its own bin. Two additional bins are also generated, one for values less than start and the other for values larger than end.
end. Each bin also has the same convention, i.e., it includes the left boundary but excludes the right boundary. The only exception is the left most bin which is defined to be everything less than
start. All fields in the format are optional. If
nbinsis not specified, it is assumed to be 10,000.
scale' option is not specified, the current software samples about 100 data points per bin and generates a set of equal-weight bins based on the sampled data. To get the basic equal-width bins, "
scale=linear" must be specified.
scalefield is parsed as follows. If the first letter after the equal sign is 'l' or 'L', the value is assumed to either 'linear' or 'log'. If the second letter is 'o' or 'O', it will be treated as 'log', otherwise, it is assumed to be 'linear'. If '
scale=' is present, but the letter following the equal sign is not 'l', then the option is assumed to be '
scale=simple'. With '
linear' and '
log' options, the current software attempts to generate bin boundaries that have short decimal point representation. The option '
simple' tell the software to generate a simpler equal-width bin boundaries without attempting to reduce the decimal representation of the bin boundaries.
nbins' and '
scale' must be spelled as shown here. The parser for dealing with bin specifications is very limited!
(start, end, nbins, scale)(start, end, nbins, scale)...
[start, end), it generates
nbinsif possible. The values
endmust be specified. The default value for
nbinsin this case is one (1) and the default value for
(start=x1, end=x2, nbins=n1, scale=linear), but simply
(x1, x2, n1, linear). The restriction is that the four fields must appear in the above order.
When binning is used, it is possible to also produce a set of file
containing reordered values (according to the bin number) so as to
reduce the time required for candidate check. Add
reorder' to the binning specification to invoke this
option. At this point, reorder only works with 1-component equality
encoding and 1-component range encoding.
ncompas the secondary option to indicate the number of components in the encoding. When this option is not specified, the number of components is taken to be one. The maximum number of components is the logarithm of the number of bins, which leads to binary encoding. However, the special implementation of the binary encoding has further optimization to take advantage of the fact that every component has basis size of two.
uncompressDenseBitmaps [density > ddd]
uncompressLargeBitmaps [compration > ddd]
bak' and '
bak2' is that '
bak2' splits each bin produced by '
bak' in two so that the 2-digit decimal value itself is a bin boundary. Normally, one should use '
bak2' rather than '
term: docid, docid, docidSince the document identifiers (docid) are provided as values only, FastBit needs to know the column name of the document ids to correctly translate the tdlist into bitmaps. The recommended way to specify the name of document identifier is to use an index entry in file
-part.txt, such as
index=keywords, docidname=midNote: the document identifiers must be 4-byte integers. If the docidname field is not specified, the integer values in the term-document list are taken to be the row numbers (starting from 0). This alternative option can be potentially faster because it requires less internal lookups to map the ids to bitmaps, but it is less flexible. For example, if you reorder the data or delete some rows, the term-document list will be wrong.
Alternatively, the docidname may be specified in an RC file (say, ibis.rc) in the form of
Note about unbinned indexes: If the explicit minimum and maximum values are known, FastBit IBIS implementation may be able to avoid storing the distinct values. This feature is implemented through ibis::direkte and is invoked through ibis::index::create.
The key parameter to consider when deciding what indexing option to use is the column cardinality, which we will simply call cardinality in the following.
none. The following two might be reasonable choices.
<binning none/><encoding interval/> <binning none/><encoding equality/>The main trade-off between them is that the interval encoding is better suited for range conditions, while the equality encoding is better suited for equality conditions. If you know your query conditions, then you can pick the one or the other. However, if you don't what type of query conditions are going to be used, use interval encoding is a reasonable choice. On the other hand, if you are always going to deal with one-sided range conditions, such as "A < 5" or "A >= 10", then it is best to use the range encoding
<binning none/><encoding range/>In short, if you have a situation where the cardinality is very low, you can choose an encoding based on your query workload.
<binning none/><encoding interval-equality/>This option has been proven to be quite effective in tests and analyses.
<binning none><encoding binary/>This produces what is known as the bit-sliced index. Alternatively, you may bin the values, a situation we will considered next.
If you specify the number of bins, e.g.,
<binning nbins=2000/><encoding interval-equality/>then FastBit builds a set of approximate equal-weight bins, i.e., each bin will contain approximate the same number of rows. This minimize the worst case behavior during query answering. This is appropriate if you don't know what the query conditions are going to be like. However, if you do know something about the query conditions, for example, the precision of the constants used in the query expressions, then the next option is better.
If the constants in the query conditions always have a relatively small number of significant digits, for example, the following query conditions use constants with 2 significant digits, "A > 1.5e6" or "A < 9.6E-3", in which case, the following binning specification will guarantee all such query conditions be answered using index only (therefore fast).
<binning precision=2/><encoding interval-equality/>
A description of the binning, encoding and compression techniques used in FastBit is available in LBNL-59952. Additional research publications can be found in FastBit publication list. Technical details of the FastBit index implementation is fully documented in index.h.