FastBit Front Page | Research Publications | Software Documentation | Software Download | Software License |
Organization: LBNL » CRD » SDM » FastBit » Documentation » Indexing Options
The process of building a bitmap index is conceptually divided into three steps, binning, encoding, and compressing. The options for each of these steps are controlled separately. The options string has the following format
index=<binning ... /> <encoding ... /> <compressing ... />
NOTE: all keywords, such as binning, encoding and compressing, should be written lower case.
NOTE: do not leave any space between keywords and the equal sign (=) following them.
<binning />
option can be
one of the following where the separator can be either space or comma
and no space is allowed around equal signs
default
<binning
precision=2 />
.
<binning
/>
) is equivalent to this default binning option.
none
[nbins=xxx, ]binFile="xxx"
nbins=xxx, equal-weight
nbins
. This options will
cause the data to be scanned twice once for generating the
histogram and once to actually build the index. To allow the
software to sample the data instead of building the full histogram,
omit 'equal-weight
'.
precision=dd
dd
significant
digits. Any queries involving constants with no more than
dd
significant digits will be answered precisely with
this index. Queries involving constants with more than
dd
significant digits will not be full answered with
the index. FastBit will have to scan the base data to fully
resolve them.
start=xxx, end=xxx, nbins=xxx, scale=linear|log|simple
[start, end]
into
nbins
either using log scale or linear scale. If
scale is not specified, it is assumed to be linear, unless start
and end are specified and are both positive and start <
end/nbins
. In which case, scale is assumed to be
logarithm. This assumption is reasonable since the user has gone
through the trouble of specifying a small start value which
otherwise can more conveniently be expressed as zero.
start
or end
is missing, we
scan the data once to determine the actual minimum and maximum
value. If end is not specified, it will be the maximum value, if
the start is not specified, it will be the minimum value. For
integer variables, if (end-start) < nbins
, each
integer value would be placed in its own bin. Two additional bins
are also generated, one for values less than start and the other
for values larger than end.
[start, end]
includes start
but excludes end
. Each
bin also has the same convention, i.e., it includes the left
boundary but excludes the right boundary. The only exception is
the left most bin which is defined to be everything less than
start
. All fields in the format are optional. If
nbins
is not specified, it is assumed to be 10,000.
scale
' option is not specified, the current
software samples about 100 data points per bin and generates a set
of equal-weight bins based on the sampled data. To get the basic
equal-width bins, "scale=linear
" must be specified.
scale
field is parsed as follows. If
the first letter after the equal sign is 'l' or 'L', the value is
assumed to either 'linear' or 'log'. If the second letter is 'o'
or 'O', it will be treated as 'log', otherwise, it is assumed to
be 'linear'. If 'scale=
' is present, but the letter
following the equal sign is not 'l', then the option is assumed to
be 'scale=simple
'. With 'linear
' and
'log
' options, the current software attempts to
generate bin boundaries that have short decimal point
representation. The option 'simple
' tell the
software to generate a simpler equal-width bin boundaries without
attempting to reduce the decimal representation of the bin
boundaries.
start
', 'end
',
'nbins
' and 'scale
' must be spelled as
shown here. The parser for dealing with bin specifications is very
limited!
(start, end, nbins, scale)(start, end, nbins, scale)...
[start, end)
, it generates nbins
if
possible. The values start
and end
must
be specified. The default value for nbins
in this
case is one (1) and the default value for scale
is
linear
.
(start=x1, end=x2, nbins=n1,
scale=linear)
, but simply (x1, x2, n1,
linear)
. The restriction is that the four fields must
appear in the above order.
When binning is used, it is possible to also produce a set of file
containing reordered values (according to the bin number) so as to
reduce the time required for candidate check. Add
'reorder
' to the binning specification to invoke this
option. At this point, reorder only works with 1-component equality
encoding and 1-component range encoding.
equality
range
interval
ncomp=ddd
ncomp
as the
secondary option to indicate the number of components in the
encoding. When this option is not specified, the number of
components is taken to be one. The maximum number of components
is the logarithm of the number of bins, which leads to binary
encoding. However, the special implementation of the binary
encoding has further optimization to take advantage of the fact
that every component has basis size of two.
equality-equality
interval-equality
range-equality
uncompressDenseBitmaps [density > ddd]
uncompressLargeBitmaps [compration > ddd]
uncompressAll
index=none
index=basic
index=binary
index=bit-slice
index=bak2 [precision=p]
index=bak [precision=p]
bak
' and 'bak2
'
is that 'bak2
' splits each bin produced by
'bak
' in two so that the 2-digit decimal value itself
is a bin boundary. Normally, one should use 'bak2
'
rather than 'bak
'.
<binning
precision=d/>
.
index=bin/bin
index=bin/range
index=range/bin
index=range/range
<encoding equality-equality/>
,
<encoding interval-equality/>
or
<encoding range-equality/>
instead.
term: docid, docid, docidSince the document identifiers (docid) are provided as values only, FastBit needs to know the column name of the document ids to correctly translate the tdlist into bitmaps. The recommended way to specify the name of document identifier is to use an index entry in file
-part.txt
, such as
index=keywords, docidname=midNote: the document identifiers must be 4-byte integers. If the docidname field is not specified, the integer values in the term-document list are taken to be the row numbers (starting from 0). This alternative option can be potentially faster because it requires less internal lookups to map the ids to bitmaps, but it is less flexible. For example, if you reorder the data or delete some rows, the term-document list will be wrong.
Alternatively, the docidname may be specified in an RC file (say, ibis.rc) in the form of
table-name.column-name.docIDName=mid
Note about unbinned indexes: If the explicit minimum and maximum values are known, FastBit IBIS implementation may be able to avoid storing the distinct values. This feature is implemented through ibis::direkte and is invoked through ibis::index::create.
The key parameter to consider when deciding what indexing option to use is the column cardinality, which we will simply call cardinality in the following.
none
. The following two might be
reasonable choices.
<binning none/><encoding interval/> <binning none/><encoding equality/>The main trade-off between them is that the interval encoding is better suited for range conditions, while the equality encoding is better suited for equality conditions. If you know your query conditions, then you can pick the one or the other. However, if you don't what type of query conditions are going to be used, use interval encoding is a reasonable choice. On the other hand, if you are always going to deal with one-sided range conditions, such as "A < 5" or "A >= 10", then it is best to use the range encoding
<binning none/><encoding range/>In short, if you have a situation where the cardinality is very low, you can choose an encoding based on your query workload.
<binning none/><encoding interval-equality/>This option has been proven to be quite effective in tests and analyses.
<binning none><encoding binary/>This produces what is known as the bit-sliced index. Alternatively, you may bin the values, a situation we will considered next.
If you specify the number of bins, e.g.,
<binning nbins=2000/><encoding interval-equality/>then FastBit builds a set of approximate equal-weight bins, i.e., each bin will contain approximate the same number of rows. This minimize the worst case behavior during query answering. This is appropriate if you don't know what the query conditions are going to be like. However, if you do know something about the query conditions, for example, the precision of the constants used in the query expressions, then the next option is better.
If the constants in the query conditions always have a relatively small number of significant digits, for example, the following query conditions use constants with 2 significant digits, "A > 1.5e6" or "A < 9.6E-3", in which case, the following binning specification will guarantee all such query conditions be answered using index only (therefore fast).
<binning precision=2/><encoding interval-equality/>
A description of the binning, encoding and compression techniques used in FastBit is available in LBNL-59952. Additional research publications can be found in FastBit publication list. Technical details of the FastBit index implementation is fully documented in index.h.