This quick start outlines the steps for a few tasks that one might do with the FastBit software, such as preparing data, querying, and changing indexing options. Only the basic instructions are given here; more detailed instructions for these tasks and information on troubleshooting are provided elsewhere, as indicated.
This section briefly describes the FastBit data format and the tools available for converting ASCII data files.
If you have your data in an ASCII format known as Comma Separated Values
(CSV), which many database systems use for input and output, you can
use the command line tool ardea to convert the CSV files into directories
that can be used by FastBit. (The executable ardea is built by the command
make all, run from the top level directory of the FastBit source code.)
The following command line converts the file tests/test0.csv to a FastBit
data partition in directory tmp.
examples/ardea -d tmp -m "a:int, b:float, c:short" -t tests/test0.csv
It reads the first column as 32-bit integers with column name a, the
second column as 32-bit floating-point values with column name b, and the
third column as 16-bit integer values with column name c. The resulting
binary files and the metadata file are written to directory tmp. The
following is a listing of the files in tmp. The file sizes of a, b and c
should be exactly 400 bytes, 400 bytes and 200 bytes, as shown.
-rw-r--r-- 1 kwu Users 402 Jul 30 13:38 -part.txt
-rw-r--r-- 1 kwu Users 400 Jul 30 13:38 a
-rw-r--r-- 1 kwu Users 400 Jul 30 13:38 b
-rw-r--r-- 1 kwu Users 200 Jul 30 13:38 c
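As a quick sanity check on the sizes in the listing, the size of each raw column file should equal the number of rows times the width of one value. The sketch below assumes the test file holds 100 rows (inferred from 400 bytes at 4 bytes per value, not stated in the text). The helpers elementWidth and expectedColumnFileSize are hypothetical names for illustration only; they are not part of FastBit and cover just the three type names used in the ardea command.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a FastBit function): width in bytes of one value
// for the three type names used in the ardea example.
std::size_t elementWidth(const std::string& type) {
    if (type == "int") return sizeof(std::int32_t);    // 32-bit integer
    if (type == "float") return sizeof(float);         // 32-bit floating point
    if (type == "short") return sizeof(std::int16_t);  // 16-bit integer
    throw std::invalid_argument("unsupported type: " + type);
}

// Expected size in bytes of a raw binary column file holding `rows` values.
std::uint64_t expectedColumnFileSize(std::uint64_t rows, const std::string& type) {
    return rows * elementWidth(type);
}
```

Under the 100-row assumption, calling expectedColumnFileSize(100, "short") gives the 200 bytes listed for column c.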
For additional information about preparing data, read dataLoading.html.
FastBit treats data as tables with rows and columns. A large table may be partitioned into many data partitions. To prepare data for FastBit, one needs to build these partitions separately. Each partition is stored in a directory in your file system, with each column stored as a separate file in raw binary form. The name of the data file is the name of the column. A column name can only contain alphanumeric characters plus the underscore (_) and must start with a letter. Furthermore, all column names are case insensitive.
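The naming rule above is easy to check before writing any column files. The following is a minimal sketch, assuming ASCII names; isValidColumnName and sameColumnName are hypothetical helpers introduced here for illustration and are not FastBit functions.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of FastBit): true if `name` starts with a
// letter and contains only letters, digits, or underscores.
bool isValidColumnName(const std::string& name) {
    if (name.empty()) return false;
    if (!std::isalpha(static_cast<unsigned char>(name[0]))) return false;
    return std::all_of(name.begin(), name.end(), [](unsigned char ch) {
        return std::isalnum(ch) || ch == '_';
    });
}

// Column names are case insensitive, so compare them after lowercasing.
bool sameColumnName(std::string lhs, std::string rhs) {
    auto lower = [](std::string& s) {
        std::transform(s.begin(), s.end(), s.begin(), [](unsigned char ch) {
            return static_cast<char>(std::tolower(ch));
        });
    };
    lower(lhs);
    lower(rhs);
    return lhs == rhs;
}
```

Because names are case insensitive, two columns whose names differ only in case would refer to the same column, so a check like sameColumnName can catch such clashes early.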
There must be a metadata file named -part.txt in the directory for a data
partition. This file contains information such as the name of the
partition, the number of rows in the partition, the number of columns,
column names and so on. Here is an example with the minimal necessary
information:
BEGIN HEADER
name=testData
Number_of_rows=1000000
Number_of_columns=2
END HEADER
BEGIN Column
name=f1
data_type=float
END Column
BEGIN Column
name=j2
data_type=unsigned
END Column
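When data files are produced by a custom program rather than ardea, the metadata file has to be written by hand or generated. The sketch below renders the minimal fields from the example above, assuming one entry per line. The function makePartMetadata is a hypothetical helper, not a FastBit function, and it emits only the fields shown in this document.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not a FastBit function): renders the minimal metadata
// from the example. Each pair holds a column name and its data_type string.
std::string makePartMetadata(
    const std::string& partitionName, std::uint64_t rows,
    const std::vector<std::pair<std::string, std::string>>& columns) {
    std::ostringstream out;
    out << "BEGIN HEADER\n"
        << "name=" << partitionName << "\n"
        << "Number_of_rows=" << rows << "\n"
        << "Number_of_columns=" << columns.size() << "\n"
        << "END HEADER\n";
    for (const auto& col : columns) {
        out << "BEGIN Column\n"
            << "name=" << col.first << "\n"
            << "data_type=" << col.second << "\n"
            << "END Column\n";
    }
    return out.str();
}
```

The returned string would then be written to a file named -part.txt in the partition directory, next to the raw column files.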
Once the binary data files and the metadata file -part.txt are in place,
FastBit can make use of the data partition and we can query the data with
the command line tool named ibis, built from examples/ibis.cpp.
FastBit typically reads the whole data file containing a column into memory when any part of the file is needed. This imposes a practical limit on the number of rows that can be stored in a partition. In addition, the size of a partition is recorded internally as a 32-bit unsigned integer, which sets a hard upper bound of 2^32 rows per partition. For a machine with a few gigabytes of memory, we recommend that a data partition contain between 1 million and 100 million rows.
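To make the sizing advice concrete, the sketch below estimates the memory needed to hold one column and checks a proposed row count against the 32-bit counter. It is only an illustration: kMaxRowsPerPartition, columnBytes and fitsPartition are hypothetical names, the memory budget is chosen by the caller, and the constant is the largest value a 32-bit unsigned integer can hold (2^32 - 1).

```cpp
#include <cstdint>
#include <limits>

// Largest row count a 32-bit unsigned counter can represent (2^32 - 1).
constexpr std::uint64_t kMaxRowsPerPartition =
    std::numeric_limits<std::uint32_t>::max();

// Bytes needed to hold one whole column in memory, since a column file is
// typically read in full.
constexpr std::uint64_t columnBytes(std::uint64_t rows,
                                    std::uint64_t bytesPerValue) {
    return rows * bytesPerValue;
}

// Hypothetical check (not a FastBit function): the row count must fit the
// 32-bit counter and one column must fit a caller-chosen memory budget.
constexpr bool fitsPartition(std::uint64_t rows, std::uint64_t bytesPerValue,
                             std::uint64_t memoryBudgetBytes) {
    return rows <= kMaxRowsPerPartition &&
           columnBytes(rows, bytesPerValue) <= memoryBudgetBytes;
}
```

For example, a column of 4-byte values with 100 million rows occupies 400 million bytes, which fits comfortably in a 1 GiB budget, whereas 5 billion rows exceeds the counter regardless of available memory.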
Larger samples and more usage examples
After preparing a data partition, we can try some queries. Both programs
ibis and thula can be used to accomplish this task. Assuming directory tmp
has been prepared with the above ardea command line, the following two
commands should both produce 1 hit.
examples/ibis -d tmp -q "where a = 0"
examples/thula -d tmp -w "a = 0"
The following two commands both produce 9 hits.
examples/ibis -d tmp -q "where a = b and c < 10"
examples/thula -d tmp -w "a = b and c < 10"
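A hit is simply a row that satisfies the whole condition. To cross-check the counts reported by the tools, one could scan the column values directly, as in the brute-force sketch below. The function countHits is a hypothetical helper, the column types follow the ardea example (int, float, short), and comparing a and b as doubles is an assumption of this sketch rather than a statement about how FastBit compares mixed types.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Brute-force reference (not FastBit code): counts the rows where
// a == b and c < 10 by scanning all three columns.
std::size_t countHits(const std::vector<std::int32_t>& a,
                      const std::vector<float>& b,
                      const std::vector<std::int16_t>& c) {
    const std::size_t rows = std::min({a.size(), b.size(), c.size()});
    std::size_t hits = 0;
    for (std::size_t i = 0; i < rows; ++i) {
        // Assumption: compare the integer and float values as doubles.
        const bool equal =
            static_cast<double>(a[i]) == static_cast<double>(b[i]);
        if (equal && c[i] < 10) ++hits;
    }
    return hits;
}
```

A scan like this touches every row, which is the work the bitmap indexes are meant to avoid, so it is useful only as a correctness check on small data.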
The above commands directly use the data directories specified on the
command line. To specify more than one directory per command or to
specify additional parameters to control the execution of FastBit, one
may use a configuration file. For example, the following configuration
file specifies two data directories and tells FastBit to store temporary
files in /tmp/FastBitCache.
timestep1.dataDir=/data/jwu/ts1
timestep2.dataDir=/data/jwu/ts2
CacheDirectory=/tmp/FastBitCache
The configuration file can be passed to any command line tool with option
-c rc-filename.
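The configuration file shown above is a list of key=value lines, where a key may carry a prefix such as timestep1. The sketch below reads such lines into a map, which can be handy for inspecting or generating these files from another program. It is a hypothetical reader named parseRcFile, not FastBit's own parser: it does not trim whitespace and simply skips lines without an equals sign.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical reader (not FastBit's own parser): collects key=value lines
// such as "timestep1.dataDir=/data/jwu/ts1" into a map, skipping lines that
// have no '=' or an empty key.
std::map<std::string, std::string> parseRcFile(std::istream& in) {
    std::map<std::string, std::string> params;
    std::string line;
    while (std::getline(in, line)) {
        const std::string::size_type eq = line.find('=');
        if (eq == std::string::npos || eq == 0) continue;
        params[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return params;
}
```

Feeding the three lines of the example file through parseRcFile yields three entries, with CacheDirectory mapped to /tmp/FastBitCache.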
With all command line tools, a useful option is -h, which will cause them
to print out usage information. Another useful option is -v, which can be
used to instruct them to print out more information about their progress.
Multiple -v options can be used if more verbose output is desired.
Note that both ibis and thula answer queries, but they exercise different
interfaces to the underlying indexing functions. In addition, ibis
supports many more operations. A description of these operations is
available in ibisCommandLine.html.
FastBit implements more than a dozen different bitmap indexes and also
offers a number of ways to control which one is used for query
processing. An easy way to build an index is to use the ibis program, for
example
examples/ibis -v -d tmp -b "<binning none/><encoding equality/>"
which builds the basic bitmap index (no binning, equality encoding).
Other possible indexes are described in indexSpec.html.
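To illustrate what "no binning, equality encoding" means, the sketch below builds one bitmap per distinct value of a column, with bit i set when row i holds that value. This is a textbook illustration only, not FastBit's implementation: FastBit stores its bitmaps in compressed form, whereas this sketch keeps them uncompressed, and the names BitmapIndex, buildEqualityIndex and countEqual are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Uncompressed equality-encoded index without binning: one bitmap per
// distinct value, where bit i is set when row i holds that value.
using BitmapIndex = std::map<std::int32_t, std::vector<bool>>;

BitmapIndex buildEqualityIndex(const std::vector<std::int32_t>& column) {
    BitmapIndex index;
    for (std::size_t row = 0; row < column.size(); ++row) {
        std::vector<bool>& bits = index[column[row]];
        // First time this value is seen: allocate an all-zero bitmap.
        if (bits.empty()) bits.assign(column.size(), false);
        bits[row] = true;
    }
    return index;
}

// Answers "column = value" by counting the set bits of a single bitmap.
std::size_t countEqual(const BitmapIndex& index, std::int32_t value) {
    const auto it = index.find(value);
    if (it == index.end()) return 0;
    std::size_t hits = 0;
    for (bool bit : it->second) hits += bit ? 1 : 0;
    return hits;
}
```

With this layout an equality condition touches only one bitmap, which is why the scheme suits columns with a modest number of distinct values; binning trades some precision for fewer bitmaps when there are many.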
When an index exists, the above command does not check whether the
existing index is of the specified type. To remove the existing indexes
and build new ones, add -z in addition to option -b. The existing query
processing function can only work with one index per column (per data
partition). There is no way to build multiple indexes for one column in a
data partition.
Alternatively, one may manually remove all index (*.idx) files from a
directory, or edit the file -part.txt to specify indexing options for the
whole partition or for a specific column. For example, the following
-part.txt modifies the above one by adding an indexing option for the
whole partition and one for column j2.
BEGIN HEADER
name=testData
Number_of_rows=1000000
Number_of_columns=2
index=<binning precision=2/><encoding equality/>
END HEADER
BEGIN Column
name=f1
data_type=float
END Column
BEGIN Column
name=j2
data_type=unsigned
index=<binning none/><encoding equality/>
END Column
All useful FastBit functions and classes are in the namespace ibis. There
are two levels of interfaces: one is more abstract and the other is more
concrete. The abstract interface is implemented as class ibis::table, and
the concrete interface is implemented as class ibis::part (a shorthand
for partition). We will next briefly describe a handful of key functions
from these two classes. Note that before doing anything, one should call
ibis::init to read the configuration file if there is one.
ibis::table
One normally instantiates an ibis::table object by calling the function
ibis::table::create with the data directory as the argument. This is done
in file examples/thula.cpp.
There are two key query functions: ibis::table::estimate, which gives a
lower and an upper bound on the number of hits, and ibis::table::select,
which produces another table containing the selected values. Both of
these functions are documented in table.h.
The following is a snippet of source code that builds an ibis::table
object from directory tmp and estimates the number of hits for the query
condition "a = 0".
ibis::table *tbl = ibis::table::create("tmp");  // open the data in directory tmp
uint64_t nhmin, nhmax;                          // lower and upper bounds on hits
tbl->estimate("a = 0", nhmin, nhmax);
For a more realistic example, see file examples/thula.cpp.
Additional information about the functions defined in ibis::table and
related classes can be found in table.h.
ibis::part
The more concrete interface is represented by class ibis::part. Each
ibis::part corresponds to exactly one data directory as described above.
Currently, the constructor of ibis::part takes two arguments, both
directory names. It is safe to pass the data directory name as the first
argument and a null pointer as the second argument; see examples/ibis.cpp
for an example.
To process any query with ibis::part, one needs to instantiate
ibis::query objects. The following code snippet demonstrates how to
create an ibis::part from directory tmp and an ibis::query object for the
query condition "a = b and c < 10", then evaluate the query and find out
the number of hits.
ibis::part par("tmp", 0);
ibis::query que("username", &par);
que.setWhereClause("a = b and c < 10");
que.evaluate();
long nhits = que.getNumHits();
The class ibis::part is documented in part.h, and the class ibis::query
is documented in query.h. A more realistic example can be found in
examples/ibis.cpp.
ibis::index
The most interesting component of FastBit is the class hierarchy of
ibis::index. A potentially useful task is to extend the compressed bitmap
indexes in this class hierarchy. The class ibis::index is documented in
index.h.
Inside FastBit, an index is created through the class
ibis::column::indexLock, which in turn calls ibis::column::loadIndex,
which then invokes the index factory ibis::index::create. The main reason
for using ibis::column::indexLock is to track how many queries are using
an index simultaneously. It is possible to bypass all these layers of
code by directly invoking the constructor of a concrete index class.
There are two types of indexes in FastBit, binned and unbinned. The
binned classes all derive from ibis::bin, and most of the unbinned
classes derive from ibis::relic, which implements the basic bitmap index.
The class ibis::bin is defined in ibin.h and the class ibis::relic is
defined in irelic.h.
Note that all concrete index classes are defined in files starting with
the letter i.