

FastBit Quick Start

This quick start outlines the steps for a few common FastBit tasks: preparing data, querying, and changing indexing options. Only the basic instructions are given here; more detailed instructions and troubleshooting information are provided elsewhere, as indicated.

Preparing Data for FastBit

This section briefly describes the FastBit data format and the tools available for converting ASCII data files.

Utilities for converting data

If your data is in the ASCII format known as Comma Separated Values (CSV), which many database systems use for input and output, you can use the command line tool ardea to convert the CSV files into directories that FastBit can use. (The executable ardea is built by running make all from the top-level directory of the FastBit source code.) The following command line converts the file tests/test0.csv into a FastBit data partition in directory tmp.

examples/ardea -d tmp -m "a:int, b:float, c:short" -t tests/test0.csv

It reads the first column as 32-bit integers with column name a, the second column as 32-bit floating-point values with column name b, and the third column as 16-bit integers with column name c. The resulting binary files and the metadata file are written to directory tmp. The following is a listing of the files in tmp. The files a, b, and c should be exactly 400, 400, and 200 bytes, as shown.

-rw-r--r-- 1 kwu Users 402 Jul 30 13:38 -part.txt
-rw-r--r-- 1 kwu Users 400 Jul 30 13:38 a
-rw-r--r-- 1 kwu Users 400 Jul 30 13:38 b
-rw-r--r-- 1 kwu Users 200 Jul 30 13:38 c
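
The file sizes in this listing follow from the number of rows times the width of each value. The sketch below is only an illustration: the helper names elementBytes and expectedColumnBytes are hypothetical and not part of FastBit, and the row count of 100 is inferred from the sizes shown (400 bytes / 4 bytes per value).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of FastBit: width in bytes of the three
// column types used in the ardea command above.
std::size_t elementBytes(const std::string& type) {
    if (type == "int")   return sizeof(std::int32_t);  // 32-bit integer
    if (type == "float") return sizeof(float);         // 32-bit float on common platforms
    if (type == "short") return sizeof(std::int16_t);  // 16-bit integer
    return 0;  // type not covered by this sketch
}

// A raw binary column file holds one fixed-width value per row,
// so its size is simply width * rows.
std::size_t expectedColumnBytes(const std::string& type, std::size_t rows) {
    const std::size_t width = elementBytes(type);
    assert(width != 0 && "type not covered by this sketch");
    return width * rows;
}
```

With 100 rows, expectedColumnBytes gives 400 for "int" and "float" and 200 for "short", matching the listing. A mismatch after conversion would suggest that rows were skipped or a column type was misdeclared.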

For additional information about preparing data, read dataLoading.html.

FastBit Data organization

FastBit treats data as tables with rows and columns. A large table may be split into many data partitions. To prepare data for FastBit, one needs to build these partitions separately. Each partition is stored in a directory in your file system, with each column stored as a separate file in raw binary form. The name of the data file is the name of the column. A column name can only contain alphanumeric characters plus the underscore (_) and must start with a letter. Furthermore, all column names are case insensitive.
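
The naming rule above can be checked before any files are written. The following is a minimal sketch under the rule as stated here; isValidColumnName and sameColumnName are hypothetical helpers, not FastBit functions, and "letter" is taken to mean an ASCII letter.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: true if the name starts with a letter and contains
// only alphanumeric characters and underscores.
bool isValidColumnName(const std::string& name) {
    if (name.empty()) return false;
    // <cctype> functions need an unsigned char value to be well defined.
    auto uc = [](char c) { return static_cast<unsigned char>(c); };
    if (!std::isalpha(uc(name[0]))) return false;
    return std::all_of(name.begin(), name.end(), [&](char c) {
        return std::isalnum(uc(c)) || c == '_';
    });
}

// Column names are case insensitive, so two names refer to the same
// column if they match after lowercasing.
bool sameColumnName(std::string a, std::string b) {
    auto lower = [](std::string& s) {
        std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
            return static_cast<char>(std::tolower(c));
        });
    };
    lower(a);
    lower(b);
    return a == b;
}
```

For example, f1 and my_col_2 pass the check, while 2col, _x, and a-b do not. Because names are case insensitive, Energy and ENERGY would name the same column and must not both appear in one partition.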

There must be a metadata file named -part.txt in the directory for a data partition. This file contains information such as the name of the partition, the number of rows in the partition, the number of columns, the column names, and so on. Here is an example with the minimal necessary information:

BEGIN HEADER
name=testData
Number_of_rows=1000000
Number_of_columns=2
END HEADER

BEGIN Column
name=f1
data_type=float
END Column

BEGIN Column
name=j2
data_type=unsigned
END Column

Once the binary data files and the metadata file -part.txt are in place, FastBit can make use of the data partition and we can query the data with the command line tool named ibis built from examples/ibis.cpp.
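
When data is generated by a program rather than converted from CSV, the partition directory can be written directly. The sketch below is one possible way to do that, using only the metadata fields shown in the example above. The helpers makePartTxt and writeColumn are hypothetical, and the sketch assumes that values are stored in the machine's native byte order and that the target directory already exists.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: builds the text of a minimal -part.txt.
// Each column is given as a (name, data_type) pair.
std::string makePartTxt(const std::string& partName, std::size_t rows,
                        const std::vector<std::pair<std::string, std::string>>& columns) {
    std::ostringstream out;
    out << "BEGIN HEADER\n"
        << "name=" << partName << "\n"
        << "Number_of_rows=" << rows << "\n"
        << "Number_of_columns=" << columns.size() << "\n"
        << "END HEADER\n";
    for (const auto& col : columns) {
        out << "\nBEGIN Column\n"
            << "name=" << col.first << "\n"
            << "data_type=" << col.second << "\n"
            << "END Column\n";
    }
    return out.str();
}

// Hypothetical helper: writes one column as raw binary values, one
// fixed-width value per row, into the file dir/name.
// Assumption of this sketch: native byte order, directory already exists.
template <typename T>
bool writeColumn(const std::string& dir, const std::string& name,
                 const std::vector<T>& values) {
    std::ofstream out(dir + "/" + name, std::ios::binary);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(T)));
    return static_cast<bool>(out);
}
```

Calling makePartTxt("testData", 1000000, {{"f1", "float"}, {"j2", "unsigned"}}) reproduces the metadata example above; the returned text would be saved as -part.txt next to the files f1 and j2 written by writeColumn.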

FastBit typically reads the whole data file for a column into memory when any part of the file is needed, which limits the number of rows that can practically be stored in a partition. In addition, the number of rows in a partition is recorded internally as a 32-bit unsigned integer, which imposes a hard upper bound of about 2^32 rows per partition. For a machine with a few gigabytes of memory, we recommend that a data partition contain between 1 million and 100 million rows.
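
The sizing advice above can be turned into a quick back-of-the-envelope check. This is only a sketch: the function names are hypothetical, and the limit is taken to be the largest value a 32-bit unsigned integer can hold.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Bytes needed to hold one whole column in memory, since a column file
// is typically read in full.
std::uint64_t columnBytes(std::uint64_t rows, std::uint64_t bytesPerValue) {
    return rows * bytesPerValue;
}

// Hard limit: the row count must fit in a 32-bit unsigned integer.
bool fitsRowCounter(std::uint64_t rows) {
    return rows <= std::numeric_limits<std::uint32_t>::max();
}

// Recommended range from the text: 1 million to 100 million rows.
bool withinRecommendedRange(std::uint64_t rows) {
    return rows >= 1000000ULL && rows <= 100000000ULL;
}
```

For instance, a partition at the top of the recommended range holds 100 million rows, so a single column of 4-byte values occupies 400 million bytes when loaded. A query that touches several such columns stays within a few gigabytes, which is consistent with the recommendation.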

Larger samples and more usage examples

Querying Existing Data

After preparing a data partition, we can try some queries. Both ibis and thula can be used to accomplish this task. Assuming directory tmp has been prepared with the above ardea command line, the following two commands should both produce 1 hit.

examples/ibis -d tmp -q "where a = 0"
examples/thula -d tmp -w "a = 0"

The following two commands both produce 9 hits.

examples/ibis -d tmp -q "where a = b and c < 10"
examples/thula -d tmp -w "a = b and c < 10"

The above commands directly use the data directories specified on the command line. To specify more than one directory per command, or to specify additional parameters that control the execution of FastBit, one may use a configuration file. For example, the following configuration file specifies two data directories and tells FastBit to store temporary files in /tmp/FastBitCache.

timestep1.dataDir=/data/jwu/ts1
timestep2.dataDir=/data/jwu/ts2
CacheDirectory=/tmp/FastBitCache
The configuration file can be passed to any command line tool with option -c rc-filename.
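
To make the layout of such a file concrete, the sketch below reads name=value lines and collects the entries whose names end in .dataDir. It is not FastBit's own parser; parseRc and dataDirs are hypothetical helpers, and the sketch assumes one entry per line with no surrounding whitespace, as in the example above.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits each "name=value" line at the first '='.
// Lines without '=' or with an empty name are skipped.
std::map<std::string, std::string> parseRc(const std::string& text) {
    std::map<std::string, std::string> entries;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        const std::string::size_type eq = line.find('=');
        if (eq == std::string::npos || eq == 0) continue;
        entries[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return entries;
}

// Hypothetical helper: returns the values of all entries whose name ends
// in ".dataDir", in the sorted order of their names.
std::vector<std::string> dataDirs(const std::map<std::string, std::string>& entries) {
    const std::string suffix = ".dataDir";
    std::vector<std::string> dirs;
    for (const auto& kv : entries) {
        const std::string& key = kv.first;
        if (key.size() >= suffix.size() &&
            key.compare(key.size() - suffix.size(), suffix.size(), suffix) == 0) {
            dirs.push_back(kv.second);
        }
    }
    return dirs;
}
```

Applied to the three lines above, dataDirs returns /data/jwu/ts1 and /data/jwu/ts2, and the CacheDirectory entry maps to /tmp/FastBitCache.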

With all command line tools, a useful option is -h, which will cause them to print out usage information. Another useful option is -v, which can be used to instruct them to print out more information about their progress. Multiple -v options can be used if more verbose output is desired.

Note that both ibis and thula answer queries, but they exercise different interfaces to the underlying indexing functions. In addition, ibis supports many more operations. A description of these functions is available in ibisCommandLine.html.

Controlling the Indexes

FastBit implements more than a dozen different bitmap indexes and offers a number of ways to control which one is used for query processing. An easy way to build an index is to use the ibis program, for example:

examples/ibis -v -d tmp -b "<binning none/><encoding equality/>"

This command builds the basic bitmap index (no binning, equality encoding). Other possible indexes are described in indexSpec.html.
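
To give a feel for what "no binning, equality encoding" means, the sketch below builds one bitmap per distinct value, where bit i of a bitmap is set when row i holds that value. It illustrates the general idea only and is not FastBit's implementation: FastBit compresses its bitmaps, while this sketch keeps them uncompressed, and the names buildEqualityIndex and countEqual are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// One uncompressed bitmap with one bit per row.
using Bitmap = std::vector<bool>;

// Builds an equality-encoded index: one bitmap for each distinct value.
std::map<std::int32_t, Bitmap> buildEqualityIndex(const std::vector<std::int32_t>& column) {
    std::map<std::int32_t, Bitmap> index;
    for (std::size_t row = 0; row < column.size(); ++row) {
        Bitmap& bits = index[column[row]];
        if (bits.empty()) bits.assign(column.size(), false);  // first time this value is seen
        bits[row] = true;
    }
    return index;
}

// Answers "column = value" by counting the set bits in one bitmap,
// without looking at the column data again.
std::size_t countEqual(const std::map<std::int32_t, Bitmap>& index, std::int32_t value) {
    const auto it = index.find(value);
    if (it == index.end()) return 0;
    return static_cast<std::size_t>(std::count(it->second.begin(), it->second.end(), true));
}
```

For the column {0, 1, 1, 2, 1}, the index holds three bitmaps, and countEqual reports 3 hits for the value 1. Conditions joined with "and" can be answered by combining the corresponding bitmaps bit by bit.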

When an index exists, the above command does not check whether the existing index is of the specified type. To remove the existing indexes and build new ones, add -z in addition to option -b. The existing query processing function can only work with one index per column (per data partition). There is no way to build multiple indexes for one column in a data partition.

Alternatively, one may manually remove all index (*.idx) files from a directory, or edit the file -part.txt to specify indexing options for the whole partition or for a specific column. For example, the following -part.txt modifies the one above by adding an indexing option for the whole partition and another for column j2.

BEGIN HEADER
name=testData
Number_of_rows=1000000
Number_of_columns=2
index=<binning precision=2/><encoding equality/>
END HEADER

BEGIN Column
name=f1
data_type=float
END Column

BEGIN Column
name=j2
data_type=unsigned
index=<binning none/><encoding equality/>
END Column

Calling FastBit Functions

All useful FastBit functions and classes are in the namespace ibis. There are two levels of interfaces: one more abstract and the other more concrete. The abstract interface is implemented as class ibis::table, and the concrete interface is implemented as class ibis::part (shorthand for partition). We next briefly describe a handful of key functions from these two classes. Note that before doing anything else, one should call ibis::init to read the configuration file if there is one.

Class ibis::table

One normally instantiates an ibis::table object by calling the function ibis::table::create with the data directory as the argument. This is done in file examples/thula.cpp.

There are two key functions: ibis::table::estimate, which gives a lower and an upper bound on the number of hits, and ibis::table::select, which produces another table containing the selected values. Both of these functions are documented in table.h. The following snippet of source code builds an ibis::table object from directory tmp and estimates the number of hits for the query condition "a = 0".

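#include "table.h"                              // declares ibis::table
ibis::table *tbl = ibis::table::create("tmp");  // build a table from directory tmp
uint64_t nhmin, nhmax;                          // lower and upper bounds on the number of hits
tbl->estimate("a = 0", nhmin, nhmax);           // fill in the two bounds
delete tbl;                                     // release the table when done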
For a more realistic example, see file examples/thula.cpp.

Additional information about the functions defined in ibis::table and related classes can be found in table.h.

Class ibis::part

The more concrete interface is represented by class ibis::part. Each ibis::part corresponds to exactly one data directory as described above. Currently, the constructor of ibis::part requires two arguments, both of which are directory names. It is safe to pass the data directory name as the first argument and a null pointer as the second; see examples/ibis.cpp for an example.

To process any query with ibis::part, one needs to instantiate ibis::query objects. The following code snippet demonstrates how to create an ibis::part from directory tmp, an ibis::query object for the query condition "a = b and c < 10", then evaluate the query and find out the number of hits.

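ibis::part par("tmp", 0);                 // data partition stored in directory tmp
ibis::query que("username", &par);        // query object bound to that partition
que.setWhereClause("a = b and c < 10");   // set the query condition
que.evaluate();                           // evaluate the query
long nhits = que.getNumHits();            // number of rows satisfying the condition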
The class ibis::part is documented in part.h, and the class ibis::query is documented in query.h. A more realistic example can be found in examples/ibis.cpp.

Class ibis::index

The most interesting component of FastBit is the class hierarchy of ibis::index. A potentially useful task is to extend these compressed bitmap indexes in this class hierarchy. The class ibis::index is documented in index.h. Inside FastBit, an index is created through the class ibis::column::indexLock, which in turn calls ibis::column::loadIndex, which then invokes the index factory ibis::index::create. The main reason for using ibis::column::indexLock is to track how many queries are using an index simultaneously. It is possible to bypass all these layers of code by directly invoking the constructor of a concrete index class.

There are two types of indexes in FastBit: binned and unbinned. The binned classes all derive from ibis::bin, and most of the unbinned classes derive from ibis::relic, which implements the basic bitmap index. The class ibis::bin is defined in ibin.h and the class ibis::relic is defined in irelic.h. Note that all concrete index classes are defined in files starting with the letter i.
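
The difference between the two types can be illustrated with a small sketch of a binned index, which keeps one bitmap per range of values rather than one per distinct value. This is an illustration of the general technique, not the code in ibin.h. The names BinnedIndex, buildBinnedIndex, and estimateLessThan are hypothetical, the bitmaps are uncompressed, and the sketch assumes the bin boundaries cover every value in the column. It also shows one way a lower and an upper bound on the number of hits can arise from an index alone.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct BinnedIndex {
    std::vector<double> bounds;              // bin i covers [bounds[i], bounds[i+1])
    std::vector<std::vector<bool>> bitmaps;  // bitmaps[i][row] is set if the row falls in bin i
};

// Assigns every row to the bin that contains its value.
BinnedIndex buildBinnedIndex(const std::vector<double>& column,
                             const std::vector<double>& bounds) {
    assert(bounds.size() >= 2);
    BinnedIndex idx;
    idx.bounds = bounds;
    idx.bitmaps.assign(bounds.size() - 1, std::vector<bool>(column.size(), false));
    for (std::size_t row = 0; row < column.size(); ++row) {
        // upper_bound finds the first boundary greater than the value;
        // the containing bin starts one boundary earlier.
        const auto it = std::upper_bound(bounds.begin(), bounds.end(), column[row]);
        if (it == bounds.begin() || it == bounds.end()) continue;  // outside all bins
        idx.bitmaps[static_cast<std::size_t>(it - bounds.begin()) - 1][row] = true;
    }
    return idx;
}

// Bounds the number of rows with value < v using only the bitmaps.
// Bins entirely below v are sure hits; the bin containing v holds
// candidates that would need a check against the raw data.
std::pair<std::size_t, std::size_t> estimateLessThan(const BinnedIndex& idx, double v) {
    std::size_t minHits = 0, maxHits = 0;
    for (std::size_t i = 0; i < idx.bitmaps.size(); ++i) {
        const std::size_t n = static_cast<std::size_t>(
            std::count(idx.bitmaps[i].begin(), idx.bitmaps[i].end(), true));
        if (idx.bounds[i + 1] <= v) {
            minHits += n;
            maxHits += n;
        } else if (idx.bounds[i] < v) {
            maxHits += n;
        }
    }
    return {minHits, maxHits};
}
```

With the column {0.5, 1.5, 2.5, 3.5, 1.2} and boundaries {0, 1, 2, 3, 4}, the condition "value < 1.3" yields a lower bound of 1 and an upper bound of 3, while the exact answer is 2. When the query constant falls on a bin boundary, as in "value < 2.0", the two bounds coincide and no raw data needs to be read.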