In quickstart.html, brief instructions for loading data into FastBit were given. This document gives further details for those who want to know the internal structures in order to make more extensive use of FastBit.
Sample data and usage examples
Recall that FastBit stores a data table in multiple partitions and each
partition is stored in one directory on the file system. Next, we will
first briefly describe how to use ardea to create this directory, then
give more details of the files in this directory, and finally end with
some advice on how different functions can be used to integrate data
into existing tables.
Starting with data in ASCII form in the Comma-Separated Values (CSV)
format, there is a command line tool named ardea that can digest the
data and create the raw binary files and metadata files for FastBit.
Very briefly, a CSV file contains the values of a data table in ASCII
format with commas (and possibly white spaces) as delimiters. It has all
columns of a row on one line, and usually contains neither the names nor
the types of the columns. Therefore it is necessary to supply this
information from elsewhere. The executable ardea has the following
options that deal with reading CSV data.
-d output-dir
Only one -d option is expected; if more than one is specified, the last
one overrides all previous ones.
-n name-of-partition
-m name:type[, name:type, ...]
If multiple -m options are specified, they are concatenated together.
-t text-filename
Multiple -t options may be used to specify multiple text files.
-r a-row-in-text-form
Multiple -r options may be used. In ardea, text files are processed
before the individual rows specified by option -r.
-M metadatafilename
The metadata in this file may be written in the same form as the -m
option, or in the more verbose form used in '-part.txt' files.
-tags "name1=value1, name2=value2, ..."
-b delimiters
-b ,
"
explicitly. Consecutive appearances of delimiters are treated as a
single delimiter. There is NO support for multi-character delimiters.
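The name:type form accepted by -m, and the rule that several -m options are concatenated, can be illustrated with a short sketch. This is not the parser used by ardea; the function name parseColumnSpecs and the ColumnSpec alias are hypothetical, and the sketch only splits the text into (name, type) pairs without interpreting the type.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper type: one (column name, type string) pair.
using ColumnSpec = std::pair<std::string, std::string>;

// Strip leading and trailing blanks and tabs.
static std::string trim(const std::string &s) {
    const char *ws = " \t";
    std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Parse one or more "name:type[, name:type, ...]" strings.
// The strings are processed in order, so the result is their concatenation.
// Items without a colon are skipped (an assumption of this sketch).
std::vector<ColumnSpec> parseColumnSpecs(const std::vector<std::string> &specs) {
    std::vector<ColumnSpec> columns;
    for (const std::string &spec : specs) {
        std::size_t start = 0;
        while (start <= spec.size()) {
            std::size_t comma = spec.find(',', start);
            if (comma == std::string::npos) comma = spec.size();
            const std::string item = trim(spec.substr(start, comma - start));
            start = comma + 1;
            const std::size_t colon = item.find(':');
            if (colon == std::string::npos) continue;
            columns.emplace_back(trim(item.substr(0, colon)),
                                 trim(item.substr(colon + 1)));
        }
    }
    return columns;
}
```

Passing the strings "a:int, b:float" and "c:short" yields three pairs in the order a, b, c, which mirrors how two -m options would be combined.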
The executable ardea supports a subset of the data types that are
supported by query processing functions. The following is a complete
list of data types supported by ardea:
byte: 8-bit signed integer.
short: 16-bit signed integer.
int: 32-bit signed integer.
long: 64-bit signed integer.
unsigned byte: 8-bit unsigned integer.
unsigned short: 16-bit unsigned integer.
unsigned int: 32-bit unsigned integer.
unsigned long: 64-bit unsigned integer.
float: 32-bit IEEE floating-point values.
double: 64-bit IEEE floating-point values.
key: String values with a small number of distinct choices.
text: Arbitrary string values.
ardea tests only the first letter in these cases. The first letter can
be in either upper or lower case. Therefore the example given in
quickstart.html can be shortened slightly as
examples/ardea -d tmp -m "a:i,b:f,c:s" -t tests/test0.csv
NOTE: String values containing blank spaces must be quoted because a blank space is treated as a delimiter as well.
NOTE: Quotes inside strings can be escaped with backslash.
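The delimiter and quoting rules described above can be sketched as a small tokenizer. This is an illustration only, not the code used by ardea: the function name splitRow is hypothetical, the default delimiter set of comma and blank is taken from the notes above, and the use of double quotes as the quoting character is an assumption.

```cpp
#include <string>
#include <vector>

// Split one line of text into fields.
// - Any character in `delims` separates fields; a run of consecutive
//   delimiters counts as a single delimiter.
// - A field that starts with a double quote runs to the matching quote,
//   so it may contain delimiters.
// - Inside a quoted field, a backslash makes the next character literal.
std::vector<std::string> splitRow(const std::string &line,
                                  const std::string &delims = ", ") {
    std::vector<std::string> fields;
    const std::size_t n = line.size();
    std::size_t i = 0;
    while (i < n) {
        // Skip a run of delimiters.
        while (i < n && delims.find(line[i]) != std::string::npos) ++i;
        if (i >= n) break;
        std::string field;
        if (line[i] == '"') {
            ++i;  // skip the opening quote
            while (i < n && line[i] != '"') {
                if (line[i] == '\\' && i + 1 < n) ++i;  // escaped character
                field += line[i++];
            }
            if (i < n) ++i;  // skip the closing quote
        } else {
            while (i < n && delims.find(line[i]) == std::string::npos)
                field += line[i++];
        }
        fields.push_back(field);
    }
    return fields;
}
```

One consequence of collapsing consecutive delimiters is that an empty unquoted field between two commas cannot be expressed; in this sketch the two commas simply act as one separator.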
If the data directory specified in the -d option already contains some
data, the new rows are appended to the existing records. If there is any
mismatch in column names, NULL values will be used to pad the rows. If
there is any mismatch in the data types, the new data type will be used
and the existing records will be left unchanged. Only differences
between signed and unsigned integers are allowed; other mismatches will
be flagged and nothing will be written.
The CSV files in directory tests are not exactly standard CSV files;
they contain an extra header line in each file. The command ardea skips
these lines because they cannot be properly parsed into the data types
specified. These header lines are there to help the program named
readcsv in directory tests. In most cases, the CSV files produced by
other systems will not have this extra header line. We mention the
program readcsv because it could serve as an example for those who want
to write their own program to read data for FastBit. The source code for
readcsv is tests/readcsv.cpp.
-part.txt. For example, after building the indexes in the directory tmp
generated by the above command, we have the following files,
-rw-r--r-- 1 kwu Users  402 Aug 3 20:35 -part.txt
-rw-r--r-- 1 kwu Users  400 Aug 3 20:35 a
-rw-r--r-- 1 kwu Users 3520 Aug 4 23:14 a.idx
-rw-r--r-- 1 kwu Users  400 Aug 3 20:35 b
-rw-r--r-- 1 kwu Users 3520 Aug 4 23:14 b.idx
-rw-r--r-- 1 kwu Users  200 Aug 3 20:35 c
-rw-r--r-- 1 kwu Users 3520 Aug 4 23:14 c.idx
In this listing, we see three files with the column names we've given on
the command line: a, b and c. The remaining files are the metadata file
-part.txt and three index files named a.idx, b.idx and c.idx. Each index
file contains all information necessary to reconstruct a bitmap index,
including the bitmaps and the key values associated with each bitmap. As
described in LBNL-62756, the bitmaps and the key values are densely
packed. This design allows us to answer typical range queries
efficiently. The specific details of each bitmap index are documented
with the function write in each index class.
In addition to the above files, there are a few other files that FastBit uses. If one plans to share the FastBit data directory with other applications, make sure none of them use the same names as FastBit.
-part.txt: the metadata file for a data partition. During certain
operations, FastBit may store additional information into this file. If
you have a pre-release version of FastBit, this file may be called
table.tdc.
colname: name of the data file for a column. Note that column names
cannot contain a dot (.) or a space (see the full list of restrictions).
colname.idx: name of the index file associated with column colname.
There can be one index for each column following this naming convention.
colname.msk: the bit vector containing the mask for null values of the
column. The ith bit of this vector is set to 1 if the value of this
column is not null.
colname.bin: a reordered version of the values. When values are binned,
it is possible to generate this file to speed up certain search
operations.
colname.sp: starting positions of strings in the raw data file of a
string-valued column. Note that the raw string values are stored one
after another along with their null terminators. This file contains
information to speed up reading a specific string from the raw data
file.
colname.tdlist: the term-document list used to build the index
ibis::keywords. In this context, each row of this column is assumed to
be a text document.
colname.terms: terms defined in colname.tdlist. This file is always used
together with colname.idx.
colname.dic: the dictionary of a low cardinality string-valued column.
The low cardinality string-valued columns, also known as "categorical
values", are essentially treated as an integer column by translating
each distinct string value to an integer.
colname.int: the integer version of the categorical values (translated
through the dictionary).
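The idea behind the dictionary and the integer version of a categorical column can be shown with a minimal sketch. It only demonstrates the translation of distinct strings to integers; it does not reproduce the on-disk layout of the .dic or .int files. The names EncodedColumn and dictionaryEncode are hypothetical, and numbering the strings from 0 in order of first appearance is an assumption of this sketch.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory result of encoding one categorical column.
struct EncodedColumn {
    std::vector<std::string> dictionary;  // code -> distinct string
    std::vector<std::uint32_t> codes;     // one code per row
};

// Translate each distinct string value to an integer code.
// Codes are assigned in order of first appearance, starting at 0.
EncodedColumn dictionaryEncode(const std::vector<std::string> &values) {
    EncodedColumn out;
    std::map<std::string, std::uint32_t> lookup;
    for (const std::string &v : values) {
        std::map<std::string, std::uint32_t>::iterator it = lookup.find(v);
        if (it == lookup.end()) {
            const std::uint32_t code =
                static_cast<std::uint32_t>(out.dictionary.size());
            it = lookup.insert(std::make_pair(v, code)).first;
            out.dictionary.push_back(v);
        }
        out.codes.push_back(it->second);
    }
    return out;
}
```

With a small number of distinct strings, the per-row codes behave like an ordinary integer column, which is why such a column can be handled in the same way as integers.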
To convert data from another format to be used by FastBit, the key data
files to produce are -part.txt and the binary data files for columns.
Next, we describe the content of these files.
Unless your data happens to be in vertically organized binary files, you will have to do a conversion. The conversion process reads the data in its original format and produces a copy in raw binary form, with each column going to its own file. The following are all the necessary requirements.
The ith data value of each data file forms the ith row of the table.
Therefore, they must all be from the ith row.
A metadata file named -part.txt that describes the data files in the
directory. The required information includes
We mentioned two example programs for converting ASCII data into
vertically partitioned raw binary data files before, tests/readcsv.cpp
and examples/readcsv.cpp. One may follow these examples to develop new
ones.
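The core of such a conversion, writing each column to its own raw binary file, can be sketched without any FastBit code. The helper names writeColumn and readColumn are hypothetical. The sketch assumes fixed-width values written back to back in the byte order of the machine doing the conversion; it does not write the -part.txt file, which is still required.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Write the values of one column back to back in raw binary form.
// T must be a fixed-width arithmetic type such as std::int32_t or double.
// Returns true if every byte was written.
template <typename T>
bool writeColumn(const std::string &path, const std::vector<T> &values) {
    std::ofstream out(path.c_str(), std::ios::binary | std::ios::trunc);
    if (!out) return false;
    if (!values.empty())
        out.write(reinterpret_cast<const char *>(values.data()),
                  static_cast<std::streamsize>(values.size() * sizeof(T)));
    out.flush();
    return static_cast<bool>(out);
}

// Read a column file back, mainly to check the round trip.
// Returns an empty vector if the file cannot be opened.
template <typename T>
std::vector<T> readColumn(const std::string &path) {
    std::vector<T> values;
    std::ifstream in(path.c_str(), std::ios::binary | std::ios::ate);
    if (!in) return values;
    const std::streamoff bytes = in.tellg();  // opened at end: file size
    values.resize(static_cast<std::size_t>(bytes) / sizeof(T));
    in.seekg(0);
    if (!values.empty())
        in.read(reinterpret_cast<char *>(values.data()),
                static_cast<std::streamsize>(values.size() * sizeof(T)));
    return values;
}
```

To keep the rows aligned, every column vector passed to writeColumn for one partition should have the same length, so that the ith value in each file belongs to the ith row.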
dataDirectory = full-path-to-the-data-directory
backupDirectory = full-path-to-the-backup-directory
A sample configuration file can be found at the end of this document.
Typically, the dataDirectory is a parent directory that contains the
directories for different partitions, where each partition is stored in
its own directory. In particular, if a new partition is created, it is
created as a new directory under the dataDirectory.
If a configuration file is not specified, the new data will appear in a
directory named .ibis in the current working directory.
ibis command line tool
With the ibis command line tool, these three steps can be performed with
one invocation using the -a option. This option takes one mandatory
argument, which is the name of the directory containing the data to be
appended. It also takes two optional arguments in the form of
to table_name, where table_name is the name of the table to which the
new data is to be appended. The word "to" is used to make the command
line slightly more friendly.
ibis::part API
The function to append new data is called append. It has the following
definition.
int ibis::part::append(const char *newDataDir);
This function returns the number of rows appended if it is successful; otherwise it returns a negative value as an error code.
The function to perform the integrity check is called selfTest, which is
defined as follows,
int ibis::part::selfTest(int numThreads=0, const char *prefix=0) const;
This function performs a set of predefined tests and returns the number of failures encountered. The function may take two arguments. The first is the number of threads to be used and the second is a prefix used to get extra control information from the configuration files. Both arguments to this function are optional.
If the integrity tests pass, one may commit the append operation by
calling commit, which has the following definition.
int ibis::part::commit(const char *newDataDir);
Similar to function append, this function returns the number of rows
appended or a negative error code.
NOTE: The argument to commit should be the same as the argument to
append.
If the append operation failed the integrity tests, one may roll back
the append operation by calling rollback.
int ibis::part::rollback();
This function returns either the number of rows removed or a negative number indicating an error.
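The sequence of append, selfTest, and then commit or rollback can be sketched as a small function template. To keep the example compilable without FastBit, it works on any type that offers the four member functions quoted above; with FastBit installed, an ibis::part object is the intended argument. The names appendChecked and FakePart are hypothetical, FakePart is only a stand-in for exercising the control flow, and returning -1 after a successful rollback is a convention of this sketch, not of FastBit.

```cpp
#include <iostream>

// Append the data in newDataDir, run the integrity check, then commit on
// success or roll back on failure.
// Returns the value of commit() on success, or a negative value otherwise.
template <typename Part>
int appendChecked(Part &part, const char *newDataDir) {
    const int nrows = part.append(newDataDir);
    if (nrows < 0) {
        std::cerr << "append failed with error code " << nrows << "\n";
        return nrows;
    }
    const int failures = part.selfTest();  // both optional arguments omitted
    if (failures != 0) {
        std::cerr << "selfTest reported " << failures
                  << " failure(s); rolling back\n";
        const int removed = part.rollback();
        return removed < 0 ? removed : -1;  // -1: sketch-specific convention
    }
    // The argument to commit is the same as the argument to append.
    return part.commit(newDataDir);
}

// Hypothetical stand-in with the same member function shapes as the
// ibis::part functions quoted in the text; it holds no real data.
struct FakePart {
    int pending;
    int rows;
    bool healthy;
    FakePart() : pending(0), rows(0), healthy(true) {}
    int append(const char *) { pending = 5; return pending; }
    int selfTest(int = 0, const char * = 0) const { return healthy ? 0 : 1; }
    int commit(const char *) { const int n = pending; rows += n; pending = 0; return n; }
    int rollback() { const int n = pending; pending = 0; return n; }
};
```

Keeping the directory name in one variable and passing it to both calls avoids the mismatch that the note about commit warns against.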
It is possible for ibis::part::append to work without the backup
directory. Of course, the functions commit and rollback would then do
nothing. If the append operation failed, one will have to do a manual
recovery. FastBit does not have a tool to automate this recovery at this
point in time.
BEGIN HEADER
DataSet.Name=testData
Number_of_rows=1000000
Number_of_columns=6
Table_State=1
END HEADER

BEGIN Column
name=i9
description=integers 0, 1, ..., and 9
data_type=Int
END Column

BEGIN Column
name=j1
description=integers 0 and 1
data_type=Unsigned
END Column

BEGIN Column
name=f0
description=float values 0 and 1
data_type=Float
END Column

BEGIN Column
name=d0
description=double 0, same as i0
data_type=Double
END Column

BEGIN Column
name=s1
description=26-lower case alphabets
data_type=Category
END Column

BEGIN Column
name=t1
description=26-lower case alphabets, same as s1
data_type=String
END Column
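A conversion program has to produce metadata of this shape itself. The following sketch formats the header and column blocks using only the keys that appear in the sample above, assuming one key=value pair per line. The names ColumnMeta and formatPartMetadata are hypothetical, Table_State=1 is copied from the sample without interpreting it, and FastBit may accept or require keys that this sketch does not cover.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical description of one column, limited to the keys in the sample.
struct ColumnMeta {
    std::string name;
    std::string description;
    std::string dataType;  // e.g. "Int", "Float", "Category" as in the sample
};

// Build the text of a metadata file in the layout of the sample.
std::string formatPartMetadata(const std::string &dataSetName,
                               unsigned long numRows,
                               const std::vector<ColumnMeta> &columns) {
    std::ostringstream out;
    out << "BEGIN HEADER\n"
        << "DataSet.Name=" << dataSetName << "\n"
        << "Number_of_rows=" << numRows << "\n"
        << "Number_of_columns=" << columns.size() << "\n"
        << "Table_State=1\n"  // value copied from the sample
        << "END HEADER\n";
    for (std::size_t i = 0; i < columns.size(); ++i) {
        out << "\nBEGIN Column\n"
            << "name=" << columns[i].name << "\n"
            << "description=" << columns[i].description << "\n"
            << "data_type=" << columns[i].dataType << "\n"
            << "END Column\n";
    }
    return out.str();
}
```

Deriving Number_of_columns from the column list, rather than passing it separately, keeps the header consistent with the column blocks that follow it.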
A sample ibis configuration file.
dataDirectory=/data/jwu/index
backupDirectory=/data/jwu/backup
CacheDirectory=/tmp/QECache
fileManager.maxBytes=300Mb
longTests=true
preferMMapIndex=1
Note: the source code distribution also includes two very small sets of data for testing. Some larger datasets are available online for download.
Warning: The number of rows in each data partition is recorded with a 32-bit integer, which limits the maximum number of rows in a data partition to 2^32 (about 4 billion). However, there are a number of further limitations on the number of rows that should be put in a data partition.