In quickstart.html, a brief introduction to loading data into FastBit was given. In this document, we give further details for those who want to understand the internal structures in order to make more extensive use of FastBit.
Sample data and usage examples
Recall that FastBit stores a data table in multiple partitions and each
partition is stored in one directory on the file system. Next, we will
first briefly describe how to use
ardea to create this
directory, then give more details of the files in this directory, and
finally end with some advice on how different functions can be used to
integrate data into existing tables.
Starting with ASCII form of data in the Comma-Separated Values (CSV)
format, there is a command line tool named
ardea that can
digest the data and create the raw binary files and metadata files for
FastBit. Very briefly, a CSV file contains values of a data table in
ASCII format with commas (and possibly white spaces) as delimiters. It
has all columns of a row on one line, and usually contains neither the
names nor the types of the columns. Therefore it is
necessary to supply this information from elsewhere. The executable
ardea has four options that deal with reading CSV data.
Only one -d option is expected; if several are specified, the last one overrides all previous ones.
-m name:type[, name:type, ...]
If multiple -m options are specified, they are concatenated together.
Multiple -t options may be used to specify multiple text files.
ardea processes text files before individual rows specified on the command line.
-m option, or the more verbose form used in '-part.txt' files.
-tags "name1=value1, name2=value2, ..."
-b "," explicitly. Consecutive appearances of delimiters are treated as a single delimiter. There is NO support for multi-character delimiters.
ardea supports a subset of the data types
that are supported by query processing functions. The following is a
complete list of data types supported by ardea:
byte: 8-bit signed integer.
short: 16-bit signed integer.
int: 32-bit signed integer.
long: 64-bit signed integer.
unsigned byte: 8-bit unsigned integer.
unsigned short: 16-bit unsigned integer.
unsigned int: 32-bit unsigned integer.
unsigned long: 64-bit unsigned integer.
float: 32-bit IEEE floating-point values.
double: 64-bit IEEE floating-point values.
key: String values with a small number of distinct choices.
text: Arbitrary string values.
ardea only tests the first letter in these cases. The first letter can be in either upper or lower case. Therefore the example given in quickstart.html can be shortened slightly as
examples/ardea -d tmp -m "a:i,b:f,c:s" -t tests/test0.csv
NOTE: String values containing blank spaces must be quoted because a blank space is assumed to be a delimiter as well.
NOTE: Quotes inside strings can be escaped with backslash.
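The delimiter and quoting rules described above can be illustrated with a small tokenizer. This is a minimal sketch and not the parser used by ardea: the function name splitFields is hypothetical, and it assumes that double quotes delimit quoted strings, that a backslash inside quotes makes the next character literal, and that any run of delimiter characters acts as one separator.

```cpp
#include <string>
#include <vector>

// Split one line of text into fields.
// Assumptions (not taken from the ardea source): fields may be wrapped in
// double quotes, a backslash inside quotes makes the next character literal,
// and any run of delimiter characters counts as a single separator.
std::vector<std::string> splitFields(const std::string& line,
                                     const std::string& delims = ", ") {
    std::vector<std::string> fields;
    std::size_t i = 0;
    const std::size_t n = line.size();
    while (i < n) {
        // Skip a run of consecutive delimiters.
        while (i < n && delims.find(line[i]) != std::string::npos) ++i;
        if (i >= n) break;
        std::string field;
        if (line[i] == '"') {
            ++i;  // skip the opening quote
            while (i < n && line[i] != '"') {
                if (line[i] == '\\' && i + 1 < n) ++i;  // keep the escaped character
                field += line[i++];
            }
            if (i < n) ++i;  // skip the closing quote
        } else {
            // Unquoted field: read up to the next delimiter.
            while (i < n && delims.find(line[i]) == std::string::npos)
                field += line[i++];
        }
        fields.push_back(field);
    }
    return fields;
}
```

With these rules, a line such as `1, 2.5,,  "hello world"` yields three fields, because the repeated delimiters collapse into one and the quoted string keeps its blank space. A consequence is that an empty field cannot be expressed by two adjacent delimiters.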
If the data directory specified in
-d option already
contains some data, the new rows are appended to the existing records.
If there is any mismatch in column names, NULL values will be used to
pad the rows. If there is any mismatch in the data types, the new
data type will be used and the existing records will be left unchanged.
Only differences between signed and unsigned integers are
allowed; other mismatches will be flagged and nothing will be written.
The CSV files in directory
tests are not exactly
standard CSV files; each of them contains an extra header line.
ardea skips these lines because they cannot be properly
parsed into the data types specified. These header lines are here to
help the program in directory
readcsv. In most cases, the CSV files produced by other
systems will not have this extra header line. We mention the program
readcsv because it could serve as an example for those who
want to write their own program to read data for FastBit. The source code
-part.txt. For example, after building the indexes in the directory
tmp generated by the above command, we have the following files,
-rw-r--r-- 1 kwu Users 402 Aug 3 20:35 -part.txt
-rw-r--r-- 1 kwu Users 400 Aug 3 20:35 a
-rw-r--r-- 1 kwu Users 3520 Aug 4 23:14 a.idx
-rw-r--r-- 1 kwu Users 400 Aug 3 20:35 b
-rw-r--r-- 1 kwu Users 3520 Aug 4 23:14 b.idx
-rw-r--r-- 1 kwu Users 200 Aug 3 20:35 c
-rw-r--r-- 1 kwu Users 3520 Aug 4 23:14 c.idx
In this listing, we see three data files with the column names we've given on the command line, a, b, and c, the metadata file -part.txt, and three index files named a.idx, b.idx, and c.idx. Each index file contains all information necessary to reconstruct a bitmap index, including the bitmaps and the key values associated with each bitmap. As described in LBNL-62756, the bitmaps and the key values are densely packed. This design allows us to answer typical range queries efficiently. The specific details of each bitmap index are documented with the function write in each index class.
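The sizes in the listing can be used as a quick plausibility check of the raw data files. The sketch below assumes that the abbreviations i, f, and s in the command above select int, float, and short, and that each raw data file holds fixed-width values with no header. The function names are hypothetical.

```cpp
#include <cstdint>

// Number of rows implied by a raw column file of `fileBytes` bytes whose
// elements are `elemBytes` wide. Returns -1 if the size is not a whole
// multiple of the element width, which suggests a truncated or mistyped file.
std::int64_t impliedRows(std::int64_t fileBytes, std::int64_t elemBytes) {
    if (elemBytes <= 0 || fileBytes < 0 || fileBytes % elemBytes != 0) return -1;
    return fileBytes / elemBytes;
}

// True when the three data files of the listing agree on the row count.
bool listingIsConsistent() {
    const std::int64_t a = impliedRows(400, 4);  // a: assumed 32-bit int
    const std::int64_t b = impliedRows(400, 4);  // b: assumed 32-bit float
    const std::int64_t c = impliedRows(200, 2);  // c: assumed 16-bit short
    return a > 0 && a == b && b == c;
}
```

Under these assumptions, 400 / 4 and 200 / 2 both give 100 rows, so the three columns describe the same number of records.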
In addition to the above files, there are a few other files that FastBit uses. If one plans to share the FastBit data directory with other applications, make sure none of them use the same names as FastBit.
-part.txt: the metadata file for a data partition. During certain operations, FastBit may store additional information into this file. If you have a pre-release version of FastBit, this file may be called
colname: name of the data file for a column. Note that column names can not contain dot (.) or space, see full list of restrictions.
colname.idx: name of the index file associated with column
colname. There can be one index for each column following this naming convention.
colname.msk: the bit vector containing the mask for null values of the column. The
ith bit of this vector is set to
1 if the value of this column is not null.
colname.bin: a reordered version of the values. When values are binned, it is possible to generate this file to speed up certain search operations.
colname.sp: starting positions of strings in the raw data file of a string-valued column. Note that the raw string values are stored one after another along with their null terminators. This file contains information to speed up reading a specific string from the raw data file.
colname.tdlist: the term-document list used to build index
ibis::keywords. In this context, each row of this column is assumed to be a text document.
colname.terms: terms defined in
colname.tdlist. This file is always used together with
colname.dic: the dictionary of a low cardinality string-valued column. The low cardinality string-valued columns, also known as "categorical values", are essentially treated as an integer column by translating each distinct string value to an integer.
colname.int: the integer version of the categorical values (translated through the dictionary).
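Two of the ideas in the list above, starting positions for null-terminated strings and dictionary translation of categorical values, can be sketched in memory as follows. This is an illustration only and does not reproduce the on-disk layout of the .sp, .dic, or .int files: all type and function names are hypothetical, and assigning codes from 1 in order of first appearance is an assumption of this sketch.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Raw bytes of a string column plus the starting position of each value.
struct StringColumn {
    std::string raw;                   // values back to back, each followed by '\0'
    std::vector<std::int64_t> starts;  // starts[i] = offset of row i within raw
};

StringColumn packStrings(const std::vector<std::string>& values) {
    StringColumn col;
    for (const std::string& v : values) {
        col.starts.push_back(static_cast<std::int64_t>(col.raw.size()));
        col.raw += v;
        col.raw.push_back('\0');  // null terminator stored with the value
    }
    return col;
}

// Read row i directly from its starting position (no bounds checking).
std::string readString(const StringColumn& col, std::size_t i) {
    return std::string(col.raw.data() + col.starts[i]);
}

// Dictionary-encoded form of a low-cardinality string column.
struct Encoded {
    std::vector<std::string> dictionary;  // code k (k >= 1) maps to dictionary[k - 1]
    std::vector<std::uint32_t> codes;     // one integer code per row
};

Encoded encodeCategories(const std::vector<std::string>& values) {
    Encoded out;
    std::map<std::string, std::uint32_t> seen;
    for (const std::string& v : values) {
        auto it = seen.find(v);
        if (it == seen.end()) {
            // First appearance: append to the dictionary and assign the next code.
            out.dictionary.push_back(v);
            it = seen.emplace(v, static_cast<std::uint32_t>(out.dictionary.size())).first;
        }
        out.codes.push_back(it->second);
    }
    return out;
}
```

The starting positions make it possible to fetch one string without scanning all preceding values, and the integer codes let a string-valued column be handled like an integer column.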
To convert data from another format to be used by FastBit, the key data
files to produce are
-part.txt and the binary data files
for columns. Next, we describe the content of these files.
Unless your data happen to be in vertically organized binary files, you will have to do a conversion. The conversion process basically reads the data in its original format and produces a copy in raw binary form, with each column going to its own file. The following are all the necessary requirements.
ith data value of each data file forms the
ith row of the table. Therefore, they must be all from the
-part.txt that describes the data files in the directory. The required information includes
We mentioned two example programs for converting ASCII data into
vertically partitioned raw binary data files before,
ardea and readcsv. One may follow these examples to develop a new one.
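The conversion described above, one raw binary file per column plus a -part.txt file, might look roughly like the following. This is a minimal sketch under stated assumptions and not a replacement for ardea or readcsv: the function and type names are hypothetical, the metadata keys are copied from the sample -part.txt at the end of this document, values are written in the byte order of the machine running the code, and whether this subset of keys is sufficient for FastBit is not verified here.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Write the values of one column as raw binary, in row order, with no header.
// The i-th value written becomes the value of this column in row i.
template <typename T>
bool writeColumn(const std::string& dir, const std::string& name,
                 const std::vector<T>& values) {
    std::ofstream out(dir + "/" + name, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(T)));
    return static_cast<bool>(out);
}

// Name and type of one column; type strings such as "Int" or "Double"
// follow the spelling used in the sample -part.txt.
struct ColumnInfo {
    std::string name;
    std::string type;
};

// Write a metadata file using keys that appear in the sample -part.txt.
bool writeMetadata(const std::string& dir, const std::string& tableName,
                   std::size_t rows, const std::vector<ColumnInfo>& cols) {
    std::ofstream out(dir + "/-part.txt", std::ios::trunc);
    if (!out) return false;
    out << "BEGIN HEADER\n"
        << "DataSet.Name=" << tableName << "\n"
        << "Number_of_rows=" << rows << "\n"
        << "Number_of_columns=" << cols.size() << "\n"
        << "END HEADER\n";
    for (const ColumnInfo& c : cols) {
        out << "\nBEGIN Column\n"
            << "name=" << c.name << "\n"
            << "data_type=" << c.type << "\n"
            << "END Column\n";
    }
    return static_cast<bool>(out);
}
```

Every column must be written with the same number of values, in the same row order, so that the ith value of each file belongs to the ith row.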
dataDirectory = full-path-to-the-data-directory
backupDirectory = full-path-to-the-backup-directory
A sample configuration file can be found at the end of this document. Typically, the
dataDirectory is a parent directory that contains the directories for different partitions, where each partition is stored in its own directory. In particular, if a new partition is created, it is created as a new directory under the dataDirectory.
If a configuration file is not specified, the new data will appear in a
.ibis in the current working directory.
ibis command line tool
With the ibis command line tool, these three steps can be performed with one invocation using the -a option. This option takes one mandatory argument, which is the name of the directory containing the data to be appended. It also takes two optional arguments in the form of to table_name, where table_name is the name of the table to which the new data is to be appended. The word to is used to make the command line slightly more friendly.
append. It has the following definition.
int ibis::part::append(const char *newDataDir);
This function returns the number of rows appended if it is successful; otherwise it returns a negative value as an error code.
The function to perform the integrity check is called
selfTest, which is defined as follows,
int ibis::part::selfTest(int numThreads=0, const char *prefix=0) const;
This function performs a set of predefined tests and returns the number of failures encountered. The function may take two arguments. The first is the number of threads to be used and the second is a prefix used to get extra control information from the configuration files. Both arguments to this function are optional.
If the integrity tests went fine, one may commit the append operation by calling
commit, which has the following definition.
int ibis::part::commit(const char *newDataDir);
Similar to function
append, this function returns the number of rows appended or a negative error code.
NOTE: The argument to
commit should be the
same as the argument to append.
If the append operation failed the integrity tests, one may roll back the
append operation by calling rollback.
int ibis::part::rollback();
This function returns either the number of rows removed or a negative number indicating an error.
ibis::part::append to work without the backup directory. Of course, the function
rollback would do nothing. If the
append operation failed, one will have to do a manual recovery. FastBit does not have a tool to automate this recovery at this point in time.
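The append, test, and commit-or-rollback sequence described above can be written as one helper. The sketch below is a template so that it compiles without the FastBit headers: Part stands for any type offering member functions with the signatures quoted above, with ibis::part as the intended candidate. The helper name appendChecked and the value -1 returned after a rollback are choices of this sketch, not part of FastBit.

```cpp
// Append the data in newDataDir to a partition, run the integrity tests,
// then commit on success or roll back on failure.
// Part is assumed to provide append, selfTest, commit and rollback with the
// signatures quoted in this document.
template <class Part>
int appendChecked(Part& part, const char* newDataDir) {
    const int appended = part.append(newDataDir);
    if (appended < 0) return appended;  // append reported an error; pass its code on
    if (part.selfTest() != 0) {         // selfTest returns the number of failures
        part.rollback();
        return -1;                      // error value chosen by this sketch
    }
    // commit takes the same argument that was given to append
    return part.commit(newDataDir);
}
```

A caller would treat a non-negative return value as the number of rows committed. Without a backup directory, rollback does nothing, so a failed integrity test in that setting still requires manual recovery.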
BEGIN HEADER
DataSet.Name=testData
Number_of_rows=1000000
Number_of_columns=6
Table_State=1
END HEADER

BEGIN Column
name=i9
description=integers 0, 1, ..., and 9
data_type=Int
END Column

BEGIN Column
name=j1
description=integers 0 and 1
data_type=Unsigned
END Column

BEGIN Column
name=f0
description=float values 0 and 1
data_type=Float
END Column

BEGIN Column
name=d0
description=double 0, same as i0
data_type=Double
END Column

BEGIN Column
name=s1
description=26-lower case alphabets
data_type=Category
END Column

BEGIN Column
name=t1
description=26-lower case alphabets, same as s1
data_type=String
END Column
ibis configuration file.
dataDirectory=/data/jwu/index
backupDirectory=/data/jwu/backup
CacheDirectory=/tmp/QECache
fileManager.maxBytes=300Mb
longTests=true
preferMMapIndex=1
Note: the source code distribution also includes two very small sets of data for testing. Some larger datasets are available online; follow this link to download these samples.
Warning: The number of rows in each data partition is recorded with a 32-bit integer, which limits the maximum number of rows in a data partition to 2^32 (~ 4 billion). However, there are a number of further limitations on the number of rows that should be put in a data partition.