How to load data

Overview

In quickstart.html, a brief instruction for loading data into FastBit was given. In this document, we give further details for those who want to understand the internal structures in order to make more extensive use of FastBit.

Sample data and usage examples

Recall that FastBit stores a data table in multiple partitions and each partition is stored in one directory on the file system. We first briefly describe how to use ardea to create this directory, then give more details about the files in this directory, and finally offer some advice on how different functions can be used to integrate data into existing tables.

Using Existing Command Line Tools

Starting with data in ASCII form in the Comma-Separated Values (CSV) format, the command line tool ardea can digest the data and create the raw binary files and metadata files for FastBit. Very briefly, a CSV file contains the values of a data table in ASCII format with commas (and possibly white space) as delimiters. It has all columns of a row on one line, and usually contains neither the names nor the types of the columns. Therefore it is necessary to supply this information from elsewhere. The executable ardea has the following options that deal with reading CSV data.

-d output-dir
Specifies the output directory to contain the generated data partition. Only one -d option is expected; if several are specified, the last one overrides all previous ones.
-n name-of-partition
Specifies the name of the data partition. If a name is not specified, ardea keeps the name of the existing data in the output directory; if the output directory is empty, it uses the directory name. If the directory name is '.' or '..', it uses a digest of the time stamp and size information as the name of the data partition.
-m name:type[, name:type, ...]
Specifies the names and types of the columns. The names and types given here must be in the same order as the columns in the CSV file. When multiple -m options are specified, they are concatenated.
-t text-filename
Specifies the name of the text file to be read. Multiple -t options may be used to specify multiple text files.
-r a-row-in-text-form
Specifies a single row of a table in ASCII form. Multiple rows may be specified by using multiple -r options. In ardea, text files are processed before the individual rows specified by option -r.
-M metadatafilename
Specifies a file containing the names and types. The format of this metadata file can be either simple name:type pairs as required by the -m option, or the more verbose form used in '-part.txt' files.
-tags "name1=value1, name2=value2, ..."
Specifies optional name-value pairs to be associated with the dataset. These name-value pairs are also called meta tags in the FastBit source code. The names used in meta tags can also be used in query expressions, in which case they are equivalent to columns with a single text value (i.e., a categorical value).
-b delimiters
Specifies the delimiters to use when parsing fields in the CSV files. When any one of the delimiters is encountered, the current field being processed is terminated. By default, the delimiters are any type of space or the comma, which means that a space can terminate a field just as a comma can. To use the comma alone, specify "-b ," explicitly. Consecutive appearances of delimiters are treated as a single delimiter. There is NO support for multi-character delimiters.

Regarding column names, FastBit imposes the following restrictions. The same restrictions also apply to names of data partitions.

Column names must be composed of alphanumeric characters plus underscores, parentheses, and brackets.
Column names must start with a letter or an underscore.
Column names are case-insensitive.

Following the above specification, some examples of valid column names are "col1", "c0ll" and "c01l" (clearly they are too confusing and only one of them should be used). Some examples of invalid names are "6e5", "e-f", and "e.f". Note also that columns named "name" and "NaMe" are treated as a single column.
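
The naming rules above can be checked mechanically. The following is a minimal Python sketch, not FastBit's own validation code: the function names `is_valid_name` and `same_column` are made up for this illustration, and the regular expression is one reading of the three rules (it assumes "brackets" means square brackets).

```python
import re

# One leading letter or underscore, followed by letters, digits, underscores,
# parentheses, or square brackets. This pattern is an interpretation of the
# documented rules, not a copy of FastBit's internal check.
_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_()\[\]]*")


def is_valid_name(name: str) -> bool:
    """Return True if name satisfies the naming rules as read above."""
    return _NAME_PATTERN.fullmatch(name) is not None


def same_column(a: str, b: str) -> bool:
    """Names are case-insensitive, so compare them after lowercasing."""
    return a.lower() == b.lower()


if __name__ == "__main__":
    for candidate in ["col1", "c0ll", "c01l", "6e5", "e-f", "e.f"]:
        print(candidate, is_valid_name(candidate))
    print(same_column("name", "NaMe"))
```

Running such a check before calling ardea catches names like "e-f" early, before any file is written.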

The executable ardea supports a subset of the data types that are supported by the query processing functions. The following is a complete list of the data types supported by ardea:

byte: 8-bit signed integer.
short: 16-bit signed integer.
int: 32-bit signed integer.
long: 64-bit signed integer.
unsigned byte: 8-bit unsigned integer.
unsigned short: 16-bit unsigned integer.
unsigned int: 32-bit unsigned integer.
unsigned long: 64-bit unsigned integer.
float: 32-bit IEEE floating-point values.
double: 64-bit IEEE floating-point values.
key: String values with a small number of distinct choices.
text: Arbitrary string values.

Apart from the unsigned integer types, every type starts with a different letter, and ardea only tests the first letter in these cases. The first letter can be in either upper or lower case. Therefore the example given in quickstart.html can be shortened slightly to

examples/ardea -d tmp -m "a:i,b:f,c:s" -t tests/test0.csv
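
When the column list is generated by a script, the same command line can be assembled programmatically. The sketch below builds the argument list for the options described above; the helper names `column_spec` and `ardea_command` are hypothetical, the path examples/ardea is taken from the example, and the full type names int, float and short are inferred from the abbreviations i, f and s. Unsigned types are left spelled out because this document does not say how they may be abbreviated.

```python
import shlex


def column_spec(columns):
    """Build the value of -m from (name, type) pairs.

    Types other than the unsigned ones are reduced to their first letter,
    which is all that ardea is described as testing.
    """
    parts = []
    for name, type_name in columns:
        if type_name.lower().startswith("unsigned"):
            parts.append(f"{name}:{type_name}")
        else:
            parts.append(f"{name}:{type_name[0].lower()}")
    return ",".join(parts)


def ardea_command(output_dir, columns, csv_files, ardea="examples/ardea"):
    """Return an argument list suitable for subprocess.run."""
    argv = [ardea, "-d", output_dir, "-m", column_spec(columns)]
    for path in csv_files:
        argv += ["-t", path]  # one -t option per text file
    return argv


if __name__ == "__main__":
    argv = ardea_command(
        "tmp",
        [("a", "int"), ("b", "float"), ("c", "short")],
        ["tests/test0.csv"],
    )
    print(shlex.join(argv))
```

Passing the list to `subprocess.run` avoids shell quoting problems when directory or file names contain spaces.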

NOTE: String values containing blank spaces must be quoted because a blank space is assumed to be a delimiter as well.

NOTE: Quotes inside strings can be escaped with backslash.
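
The delimiter and quoting rules can be illustrated with a small tokenizer. This is a sketch of the behavior described in this section, not ardea's actual parser: the function name `split_fields` is hypothetical, and accepting both single and double quotes is an assumption.

```python
DEFAULT_DELIMITERS = " \t\r\n,"  # any space or the comma


def split_fields(line, delimiters=DEFAULT_DELIMITERS):
    """Split one line of text into fields.

    A run of consecutive delimiters counts as a single delimiter, quoted
    strings may contain delimiters, and a backslash inside a quoted string
    escapes the next character.
    """
    fields = []
    i, n = 0, len(line)
    while i < n:
        # Skip a run of delimiters.
        while i < n and line[i] in delimiters:
            i += 1
        if i >= n:
            break
        chars = []
        if line[i] in "\"'":
            quote = line[i]
            i += 1
            while i < n and line[i] != quote:
                if line[i] == "\\" and i + 1 < n:
                    i += 1  # keep the escaped character literally
                chars.append(line[i])
                i += 1
            i += 1  # step over the closing quote
        else:
            while i < n and line[i] not in delimiters:
                chars.append(line[i])
                i += 1
        fields.append("".join(chars))
    return fields


if __name__ == "__main__":
    print(split_fields('1, 2.5,  "hello world"'))
    print(split_fields("1 2,3", delimiters=","))
```

Note that under these rules an empty field between two commas cannot be expressed, since the two commas collapse into one delimiter.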

If the data directory specified with the -d option already contains some data, the new rows are appended to the existing records. If the column names do not match, NULL values are used to pad the rows. If the data types do not match, the new data type is used and the existing records are left unchanged. Only differences between signed and unsigned integers are allowed; other mismatches are flagged and nothing is written.
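
The padding rule for mismatched column names can be pictured with an in-memory model. The sketch below is only a toy: it represents a partition as a dictionary from column name to a list of values, uses None for NULL, assumes that padding applies both to old rows (for columns that are new) and to new rows (for columns that are missing), and ignores data types entirely. The function name `append_columns` is made up for this illustration.

```python
def append_columns(existing, incoming):
    """Append incoming rows to existing ones, padding with None.

    Both arguments map column names (assumed already lowercased) to lists
    of values of equal length.
    """
    old_rows = len(next(iter(existing.values()), []))
    new_rows = len(next(iter(incoming.values()), []))
    names = list(existing) + [c for c in incoming if c not in existing]
    merged = {}
    for name in names:
        old_part = existing.get(name, [None] * old_rows)
        new_part = incoming.get(name, [None] * new_rows)
        merged[name] = old_part + new_part
    return merged


if __name__ == "__main__":
    table = {"a": [1, 2], "b": [0.5, 1.5]}
    extra = {"a": [3], "c": [7]}
    print(append_columns(table, extra))
```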

The CSV files in directory tests are not exactly standard CSV files: each contains an extra header line. The command ardea skips these lines because they cannot be parsed into the data types specified. The header lines are there to help the program readcsv in directory tests. In most cases, CSV files produced by other systems will not have this extra header line. We mention readcsv because it can serve as an example for those who want to write their own program to read data for FastBit. The source code for readcsv is tests/readcsv.cpp.

Files in a Data Partition

In a directory containing a data partition, there is a file for each column and a metadata file named -part.txt. For example, after building the indexes in the directory tmp generated by the above command, we have the following files:

-rw-r--r-- 1 kwu Users  402 Aug  3 20:35 -part.txt
-rw-r--r-- 1 kwu Users  400 Aug  3 20:35 a
-rw-r--r-- 1 kwu Users 3520 Aug  4 23:14 a.idx
-rw-r--r-- 1 kwu Users  400 Aug  3 20:35 b
-rw-r--r-- 1 kwu Users 3520 Aug  4 23:14 b.idx
-rw-r--r-- 1 kwu Users  200 Aug  3 20:35 c
-rw-r--r-- 1 kwu Users 3520 Aug  4 23:14 c.idx

In this listing, we see three data files named after the columns given on the command line, a, b and c, together with the metadata file -part.txt and three index files named a.idx, b.idx and c.idx. Each index file contains all information necessary to reconstruct a bitmap index, including the bitmaps and the keyvalues associated with each bitmap. As described in LBNL-62756, the bitmaps and the keyvalues are densely packed. This design allows us to answer typical range queries efficiently. The specific details of each bitmap index are documented with the function write in each index class.
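
The sizes of the raw data files in the listing follow directly from the element widths of the types listed earlier. Assuming the listing was produced by the command above with a:i, b:f and c:s (int, float and short), each file holds the same number of rows. The sketch below performs that arithmetic; the table `WIDTHS` and the function `rows_in_file` are hypothetical helpers, not part of FastBit.

```python
# Bytes per value for the fixed-width types supported by ardea.
WIDTHS = {"byte": 1, "short": 2, "int": 4, "long": 8, "float": 4, "double": 8}


def rows_in_file(size_in_bytes, type_name):
    """Return the number of values stored in a raw column file."""
    # "unsigned int" has the same width as "int", and so on.
    base = type_name.lower().replace("unsigned", "").strip()
    width = WIDTHS[base]
    if size_in_bytes % width != 0:
        raise ValueError("file size is not a multiple of the element width")
    return size_in_bytes // width


if __name__ == "__main__":
    for name, size, type_name in [("a", 400, "int"), ("b", 400, "float"),
                                  ("c", 200, "short")]:
        print(name, rows_in_file(size, type_name))
```

A check like this is a cheap way to confirm that all column files of a partition agree on the number of rows before an index is built.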

In addition to the above files, there are a few other files that FastBit uses. If one plans to share the FastBit data directory with other applications, make sure none of them use the same names as FastBit.

-part.txt: the metadata file for a data partition. During certain operations, FastBit may store additional information into this file. If you have a pre-release version of FastBit, this file may be called table.tdc.
colname: name of the data file for a column. Note that column names cannot contain a dot (.) or a space; see the full list of restrictions above.
colname.idx: name of the index file associated with column colname. There can be one index for each column following this naming convention.
colname.msk: the bit vector containing the mask for null values of the column. The ith bit of this vector is set to 1 if the value of this column is not null.
colname.bin: a reordered version of the values. When values are binned, it is possible to generate this file to speed up certain search operations.
colname.sp: starting positions of strings in the raw data file of a string-valued column. Note that the raw string values are stored one after another along with their null terminators. This file contains information to speed up reading a specific string from the raw data file.
colname.tdlist: the term-document list used to build the index ibis::keywords. In this context, each row of this column is assumed to be a text document.
colname.terms: terms defined in colname.tdlist. This file is always used together with colname.idx.
colname.dic: the dictionary of a low cardinality string-valued column. The low cardinality string-valued columns, also known as "categorical values", are essentially treated as an integer column by translating each distinct string value to an integer.
colname.int: the integer version of the categorical values (translated through the dictionary).
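
Two of the ideas in this list, the starting positions of null-terminated strings and the translation of categorical values into integers, can be shown in a few lines. The sketch below works in memory only and makes no claim about the on-disk layout of the .sp, .dic or .int files; the function names are hypothetical, and numbering the codes from 0 in order of first appearance is an assumption.

```python
def string_starts(values):
    """Byte offset at which each string begins when the strings are stored
    one after another, each followed by a null terminator."""
    starts, offset = [], 0
    for value in values:
        starts.append(offset)
        offset += len(value.encode("utf-8")) + 1  # +1 for the terminator
    return starts


def encode_categories(values):
    """Translate each distinct string into an integer code.

    Returns (dictionary, codes), where dictionary[code] is the string.
    """
    dictionary, lookup, codes = [], {}, []
    for value in values:
        if value not in lookup:
            lookup[value] = len(dictionary)
            dictionary.append(value)
        codes.append(lookup[value])
    return dictionary, codes


if __name__ == "__main__":
    print(string_starts(["ab", "", "xyz"]))
    print(encode_categories(["red", "blue", "red", "green"]))
```

With the offsets available, the ith string can be read by seeking to its starting position instead of scanning the raw file from the beginning.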

Most of these files are output files used by FastBit. If a file with the same name exists in the directory for the data partition, it is likely to be overwritten or removed by FastBit.

To convert data from another format to be used by FastBit, the key data files to produce are -part.txt and the binary data files for columns. Next, we describe the content of these files.

Conversion to binary data

Overall, FastBit views user data as tables consisting of rows and columns. This is similar to most common database systems. What is different is that it organizes each column of the data in its own binary file. In contrast, most existing database management systems organize the data of each row together. A typical layout of a data table on paper shows the rows horizontally and the columns vertically; thus the common data layout is known as the horizontal organization and our column-oriented organization is known as the vertical organization. The vertical organization ensures that only homogeneous data is in a file, which reduces access overhead in many data warehousing applications.

Unless your data happens to be in vertically organized binary files, you will have to do a conversion. The conversion process reads the data in its original format and produces a copy in raw binary form, with each column going to its own file. The following are all the necessary requirements.

All files to be appended to one data partition shall be in one directory.
The ith data value of each data file belongs to the ith row of the table. Therefore, the values at position i in all files must come from the same row.
Write all data values in raw binary format. String values are to be written as raw bytes with null terminators, and an empty string must still have a null terminator.
Provide an ASCII file named -part.txt that describes the data files in the directory. The required information includes
- Number of rows in the directory.
- Number of columns in the directory.
- Names of the columns.
- Data types of the columns.
Currently supported data types are given above.
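
The requirements above can be put together in a short script. The following Python sketch writes one raw binary file per column and a -part.txt file that uses the keys shown in the sample at the end of this document. It is an illustration under stated assumptions, not a replacement for readcsv or ardea: the function name `write_partition` is hypothetical, values are written in the native byte order of the machine (an assumption), only the Int, Double, Category and String types are handled, and the optional description and Table_State lines of the sample are omitted.

```python
import os
import struct
import tempfile

# struct codes for two numeric types; "=" below selects native byte order
# with standard sizes (4 bytes for "i", 8 bytes for "d").
NUMERIC_FORMATS = {"Int": "i", "Double": "d"}
STRING_TYPES = {"Category", "String"}


def write_partition(directory, partition_name, columns):
    """Write one binary file per column plus the metadata file -part.txt.

    columns is a list of (column_name, data_type, values) tuples.
    """
    os.makedirs(directory, exist_ok=True)
    nrows = len(columns[0][2])
    for col_name, data_type, values in columns:
        if len(values) != nrows:
            raise ValueError(f"column {col_name} does not have {nrows} values")
        with open(os.path.join(directory, col_name), "wb") as out:
            if data_type in STRING_TYPES:
                # Every string, including an empty one, ends with a null byte.
                for value in values:
                    out.write(value.encode("utf-8") + b"\0")
            else:
                code = NUMERIC_FORMATS[data_type]
                out.write(struct.pack(f"={nrows}{code}", *values))

    lines = [
        "BEGIN HEADER",
        f"DataSet.Name={partition_name}",
        f"Number_of_rows={nrows}",
        f"Number_of_columns={len(columns)}",
        "END HEADER",
    ]
    for col_name, data_type, _ in columns:
        lines += ["", "BEGIN Column", f"name={col_name}",
                  f"data_type={data_type}", "END Column"]
    with open(os.path.join(directory, "-part.txt"), "w") as out:
        out.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    target = tempfile.mkdtemp()
    write_partition(target, "demo", [
        ("a", "Int", [1, 2, 3]),
        ("b", "Double", [0.5, 1.5, 2.5]),
        ("s", "Category", ["x", "", "yz"]),
    ])
    print(sorted(os.listdir(target)))
```

Because the ith value of every file belongs to the ith row, the function refuses to write columns of different lengths.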

We mentioned earlier two example programs for converting ASCII data into vertically partitioned raw binary data files, tests/readcsv.cpp and examples/readcsv.cpp. One may follow these examples to develop new ones.

Appending new data to an existing table

The FastBit IBIS implementation was initially designed to run as a server. To minimize the down time while appending new data, it maintains two copies of the same data and performs the append operation in three steps:

First, the new data is appended to a backup copy of the active data directory.
The roles of the backup directory and the active directory are swapped. This is the only time the clients have to be locked out, and the operation itself takes a negligible amount of time.
The user may perform some integrity tests on the newly combined data and choose either to rollback the append operation or commit it.

During an append operation, three directories are used: a directory holding the new data, an active directory that can be used to answer user queries, and a backup directory for accepting new data. If there are no more rows to be added, the backup directory is never used again and can be removed. A configuration file is required to tell FastBit where various resources are, such as the data directory and the backup directory. These directories can be specified as follows:

dataDirectory = full-path-to-the-data-directory
backupDirectory = full-path-to-the-backup-directory

A sample configuration file can be found at the end of this document. Typically, the dataDirectory is a parent directory that contains the directories for different partitions, where each partition is stored in its own directory. In particular, if a new partition is created, it is created as a new directory under the dataDirectory.

If a configuration file is not specified, the new data will appear in a directory named .ibis in the current working directory.

Appending data through `ibis` command line tool

Using the ibis command line tool, these three steps can be performed with one invocation using the -a option. This option takes one mandatory argument, which is the name of the directory containing the data to be appended. It also takes two optional arguments in the form of to table_name, where table_name is the name of the table to which the new data is to be appended. The word to is used to make the command line slightly more friendly.

Appending data through `ibis::part` API

The operation of appending new data to the backup directory and swapping the roles of the active directory and the backup directory is performed in one function named append. It has the following definition.

int ibis::part::append(const char *newDataDir);

This function returns the number of rows appended if it is successful; otherwise it returns a negative value as an error code.

The function to perform the integrity check is called selfTest, which is defined as follows,

int ibis::part::selfTest(int numThreads=0, const char *prefix=0) const;

This function performs a set of predefined tests and returns the number of failures encountered. The function may take two arguments. The first is the number of threads to be used and the second is a prefix used to get extra control information from the configuration files. Both arguments are optional.

If the integrity tests went fine, one may commit the append operation by calling commit, which has the following definition.

int ibis::part::commit(const char *newDataDir);

Similar to function append, this function returns the number of rows appended or a negative error code.

NOTE: The argument to commit should be the same as the argument to append.

If the append operation failed the integrity tests, one may roll back the append operation by calling rollback.

int ibis::part::rollback();

This function returns either the number of rows removed or a negative number indicating an error.
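
The order in which these four functions are meant to be called can be summarized as a small decision procedure. The Python sketch below is not a FastBit binding: `part` is any object that exposes methods with the same names and return conventions as the C++ member functions described above, and the class `FakePart` is a hypothetical stand-in that exists only so the control flow can be run. The status strings are made up for this illustration.

```python
def append_with_check(part, new_data_dir):
    """Append, run the integrity tests, then commit or roll back.

    Returns ("committed", rows), ("rolled_back", rows) or ("failed", code).
    """
    appended = part.append(new_data_dir)
    if appended < 0:
        return ("failed", appended)  # append reported an error code
    if part.selfTest() == 0:
        # commit takes the same directory that was passed to append.
        rows = part.commit(new_data_dir)
        return ("committed", rows) if rows >= 0 else ("failed", rows)
    return ("rolled_back", part.rollback())


class FakePart:
    """Hypothetical stand-in that only imitates the documented return values."""

    def __init__(self, incoming_rows, failures=0):
        self.incoming_rows = incoming_rows
        self.failures = failures

    def append(self, new_data_dir):
        return self.incoming_rows

    def selfTest(self):
        return self.failures

    def commit(self, new_data_dir):
        return self.incoming_rows

    def rollback(self):
        return self.incoming_rows


if __name__ == "__main__":
    print(append_with_check(FakePart(100), "new_data"))
    print(append_with_check(FakePart(100, failures=2), "new_data"))
```

In a C++ program the same sequence would be expressed as calls to append, selfTest, and then commit or rollback on an ibis::part object.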

Append data without the backup directory

By specifying only the active data directory for a partition, it is possible to call ibis::part::append without the backup directory. Of course, the functions commit and rollback would then do nothing. If the append operation fails, one will have to do a manual recovery. FastBit does not have a tool to automate this recovery at this point in time.

Appendix: Sample files

Here is a sample -part.txt file with 6 columns and 1 million rows. The six different columns are of six different types supported by IBIS. The description can be any text on one line.

BEGIN HEADER
DataSet.Name=testData
Number_of_rows=1000000
Number_of_columns=6
Table_State=1
END HEADER

BEGIN Column
name=i9
description=integers 0, 1, ..., and 9
data_type=Int
END Column

BEGIN Column
name=j1
description=integers 0 and 1
data_type=Unsigned
END Column

BEGIN Column
name=f0
description=float values 0 and 1
data_type=Float
END Column

BEGIN Column
name=d0
description=double 0, same as i0
data_type=Double
END Column

BEGIN Column
name=s1
description=26-lower case alphabets
data_type=Category
END Column

BEGIN Column
name=t1
description=26-lower case alphabets, same as s1
data_type=String
END Column

A sample ibis configuration file.

dataDirectory=/data/jwu/index
backupDirectory=/data/jwu/backup
CacheDirectory=/tmp/QECache
fileManager.maxBytes=300Mb
longTests=true
preferMMapIndex=1
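
The sample above consists of name=value lines, so a script that prepares data can read the directory settings from the same file. The sketch below is a minimal reader written for this illustration; it is not the parser FastBit uses, the function name `parse_config` is hypothetical, and any syntax beyond plain name=value lines (with optional spaces around the equals sign) is not handled.

```python
SAMPLE = """\
dataDirectory=/data/jwu/index
backupDirectory = /data/jwu/backup
CacheDirectory=/tmp/QECache
fileManager.maxBytes=300Mb
longTests=true
preferMMapIndex=1
"""


def parse_config(text):
    """Return a dict of the name=value pairs found in text."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and anything that is not a pair
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    config = parse_config(SAMPLE)
    print(config["dataDirectory"], config["backupDirectory"])
```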

Note that the source code distribution also includes two very small sets of data for testing. Some larger datasets are available online for download.

Warning: The number of rows in each data partition is recorded with a 32-bit integer, which limits the maximum number of rows in a data partition to 2³² (~ 4 billion). However, there are a number of further limitations on the number of rows that should be put in a data partition.

Each column is loaded into memory when the raw data is needed. If your machine has 2GB of memory, then the maximum number of rows in a data partition with 8-byte double-precision floating-point values is 250 million.
In order to build an index, both the raw data and the index must fit in the memory. This imposes an even lower upper bound on the number of rows in a data partition. To be safe, we assume the index size is going to take 5N words, where N is the number of rows in the partition. Due to memory fragmentation, these 5N words might actually occupy an address space of size 10N words. Therefore, we typically partition data so that each column takes no more than one-tenth of the memory.
Currently the result table of a query is stored in memory. This places an upper bound on how many rows can be stored in such a table.
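
The first two limits above are simple arithmetic on the memory size and the width of a value. The sketch below reproduces the 250 million figure and applies the one-tenth rule to the same machine; the function names are hypothetical, and 2GB is taken as 2 × 10⁹ bytes, which is the reading that yields 250 million.

```python
def max_rows_in_memory(memory_bytes, bytes_per_value):
    """Largest number of values of one column that fits in memory."""
    return memory_bytes // bytes_per_value


def max_rows_for_indexing(memory_bytes, bytes_per_value):
    """Apply the rule that a column should take at most one-tenth of memory."""
    return memory_bytes // (10 * bytes_per_value)


if __name__ == "__main__":
    two_gb = 2 * 10**9
    print(max_rows_in_memory(two_gb, 8))     # 8-byte doubles
    print(max_rows_for_indexing(two_gb, 8))
```

For 8-byte values on a 2GB machine, the one-tenth rule therefore suggests partitions of about 25 million rows.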