FastBit

How to load data

Overview

quickstart.html gives brief instructions for loading data into FastBit. This document gives further details for those who want to understand the internal structures in order to make more extensive use of FastBit.

Sample data and usage examples

Recall that FastBit stores a data table in multiple partitions and each partition is stored in one directory on the file system. We first briefly describe how to use ardea to create this directory, then give more details about the files in this directory, and finally offer some advice on how different functions can be used to integrate data into existing tables.

Using Existing Command Line Tools

Starting with data in ASCII form in the Comma-Separated Values (CSV) format, the command line tool ardea can digest the data and create the raw binary files and metadata files for FastBit. Very briefly, a CSV file contains the values of a data table in ASCII format with commas (and possibly white space) as delimiters. It has all columns of a row on one line, and usually contains neither the names nor the types of the columns. Therefore it is necessary to supply this information from elsewhere. The executable ardea has four options that deal with reading CSV data.

Regarding column names, FastBit imposes the following restrictions. The same restrictions also apply to names of data partitions. Following the above specification, some examples of valid column names are "col1", "c0ll" and "c01l" (clearly they are easily confused and only one of them should be used). Some examples of invalid names are "6e5", "e-f", and "e.f". Note also that columns named "name" and "NaMe" are treated as a single column.
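
The examples above are consistent with a rule along these lines: a name starts with a letter or an underscore, continues with letters, digits or underscores, and is compared without regard to case. The sketch below encodes that reading. The rule itself is an assumption inferred from the examples, and isPlausibleColumnName and sameColumn are hypothetical helpers, not FastBit functions.

```cpp
#include <cctype>
#include <string>

// Assumed rule (inferred from the examples in the text, not taken from the
// FastBit source): the first character is a letter or '_', the remaining
// characters are letters, digits or '_'.
bool isPlausibleColumnName(const std::string& name) {
    if (name.empty()) return false;
    const unsigned char first = static_cast<unsigned char>(name[0]);
    if (!std::isalpha(first) && first != '_') return false;
    for (unsigned char ch : name) {
        if (!std::isalnum(ch) && ch != '_') return false;
    }
    return true;
}

// Names are compared without regard to case, so "name" and "NaMe" refer to
// the same column.
bool sameColumn(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(a[i])) !=
            std::tolower(static_cast<unsigned char>(b[i]))) return false;
    }
    return true;
}
```

A check of this kind may be useful before passing a list of names to ardea, since a rejected name is easier to diagnose before any file has been written.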

The executable ardea supports a subset of the data types that are supported by the query processing functions. The following is a complete list of data types supported by ardea.

Except for the unsigned integer types, all types start with a different letter, and ardea only tests the first letter in these cases. The first letter can be in either upper or lower case. Therefore the example given in quickstart.html can be shortened slightly to:
examples/ardea -d tmp -m "a:i,b:f,c:s" -t tests/test0.csv

NOTE: String values containing blank spaces must be quoted because a blank space is assumed to be a delimiter as well.

NOTE: Quotes inside strings can be escaped with a backslash.
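
As a sketch of the two notes above, the following hypothetical helper prepares a string value before it is written to a CSV file for ardea. It assumes that the double quote is the quoting character, which is not stated here, and it does nothing special for literal backslashes in the value; both points should be checked against the ardea source before relying on this.

```cpp
#include <string>

// Wrap a string value in double quotes and put a backslash before every
// embedded double quote. Assumption: '"' is the quote character in use.
std::string quoteField(const std::string& value) {
    std::string out = "\"";
    for (char ch : value) {
        if (ch == '"') out += '\\';
        out += ch;
    }
    out += '"';
    return out;
}
```

Quoting every string value, whether or not it contains a blank, keeps the writer simple and avoids having to inspect each value first.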

If the data directory specified with the -d option already contains some data, the new rows are appended to the existing records. If the column names do not match, NULL values are used to pad the rows. If the data types do not match, the new data type is used and the existing records are left unchanged. Only differences between signed and unsigned integers are allowed; other mismatches are flagged and nothing is written.

The CSV files in the directory tests are not exactly standard CSV files: each contains an extra header line. The command ardea skips these lines because they cannot be properly parsed into the data types specified. The header lines are there to help the program readcsv in the directory tests. In most cases, CSV files produced by other systems will not have this extra header line. We mention readcsv because it can serve as an example for those who want to write their own program to read data for FastBit. The source code for readcsv is tests/readcsv.cpp.

Files in a Data Partition

In a directory containing a data partition, there are files for each column and the metadata file named -part.txt. For example, after building the indexes in the directory tmp generated by the above command, we have the following files:
-rw-r--r-- 1 kwu Users  402 Aug  3 20:35 -part.txt
-rw-r--r-- 1 kwu Users  400 Aug  3 20:35 a
-rw-r--r-- 1 kwu Users 3520 Aug  4 23:14 a.idx
-rw-r--r-- 1 kwu Users  400 Aug  3 20:35 b
-rw-r--r-- 1 kwu Users 3520 Aug  4 23:14 b.idx
-rw-r--r-- 1 kwu Users  200 Aug  3 20:35 c
-rw-r--r-- 1 kwu Users 3520 Aug  4 23:14 c.idx
In this listing, we see three data files named after the columns given on the command line (a, b and c), the metadata file -part.txt, and three index files named a.idx, b.idx and c.idx. Each index file contains all information necessary to reconstruct a bitmap index, including the bitmaps and the keyvalues associated with each bitmap. As described in LBNL-62756, the bitmaps and the keyvalues are densely packed. This design allows us to answer typical range queries efficiently. The specific details of each bitmap index are documented with the function write in each index class.

In addition to the above files, there are a few other files that FastBit uses. If one plans to share the FastBit data directory with other applications, make sure none of them use the same names as FastBit.

Most of these files are output files used by FastBit. If a file with the same name exists in the directory for the data partition, it is likely to be overwritten or removed by FastBit.

To convert data from another format to be used by FastBit, the key data files to produce are -part.txt and the binary data files for columns. Next, we describe the content of these files.

Conversion to binary data

Overall, FastBit views user data as tables consisting of rows and columns. This is similar to most common database systems. What is different is that it organizes each column of the data in its own binary file. In contrast, most existing database management systems organize the data of each row together. A typical layout of a data table on paper shows the rows horizontally and the columns vertically; thus the common data layout is known as the horizontal organization and our column-oriented organization is known as the vertical organization. The vertical organization ensures that only homogeneous data is in a file, which reduces access overhead in many data warehousing applications.

Unless your data happen to be in vertically organized binary files, you will have to do a conversion. The conversion process reads the data in its original format and produces a copy in raw binary form, with each column going to its own file. The requirements are as follows.

  1. All files to be appended to one data partition shall be in one directory.
  2. The ith data value of each data file belongs to the ith row of the table. Therefore, the values in all files must be in the same row order.
  3. Write all data values in raw binary format. String values are to be written as raw bytes with null terminators; an empty string must still have a null terminator.
  4. Provide an ASCII file named -part.txt that describes the data files in the directory. A sample is given in the appendix of this document. Currently supported data types are given above.
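
Requirement 3 can be sketched as follows. The helpers packColumn, packStrings and writeColumnFile are hypothetical names introduced for this illustration; they are not part of FastBit. The sketch assumes that "raw binary" means the values stored back to back in the machine's native byte order with no header, and it leaves open which C++ type corresponds to each FastBit data type.

```cpp
#include <cstring>
#include <fstream>
#include <string>
#include <type_traits>
#include <vector>

// Raw image of a fixed-width column: the values back to back, in native byte
// order (an assumption), with no header and no padding.
template <typename T>
std::string packColumn(const std::vector<T>& values) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "only plain numeric types can be copied byte by byte");
    std::string bytes(values.size() * sizeof(T), '\0');
    if (!values.empty())
        std::memcpy(&bytes[0], values.data(), bytes.size());
    return bytes;
}

// Raw image of a string column: each value followed by one null byte, so an
// empty string still occupies one byte.
std::string packStrings(const std::vector<std::string>& values) {
    std::string bytes;
    for (const std::string& v : values) {
        bytes += v;
        bytes += '\0';
    }
    return bytes;
}

// Write one column image to dir/columnName. The directory must already
// exist; returns false if the file could not be written.
bool writeColumnFile(const std::string& dir, const std::string& columnName,
                     const std::string& bytes) {
    std::ofstream out((dir + "/" + columnName).c_str(), std::ios::binary);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return out.good();
}
```

Building every column from the same sequence of input rows, in the same order, is what keeps the ith value of each file on the ith row as requirement 2 demands.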

We mentioned two example programs for converting ASCII data into vertically partitioned raw binary data files before, tests/readcsv.cpp and examples/readcsv.cpp. One may follow these examples to develop a new one.

Appending new data to an existing table

The FastBit IBIS implementation was initially designed to run as a server. To minimize the down time while appending new data, it maintains two copies of the same data and performs the append operation in three steps:
  1. First, the new data is appended to a backup copy of the active data directory.
  2. The roles of the backup directory and the active directory are swapped. This is the only time the clients have to be locked out, and the operation itself takes a negligible amount of time.
  3. The user may perform some integrity tests on the newly combined data and choose either to roll back the append operation or to commit it.
During an append operation, three directories are used: a directory holding the new data, an active directory that can be used to answer user queries, and a backup directory for accepting new data. If there are no more rows to be added, the backup directory is never used again and can be removed. A configuration file is required to tell FastBit where various resources, such as the data directory and the backup directory, are located. These directories can be specified as follows:
dataDirectory = full-path-to-the-data-directory
backupDirectory = full-path-to-the-backup-directory
A sample configuration file can be found at the end of this document. Typically, the dataDirectory is a parent directory that contains the directories for different partitions, where each partition is stored in its own directory. In particular, if a new partition is created, it is created as a new directory under the dataDirectory.

If a configuration file is not specified, the new data will appear in a directory named .ibis in the current working directory.

Appending data through ibis command line tool

Using the ibis command line tool, these three steps can be performed with one invocation using the -a option. This option takes one mandatory argument, which is the name of the directory containing the data to be appended. It also accepts two optional words in the form of to table_name, where table_name is the name of the table to which the new data is to be appended. The word to is there to make the command line slightly more friendly.

Appending data through ibis::part API

The operation of appending new data to the backup directory and swapping the roles of the active directory and the backup directory is performed in one function named append. It has the following definition.
int ibis::part::append(const char *newDataDir);
This function returns the number of rows appended if it is successful; otherwise it returns a negative value as an error code.

The function to perform the integrity check is called selfTest, which is defined as follows:

int ibis::part::selfTest(int numThreads=0, const char *prefix=0) const;
This function performs a set of predefined tests and returns the number of failures encountered. The function may take two arguments. The first is the number of threads to be used and the second is a prefix used to get extra control information from the configuration files. Both arguments to this function are optional.

If the integrity tests pass, one may commit the append operation by calling commit, which has the following definition.

int ibis::part::commit(const char *newDataDir);
Similar to the function append, this function returns the number of rows appended or a negative error code.

NOTE: The argument to commit should be the same as the argument to append.

If the append operation failed the integrity tests, one may roll back the append operation by calling rollback.

int ibis::part::rollback();
This function returns either the number of rows removed or a negative number indicating an error.
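
The sequence of calls described above can be sketched as a small function template. To keep the sketch compilable without the FastBit headers, it is written against any type that offers the four member functions with the signatures listed above; FakePart is a stand-in invented for this illustration, and appendChecked and its return value of -1 after a rollback are likewise hypothetical.

```cpp
#include <cassert>

// Stand-in with the four member functions described above, so that the
// sketch compiles without FastBit. It only counts rows.
struct FakePart {
    int rows = 0;         // rows visible in the active directory
    int pending = 0;      // rows appended but not yet committed
    bool healthy = true;  // what selfTest should report
    int append(const char*) { pending = 10; rows += pending; return pending; }
    int selfTest(int = 0, const char* = 0) const { return healthy ? 0 : 1; }
    int commit(const char*) { int n = pending; pending = 0; return n; }
    int rollback() { int n = pending; rows -= n; pending = 0; return n; }
};

// Append, test, then commit or roll back. Returns the number of rows
// committed, or a negative value if the append failed or was rolled back.
template <typename Part>
long appendChecked(Part& part, const char* newDataDir) {
    assert(newDataDir != 0);
    const long appended = part.append(newDataDir);
    if (appended < 0) return appended;  // error code from append
    if (part.selfTest() != 0) {         // at least one test failed
        part.rollback();
        return -1;                      // hypothetical error code
    }
    return part.commit(newDataDir);     // same argument as for append
}
```

With FastBit available, the same template could presumably be instantiated with ibis::part in place of FakePart, since it relies only on the member functions listed in this section.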

Append data without the backup directory

By specifying only the active data directory for a partition, it is possible to call ibis::part::append without the backup directory. Of course, the functions commit and rollback would then do nothing. If the append operation fails, one will have to do a manual recovery. FastBit does not have a tool to automate this recovery at this point in time.

Appendix: Sample files

Here is a sample -part.txt file with 6 columns and 1 million rows. The six different columns are of six different types supported by IBIS. The description can be any text on one line.
BEGIN HEADER
DataSet.Name=testData
Number_of_rows=1000000
Number_of_columns=6
Table_State=1
END HEADER

BEGIN Column
name=i9
description=integers 0, 1, ..., and 9
data_type=Int
END Column

BEGIN Column
name=j1
description=integers 0 and 1
data_type=Unsigned
END Column

BEGIN Column
name=f0
description=float values 0 and 1
data_type=Float
END Column

BEGIN Column
name=d0
description=double 0, same as i0
data_type=Double
END Column

BEGIN Column
name=s1
description=26-lower case alphabets
data_type=Category
END Column

BEGIN Column
name=t1
description=26-lower case alphabets, same as s1
data_type=String
END Column
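
A conversion program has to produce a file of this shape itself. The following sketch generates text in the layout of the sample above; ColumnSpec and makePartTxt are hypothetical names, the value Table_State=1 is simply copied from the sample, and whether FastBit requires additional fields is not covered here.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One column entry; the type string is written as given (for example "Int").
struct ColumnSpec {
    std::string name;
    std::string type;
    std::string description;
};

// Build the text of a -part.txt file in the layout of the sample above.
// Table_State=1 is copied from the sample without interpretation.
std::string makePartTxt(const std::string& dataSetName, unsigned long rows,
                        const std::vector<ColumnSpec>& columns) {
    std::ostringstream out;
    out << "BEGIN HEADER\n"
        << "DataSet.Name=" << dataSetName << "\n"
        << "Number_of_rows=" << rows << "\n"
        << "Number_of_columns=" << columns.size() << "\n"
        << "Table_State=1\n"
        << "END HEADER\n";
    for (const ColumnSpec& c : columns) {
        out << "\nBEGIN Column\n"
            << "name=" << c.name << "\n"
            << "description=" << c.description << "\n"
            << "data_type=" << c.type << "\n"
            << "END Column\n";
    }
    return out.str();
}
```

The returned text would be written to a file named -part.txt in the same directory as the binary column files.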

A sample ibis configuration file.

dataDirectory=/data/jwu/index
backupDirectory=/data/jwu/backup
CacheDirectory=/tmp/QECache
fileManager.maxBytes=300Mb
longTests=true
preferMMapIndex=1
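
The sample shows that each setting is a name and a value separated by an equal sign, with or without surrounding spaces. As an illustration of that layout only, the sketch below reads such lines into a map. It is not FastBit's own parser; trim, readSettings and readSettingsFromText are hypothetical names, and details such as comments or case sensitivity of the names are not handled.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Remove leading and trailing spaces, tabs and carriage returns.
std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Read "name = value" lines into a map; lines without '=' are skipped.
std::map<std::string, std::string> readSettings(std::istream& in) {
    std::map<std::string, std::string> settings;
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;
        const std::string key = trim(line.substr(0, eq));
        if (!key.empty()) settings[key] = trim(line.substr(eq + 1));
    }
    return settings;
}

// Convenience wrapper for text that is already in memory.
std::map<std::string, std::string> readSettingsFromText(const std::string& text) {
    std::istringstream in(text);
    return readSettings(in);
}
```

A conversion script could use something like this to confirm that dataDirectory and backupDirectory are both set before starting an append.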

Note: the source code distribution also includes two very small sets of data for testing. Some larger data sets are available online for download.

Warning: The number of rows in each data partition is recorded with a 32-bit integer, which limits the maximum number of rows in a data partition to 2^32 (~ 4 billion). However, there are a number of further limitations on the number of rows that should be put in a data partition.
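
A program that appends rows repeatedly may want to check this limit before each append. The sketch below assumes the counter is an unsigned 32-bit integer, in which case the largest row count it can hold is 2^32 - 1 = 4294967295; fitsRowCounter is a hypothetical helper, and the check says nothing about the other, more practical limits mentioned above.

```cpp
#include <cstdint>
#include <limits>

// True if a partition holding `existing` rows can take `added` more rows
// without the total exceeding what an unsigned 32-bit counter can hold.
bool fitsRowCounter(std::uint64_t existing, std::uint64_t added) {
    const std::uint64_t limit = std::numeric_limits<std::uint32_t>::max();
    return existing <= limit && added <= limit - existing;
}
```

The comparison is written as added <= limit - existing rather than existing + added <= limit so that the 64-bit sum cannot wrap around for very large arguments.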