

How to compute data distributions using FastBit ibis::part API

FastBit uses the metaphor of rows and columns to describe user data. Given a set of user data (called a table), finding the distribution of a column takes two steps: first find the names of the available columns, then compute the distribution of a particular column. For instructions on how to prepare the data, read the section on Preparing Data for FastBit (dataLoading.html).

Using Command Line Tool

One can use the option -print to instruct the ibis command-line tool to print a variety of information. Here is a more in-depth explanation of this option.

-p[rint] [Parts|Columns|Distributions|column-name [:conditions]]

The four possible arguments to this option are as follows.

Parts is for printing the names of all data partitions known to the program ibis, either through the configuration files or the -d options.
Columns is for printing the names of all columns in each partition.
Distributions is for printing the cumulative distribution of every column of every partition.
column-name [:conditions] is for printing information about the named column (from every table containing such a column). It also prints a detailed data distribution. One may apply a set of conditions to restrict the computation. The syntax for the conditions is the same as that for the where-clause of queries. For example, the following option instructs ibis to print the cumulative distribution of the variable temperature subject to the condition that pressure > 1000 and H2O > 1e-5.
```
-print "temperature : pressure > 1000 and H2O > 1e-5"
```
Note: quotes are needed to ensure that the text following option -p is passed as one single string.
Note: if the column-name happens to be the name of a data partition, some basic information about the partition is printed.
Finding out the name of the variables
The function ibis::part::getInfo() returns a pointer to a new ibis::part::info object. This object contains a list of column descriptions defined as follows.
```
ibis::part::info* ibis::part::getInfo() const;

struct ibis::part::info {
    const char* name;		// Table name.
    const char* description;	// A free-form description of the table.
    const char* metaTags;	// A string of name-value pairs.
    const size_t nrows;	// The number of rows in the table.
    // The list of columns in the table.
    std::vector<ibis::column::info*> cols;

    info(const ibis::part& tbl);	// The constructor.
    ~info();
};

struct ibis::column::info {
    const char* name;		// Column name.
    const char* description;	// A description about the column.
    const double expectedMin;	// The expected lower bound.
    const double expectedMax;	// The expected upper bound.
    const DATA_TYPE type;	// The type of the values.

    info(const ibis::column& col);
};

// DATA_TYPE is declared in class ibis::column.  Each value has to be
// qualified with ibis::column:: for safe use.  For example, INT has to be
// referred to as ibis::column::INT.
enum ibis::column::DATA_TYPE
{RID=0,  // Row ids (8-byte)
 KEY,    // categorical (string) values, low-cardinality string values
 STRING, // arbitrary strings
 FLOAT,  // IEEE 32-bit floating-point numbers
 DOUBLE, // IEEE 64-bit floating-point numbers
 FID,    // File ids (unsigned integers)
 INT,    // signed 4-byte integers
 UINT,   // unsigned 4-byte integers
 BYTE,   // signed 1-byte integers
 UBYTE,  // unsigned 1-byte integers
 SHORT,  // signed 2-byte integers
 USHORT, // unsigned 2-byte integers
 LONG,   // signed 8-byte integers
 ULONG}; // unsigned 8-byte integers
```
Alternatively, the user may construct an ibis::part::info object by invoking its constructor.
NOTE: Both ibis::part::info and ibis::column::info will become ill-defined if the ibis::part and ibis::column used to create them are deleted. Delete the info objects first.
NOTE: The variables expectedMin and expectedMax are expected values, typically provided by the user who set up the data table. For example, a mass fraction is expected to be between 0 and 1. In a typical application, there is an expected range for most of the variables/columns. For one reason or another, the actual values may fall outside of that range.
The actual minimum and maximum values can be obtained by calling the following functions with the column name as the argument.
```
// The actual minimum value in the named column.
double ibis::part::getActualMin(const char *cname) const;
// The actual maximum value in the named column.
double ibis::part::getActualMax(const char *cname) const;
```
If the column object is accessible, one may call the following functions instead.
```
// Compute the actual minimum value by reading the data or examining
// the index.  It returns DBL_MAX in case of error.
double ibis::column::getActualMin() const;
// Compute the actual maximum value by reading the data or examining
// the index.  It returns -DBL_MAX in case of error.
double ibis::column::getActualMax() const;
```
NOTE: The functions that compute the actual minimum and maximum make use of bitmap indices if they exist; otherwise the values are computed by reading the user data.
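The relationship between the expected and the actual range can be checked with a few comparisons. The following is a minimal sketch, not part of FastBit: the helper names actualRangeIsValid and withinExpectedRange are hypothetical, and the check assumes the error values documented above for the ibis::column versions (DBL_MAX for the minimum, -DBL_MAX for the maximum). A column that legitimately contains DBL_MAX would be misreported as an error by this check.

```cpp
#include <cassert>
#include <cfloat>

// Hypothetical helper (not part of FastBit).  The arguments are the values
// returned by getActualMin() and getActualMax().  Returns false when either
// value is the documented error sentinel or the range is empty.
bool actualRangeIsValid(double actualMin, double actualMax) {
    return actualMin != DBL_MAX && actualMax != -DBL_MAX &&
           actualMin <= actualMax;
}

// Hypothetical helper (not part of FastBit).  Returns true when the actual
// values all lie inside [expectedMin, expectedMax].
bool withinExpectedRange(double expectedMin, double expectedMax,
                         double actualMin, double actualMax) {
    assert(expectedMin <= expectedMax);  // the expected range must be well formed
    return actualRangeIsValid(actualMin, actualMax) &&
           expectedMin <= actualMin && actualMax <= expectedMax;
}
```

For the mass fraction example, withinExpectedRange(0.0, 1.0, actualMin, actualMax) reports whether every stored value lies between 0 and 1.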
Computing Binned Histogram

One-dimensional histogram
The following function counts the number of records that fall between each pair of consecutive values in the array bounds.
```
long ibis::part::getDistribution
     (const char *cname, const char *cond,
      std::vector<double>& bounds,
      std::vector<size_t>& counts) const;
```
The argument cname is the name of the variable whose distribution is to be computed. A set of arbitrary conditions can be applied in selecting the records. The syntax of the conditions is the same as that of the where clause. For example, if there are two variables in the table named A and B, one may specify a set of conditions such as "A > 5 and B < 6" or "A > sqrt(B) and 1 < B < 4".
If this function is called with a set of values in ascending order in the array bounds, the content of bounds is used to define the bin boundaries. Otherwise, this function uses a simple strategy to select bin boundaries, based either on the binning structure of the index for variable cname or on a simple linear division of the range between the minimum and the maximum values selected.
Given n values in bounds, n+1 bins are defined as follows. The first bin is for any value that is less than the first value in bounds, typically displayed as (..., bounds[0]). The second bin is for the values between bounds[0] and bounds[1], more specifically [bounds[0], bounds[1]), which includes the left boundary of the bin but not the right boundary. There are n-1 such bins [bounds[i], bounds[i+1]). The last bin is for any value that is greater than or equal to bounds[n-1], [bounds[n-1], ...). The array counts contains n+1 values, one for each of the bins, indicating the number of records that fall in the bin. The following is an illustration of the bins.
```
bin   0: (...,         bounds[0])
bin   1: [bounds[0],   bounds[1])
bin   2: [bounds[1],   bounds[2])
...
bin n-1: [bounds[n-2], bounds[n-1])
bin   n: [bounds[n-1], ...)
```
This function returns the number of bins upon successful completion. On failure, a negative number is returned.
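To make the bin layout concrete, the following sketch maps a single value to the number of the bin that would count it. It is an illustration only: binIndex is a hypothetical helper, not a FastBit function, and it assumes bounds is sorted in ascending order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of FastBit): return the number of the bin
// that would hold value v under the layout described above.
//   bin 0: (..., bounds[0])
//   bin i: [bounds[i-1], bounds[i])   for 0 < i < n
//   bin n: [bounds[n-1], ...)
std::size_t binIndex(const std::vector<double>& bounds, double v) {
    assert(std::is_sorted(bounds.begin(), bounds.end()));
    // upper_bound finds the first boundary strictly greater than v, so its
    // offset equals the number of boundaries that are <= v.
    return static_cast<std::size_t>(
        std::upper_bound(bounds.begin(), bounds.end(), v) - bounds.begin());
}
```

With bounds = {10, 20, 30}, the value 5 maps to bin 0, the value 10 maps to bin 1 because the left boundary is included, and the value 30 maps to bin 3.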
An Example
Assume we have an ibis::part object called T with columns named A, B, and C. To count the values of A between 0 and 100 in 10 bins, subject to the additional condition that B > 5 and 3 <= C < 17, we could do the following.
```
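// Restrict A to [0, 100] so that the two open-ended bins are bounded.
const char *cond = "B > 5 and 3 <= C < 17 and 0 <= A <= 100";
// Nine interior boundaries (10, 20, ..., 90) define ten bins.
std::vector<double> bounds;
for (int i = 10; i < 100; i += 10)
    bounds.push_back(i);
std::vector<size_t> counts;
// On success, ierr is the number of bins; a negative value indicates failure.
long ierr = T.getDistribution("A", cond, bounds, counts);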
```
Because the first and last bins built by getDistribution are open-ended, it is necessary to explicitly limit the range of values for A. In the above example, we assume that the user wants to include the value 0 in the first bin and the value 100 in the last bin. If this is not the case, one needs to modify the conditions accordingly. For example, to include 0 but exclude 100, one may use the following set of conditions.
```
const char *cond = "B > 5 and 3 <= C < 17 and 0 <= A < 100";
```
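The loop that fills bounds in the example generalizes to any range and bin count. The sketch below is a hypothetical helper, not part of FastBit: equalWidthBounds returns the nbins-1 interior boundaries, which getDistribution would turn into nbins bins once the range of the column is limited through the conditions as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of FastBit): produce the nbins-1 interior
// boundaries that split the range from lo to hi into nbins equal-width bins.
std::vector<double> equalWidthBounds(double lo, double hi, std::size_t nbins) {
    assert(nbins >= 1 && lo < hi);
    std::vector<double> bounds;
    bounds.reserve(nbins - 1);
    const double width = (hi - lo) / static_cast<double>(nbins);
    for (std::size_t i = 1; i < nbins; ++i)
        bounds.push_back(lo + width * static_cast<double>(i));
    return bounds;
}
```

Calling equalWidthBounds(0, 100, 10) yields the same nine values (10, 20, ..., 90) as the loop in the example.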
Two-dimensional histogram
A function is also provided to compute the joint distribution of two variables.
```
long ibis::part::getJointDistribution
     (const char *cname1, const char* cname2, const char *cond,
      std::vector<double>& bounds1,
      std::vector<double>& bounds2,
      std::vector<size_t>& counts) const;
```
This function has a calling sequence very similar to that of getDistribution. The main difference is that there are two variable names and two sets of bin boundaries. Let n₁ denote the number of values in array bounds1 and n₂ denote the number of values in array bounds2; there are then n₁+1 bins for variable 1 and n₂+1 bins for variable 2. Altogether, there are (n₁+1)(n₂+1) bins. Like getDistribution, this function returns the number of bins upon successful completion.
The two-dimensional bins are linearized in the array counts in the usual raster scan order. For each bin of variable 1, all the bins defined by variable 2 are packed one after another, as illustrated in the following diagram.
```
(..., bounds1[0])          (..., bounds2[0])
(..., bounds1[0])          [bounds2[0], bounds2[1])
(..., bounds1[0])          [bounds2[1], bounds2[2])
...
(..., bounds1[0])          [bounds2[n2-1], ...)
[bounds1[0], bounds1[1])   (..., bounds2[0])
[bounds1[0], bounds1[1])   [bounds2[0], bounds2[1])
[bounds1[0], bounds1[1])   [bounds2[1], bounds2[2])
...
[bounds1[1], bounds1[2])   [bounds2[n2-1], ...)
[bounds1[2], bounds1[3])   (..., bounds2[0])
...
[bounds1[n1-1], ...)       (..., bounds2[0])
[bounds1[n1-1], ...)       [bounds2[0], bounds2[1])
[bounds1[n1-1], ...)       [bounds2[1], bounds2[2])
...
[bounds1[n1-1], ...)       [bounds2[n2-1], ...)
```
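The raster scan order shown above amounts to a simple index computation. The following sketch assumes bin numbers start at 0 along each variable; jointBinIndex, splitJointIndex, and BinPair are hypothetical names introduced for illustration and are not part of FastBit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of FastBit): position in counts of the
// two-dimensional bin (i, j), where i is the bin number along variable 1
// and j is the bin number along variable 2.  n2 is the number of values in
// bounds2, so variable 2 has n2+1 bins.
std::size_t jointBinIndex(std::size_t i, std::size_t j, std::size_t n2) {
    assert(j <= n2);
    return i * (n2 + 1) + j;
}

// Hypothetical helper (not part of FastBit): the inverse mapping, from a
// position k in counts back to the pair of bin numbers.
struct BinPair { std::size_t i; std::size_t j; };

BinPair splitJointIndex(std::size_t k, std::size_t n2) {
    return BinPair{k / (n2 + 1), k % (n2 + 1)};
}
```

With three values in bounds2 (four bins for variable 2), the bin (2, 1) is stored at counts[9].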
Computing Cumulative Data Distribution
Given a table object, there are two ways to compute the cumulative distribution of a column: one uses std::vector to compute a more accurate distribution, and the other fills the distribution into two user-provided arrays. Each has a variant that accepts a set of conditions.
```
long ibis::part::getCumulativeDistribution
     (const char *cname, std::vector<double>& bounds,
      std::vector<size_t>& counts) const;
long ibis::part::getCumulativeDistribution
     (const char *cname, size_t nbc,
      double *bounds, size_t *counts) const;
long ibis::part::getCumulativeDistribution
     (const char *cname, const char *conditions,
      std::vector<double>& bounds,
      std::vector<size_t>& counts) const;
long ibis::part::getCumulativeDistribution
     (const char *cname, const char *conditions,
      size_t nbc, double *bounds, size_t *counts) const;
```
In the above functions, the argument cname is the column name (i.e., ibis::column::info::name). The value counts[i] stores the number of rows whose column cname contains values less than bounds[i]. If there are no NULL values in the column, the value of counts[0] is zero and the last element of bounds has a value that is larger than the actual maximum.
A set of conditions may be applied to restrict the computation of the data distribution. For example, if there are two variables in the table named A and B, one may specify a set of conditions such as "A > 5 and B < 6" or "A > sqrt(B) and 1 < B < 4". When a set of conditions is specified, the cumulative data distribution returned is the distribution of the column subject to the conditions. This is useful for conditional analysis.
The input argument nbc is the maximum number of values to be stored in the two pointers that appear later in the calling sequence. The actual number of values stored in bounds and counts is never more than nbc.
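Since counts[i] is the number of rows with values less than bounds[i], the difference of two consecutive entries gives the number of rows in the interval [bounds[i], bounds[i+1]). The sketch below performs this conversion on a plain vector. cumulativeToBinned is a hypothetical helper, not part of FastBit, and it assumes the cumulative counts correspond to bounds in ascending order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of FastBit): convert cumulative counts into
// per-interval counts.  Entry i of the result is the number of rows whose
// value lies in [bounds[i], bounds[i+1]).
std::vector<std::size_t>
cumulativeToBinned(const std::vector<std::size_t>& cumulative) {
    std::vector<std::size_t> binned;
    if (cumulative.size() < 2)
        return binned;  // fewer than two entries define no interval
    binned.reserve(cumulative.size() - 1);
    for (std::size_t i = 0; i + 1 < cumulative.size(); ++i) {
        // Cumulative counts over ascending bounds never decrease.
        assert(cumulative[i] <= cumulative[i + 1]);
        binned.push_back(cumulative[i + 1] - cumulative[i]);
    }
    return binned;
}
```

For example, cumulative counts of {0, 3, 7, 10} correspond to three intervals holding 3, 4, and 3 rows.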
If a column object is available, one may directly call the following function to get the data distribution. It is equivalent to the first version of ibis::part::getCumulativeDistribution.
```
long
ibis::column::getCumulativeDistribution(std::vector<double>& bounds,
                                  std::vector<size_t>& counts) const;
```
NOTE: An invocation of any version of getCumulativeDistribution will cause a bitmap index to be created if one does not already exist. This may take some time, but it will reduce the time required by future operations.