FastBit uses the metaphor of rows and columns to describe user data. Given a set of user data (called a table), the process of finding the distribution of a column takes two steps, one to find the names of the columns available and one to actually compute the distribution of a particular column. If you need instructions on how to prepare the data, read the section on Preparing Data for FastBit or dataLoading.html.
Use the option -print to instruct the ibis command line tool to print a variety of information. Here is a more in-depth explanation of this option.
    -p[rint] [Parts|Columns|Distributions|column-name [:conditions]]
The four different arguments to this option are as follows.
Parts is for printing the names of all data partitions known to the program ibis, either through the configuration files or the -d options.
Columns is for printing the names of all columns in each partition.
Distributions is for printing the cumulative distribution of every column of every partition.
column-name [:conditions] is for printing information about the named column (from every table containing such a column). It also prints a detailed data distribution. One may apply a set of conditions to restrict the computation. The syntax for the conditions is the same as that for the where-clause of the queries. For example, the following option instructs ibis to print the cumulative distribution of variable temperature subject to the condition that pressure > 1000 and H2O > 1e-5.
-print "temperature : pressure > 1000 and H2O > 1e-5"
Note: quotes are needed to ensure that the text following the option -p is passed as one single string.
Note: if the column-name happens to be the name of a data partition, some basic information about the partition is printed.
The function ibis::part::getInfo() returns a pointer to a new ibis::part::info object. This object contains a list of column descriptions defined as follows.
    ibis::part::info* ibis::part::getInfo() const;

    struct ibis::part::info {
        const char* name;        // Table name.
        const char* description; // A free-form description of the table.
        const char* metaTags;    // A string of name-value pairs.
        const size_t nrows;      // The number of rows in the table.
        // The list of columns in the table.
        std::vector<ibis::column::info*> cols;

        info(const ibis::part& tbl); // The constructor.
        ~info();
    };

    struct ibis::column::info {
        const char* name;         // Column name.
        const char* description;  // A description about the column.
        const double expectedMin; // The expected lower bound.
        const double expectedMax; // The expected upper bound.
        const DATA_TYPE type;     // The type of the values.

        info(const ibis::column& col);
    };

    // DATA_TYPE is declared in class ibis::column.  Each value has to be
    // preceded by ibis::column for safe use.  For example, INT has to be
    // referred to as ibis::column::INT.
    enum ibis::column::DATA_TYPE {
        RID=0,  // Row ids (8-byte)
        KEY,    // categorical (string) values, low-cardinality string values
        STRING, // arbitrary strings
        FLOAT,  // IEEE 32-bit floating-point numbers
        DOUBLE, // IEEE 64-bit floating-point numbers
        FID,    // File ids (unsigned integers)
        INT,    // signed 4-byte integers
        UINT,   // unsigned 4-byte integers
        BYTE,   // signed 1-byte integers
        UBYTE,  // unsigned 1-byte integers
        SHORT,  // signed 2-byte integers
        USHORT, // unsigned 2-byte integers
        LONG,   // signed 8-byte integers
        ULONG   // unsigned 8-byte integers
    };

Alternatively, the user may construct an ibis::part::info object by invoking its constructor.
Note: ibis::part::info and ibis::column::info objects become ill-defined if the ibis::part and ibis::column objects used to create them are deleted. Delete the info objects first.
expectedMin and expectedMax are expected values typically provided by the user who set up the data table. For example, a mass fraction is expected to be between 0 and 1. In a typical application, there is an expected range for most of the variables/columns. For one reason or another, the actual values may fall outside of this range.
The actual minimum and maximum values can be obtained by calling the following functions with the column name as the argument,

    // The actual minimum value in the named column.
    double ibis::part::getActualMin(const char *cname) const;
    // The actual maximum value in the named column.
    double ibis::part::getActualMax(const char *cname) const;

If the column object is accessible, one may call the following functions instead,

    // Compute the actual minimum value by reading the data or examining
    // the index.  It returns DBL_MAX in case of error.
    double ibis::column::getActualMin() const;
    // Compute the actual maximum value by reading the data or examining
    // the index.  It returns -DBL_MAX in case of error.
    double ibis::column::getActualMax() const;
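The difference between the expected and the actual range can be illustrated with a small standalone sketch. It does not call FastBit; the names ValueRange, actualRange, and withinExpected are hypothetical, and returning DBL_MAX and -DBL_MAX for an empty input is an assumption that merely mirrors the error convention described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <vector>

// Hypothetical holder for an observed range (not a FastBit type).
struct ValueRange {
    double min;
    double max;
};

// Scan the values and record the smallest and largest one.  For an empty
// input the result stays at {DBL_MAX, -DBL_MAX}; this is an assumption that
// mirrors the error values documented for getActualMin and getActualMax.
ValueRange actualRange(const std::vector<double>& values) {
    ValueRange r{DBL_MAX, -DBL_MAX};
    for (double v : values) {
        r.min = std::min(r.min, v);
        r.max = std::max(r.max, v);
    }
    return r;
}

// True if every observed value lies inside the user-supplied expected range.
bool withinExpected(const ValueRange& actual,
                    double expectedMin, double expectedMax) {
    assert(expectedMin <= expectedMax);
    return actual.min >= expectedMin && actual.max <= expectedMax;
}
```

For a mass fraction expected to lie between 0 and 1, a call such as withinExpected(actualRange(values), 0.0, 1.0) would flag data that strays outside the expected range.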
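The following function computes the distribution of a column as counts per bin, with the bin boundaries held in the array bounds.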
    long ibis::part::getDistribution
        (const char *cname, const char *cond,
         std::vector<double>& bounds,
         std::vector<size_t>& counts) const;

The argument cname is the name of the variable whose distribution is to be computed. A set of arbitrary conditions can be applied in selecting the records. The syntax of the conditions is the same as that of the where clause. For example, if there are two variables in the table named A and B, one may specify a set of conditions such as "A > 5 and B < 6" or "A > sqrt(B) and 1 < B < 4". If this function is called with a set of values in ascending order in the array bounds, the content of bounds will be used to define the bin boundaries. Otherwise, this function will use a simple strategy to select bin boundaries, based either on the binning structure of the index for variable cname or on a simple linear division of the range between the minimum and the maximum values selected.
Given n values in bounds, n+1 bins are defined as follows. The first bin is for any value that is less than the first value in bounds, typically displayed as (..., bounds[0]). The second bin is for the values between bounds[0] and bounds[1], more specifically [bounds[0], bounds[1]), which includes the left boundary of the bin but not the right boundary. There are n-1 such bins [bounds[i], bounds[i+1]). The last bin is for any value that is greater than or equal to bounds[n-1], that is, [bounds[n-1], ...). The array counts contains n+1 values, one for each of the bins, indicating the number of records that fall in the bin. The following is an illustration of the bins,
    bin 0:   (..., bounds[0])
    bin 1:   [bounds[0], bounds[1])
    bin 2:   [bounds[1], bounds[2])
    ...
    bin n-1: [bounds[n-2], bounds[n-1])
    bin n:   [bounds[n-1], ...)
This function returns the number of bins upon successful completion. On failure, a negative number is returned.
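The binning rule described above can be sketched in a few lines of standalone C++. This is only an illustration of the left-closed, right-open convention, not FastBit's implementation; the helper names binIndex and countBins are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of FastBit): index of the bin holding v.
// With n = bounds.size(), bin 0 is (..., bounds[0]), bin i is
// [bounds[i-1], bounds[i]), and bin n is [bounds[n-1], ...).
std::size_t binIndex(const std::vector<double>& bounds, double v) {
    assert(std::is_sorted(bounds.begin(), bounds.end()));
    // upper_bound locates the first boundary strictly greater than v, so a
    // value equal to bounds[i] lands in bin i+1 (left boundary included).
    return static_cast<std::size_t>(
        std::upper_bound(bounds.begin(), bounds.end(), v) - bounds.begin());
}

// Hypothetical helper: count how many values fall into each of the n+1 bins.
std::vector<std::size_t> countBins(const std::vector<double>& bounds,
                                   const std::vector<double>& values) {
    std::vector<std::size_t> counts(bounds.size() + 1, 0);
    for (double v : values) {
        ++counts[binIndex(bounds, v)];
    }
    return counts;
}
```

With bounds holding 10 and 20, the value 10 falls in bin 1 and the value 20 falls in the last bin, which matches the boundaries listed in the illustration.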
Suppose we have an ibis::part object called T with columns named A, B, and C. To count the values of A between 0 and 100 in 10 bins, subject to the additional conditions that B > 5 and 3 <= C < 17, we could do the following.

    const char *cond = "B > 5 and 3 <= C < 17 and 0 <= A <= 100";
    std::vector<double> bounds;
    for (int i = 10; i < 100; i += 10)
        bounds.push_back(i);
    std::vector<size_t> counts;
    long ierr = T.getDistribution("A", cond, bounds, counts);

Because the bins built by getDistribution are open-ended on both sides, it is necessary to explicitly limit the range of values for A. In the above example, we assume that the user wants to include the value 0 in the first bin and the value 100 in the last bin. If this is not the case, one needs to modify the conditions accordingly. For example, to include 0 but exclude 100, one may use the following set of conditions,

    const char *cond = "B > 5 and 3 <= C < 17 and 0 <= A < 100";
    long ibis::part::getJointDistribution
        (const char *cname1, const char *cname2, const char *cond,
         std::vector<double>& bounds1,
         std::vector<double>& bounds2,
         std::vector<size_t>& counts) const;

This function has a calling sequence very similar to that of the function getDistribution. The main difference is that there are two variable names and two sets of bin boundaries. Let n1 denote the number of values in array bounds1 and n2 denote the number of values in array bounds2. There are n1+1 bins for variable 1 and n2+1 bins for variable 2. Altogether, there are (n1+1)(n2+1) bins. Similar to getDistribution, this function also returns the number of bins upon successful completion.
The two-dimensional bins are linearized in array counts in the usual raster scan order. For each bin of variable 1, it packs all the bins defined by variable 2 one after another, as illustrated in the following diagram.

    (..., bounds1[0])         (..., bounds2[0])
    (..., bounds1[0])         [bounds2[0], bounds2[1])
    (..., bounds1[0])         [bounds2[1], bounds2[2])
    ...
    (..., bounds1[0])         [bounds2[n2-1], ...)
    [bounds1[0], bounds1[1])  (..., bounds2[0])
    [bounds1[0], bounds1[1])  [bounds2[0], bounds2[1])
    [bounds1[0], bounds1[1])  [bounds2[1], bounds2[2])
    ...
    [bounds1[1], bounds1[2])  [bounds2[n2-1], ...)
    [bounds1[2], bounds1[3])  (..., bounds2[0])
    ...
    [bounds1[n1-1], ...)      (..., bounds2[0])
    [bounds1[n1-1], ...)      [bounds2[0], bounds2[1])
    [bounds1[n1-1], ...)      [bounds2[1], bounds2[2])
    ...
    [bounds1[n1-1], ...)      [bounds2[n2-1], ...)
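The raster scan order above amounts to a simple index computation. The following standalone sketch shows one way to express it, assuming the same left-closed, right-open bins as in getDistribution; the helper names binOf, jointBinIndex, and countJointBins are hypothetical and are not FastBit functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: one-dimensional bin of v; bin 0 is (..., bounds[0])
// and the last bin is [bounds[n-1], ...).
std::size_t binOf(const std::vector<double>& bounds, double v) {
    return static_cast<std::size_t>(
        std::upper_bound(bounds.begin(), bounds.end(), v) - bounds.begin());
}

// Hypothetical helper: position of the two-dimensional bin (i1, i2) in the
// linearized counts array.  All n2+1 bins of variable 2 are stored
// contiguously for each bin of variable 1.
std::size_t jointBinIndex(std::size_t i1, std::size_t i2, std::size_t n2) {
    assert(i2 <= n2);
    return i1 * (n2 + 1) + i2;
}

// Hypothetical helper: count rows (v1[r], v2[r]) into (n1+1)(n2+1) bins.
std::vector<std::size_t> countJointBins(const std::vector<double>& bounds1,
                                        const std::vector<double>& bounds2,
                                        const std::vector<double>& v1,
                                        const std::vector<double>& v2) {
    assert(v1.size() == v2.size());
    const std::size_t n2 = bounds2.size();
    std::vector<std::size_t> counts((bounds1.size() + 1) * (n2 + 1), 0);
    for (std::size_t r = 0; r < v1.size(); ++r) {
        const std::size_t i1 = binOf(bounds1, v1[r]);
        const std::size_t i2 = binOf(bounds2, v2[r]);
        ++counts[jointBinIndex(i1, i2, n2)];
    }
    return counts;
}
```

To read back the count of a particular two-dimensional bin, a caller would use counts[i1 * (n2 + 1) + i2], where i1 and i2 are the bin numbers along variable 1 and variable 2.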
Given a table object, there are two ways to compute a column distribution: one makes use of std::vector to compute a more accurate distribution, and the other tries to fill the distribution into two user-provided arrays.

    long ibis::part::getCumulativeDistribution
        (const char *cname,
         std::vector<double>& bounds,
         std::vector<size_t>& counts) const;
    long ibis::part::getCumulativeDistribution
        (const char *cname, size_t nbc,
         double *bounds, size_t *counts) const;
    long ibis::part::getCumulativeDistribution
        (const char *cname, const char *conditions,
         std::vector<double>& bounds,
         std::vector<size_t>& counts) const;
    long ibis::part::getCumulativeDistribution
        (const char *cname, const char *conditions, size_t nbc,
         double *bounds, size_t *counts) const;

In the above functions, the argument cname is the column name (i.e., ibis::column::info::name). The value counts[i] stores the number of rows whose column cname contains values less than bounds[i]. If there are no NULL values in the column, the value of counts[0] would be zero and the last element of bounds would have a value that is larger than the actual maximum.
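The meaning of the cumulative counts can be made concrete with a short standalone sketch. It computes counts[i] as the number of values strictly less than bounds[i] and then recovers per-bin counts by differencing. This only illustrates the definition and is not how FastBit computes the distribution; cumulativeCounts and perBinCounts are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: counts[i] is the number of values strictly less
// than bounds[i].  The values are taken by copy so they can be sorted.
std::vector<std::size_t> cumulativeCounts(std::vector<double> values,
                                          const std::vector<double>& bounds) {
    assert(std::is_sorted(bounds.begin(), bounds.end()));
    std::sort(values.begin(), values.end());
    std::vector<std::size_t> counts;
    counts.reserve(bounds.size());
    for (double b : bounds) {
        // lower_bound points at the first value >= b, so its offset equals
        // the number of values strictly less than b.
        counts.push_back(static_cast<std::size_t>(
            std::lower_bound(values.begin(), values.end(), b) -
            values.begin()));
    }
    return counts;
}

// Hypothetical helper: number of values in [bounds[i], bounds[i+1]),
// obtained as the difference of consecutive cumulative counts.
std::vector<std::size_t> perBinCounts(
        const std::vector<std::size_t>& cumulative) {
    std::vector<std::size_t> bins;
    for (std::size_t i = 0; i + 1 < cumulative.size(); ++i) {
        bins.push_back(cumulative[i + 1] - cumulative[i]);
    }
    return bins;
}
```

When the first boundary equals the smallest value and the last boundary exceeds the largest value, the first cumulative count is zero and the last one equals the total number of values, in line with the description above.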
A set of conditions may be applied to restrict the computation of the data distribution. For example, if there are two variables in the table named A and B, one may specify a set of conditions such as "A > 5 and B < 6" or "A > sqrt(B) and 1 < B < 4". When a set of conditions is specified, the cumulative data distribution returned is the distribution of the column subject to the conditions. This is useful for conditional analysis.
The input argument nbc is the maximum number of values to be stored in the two pointers that appear later in the calling sequence. The actual number of values stored in bounds and counts is never more than nbc.
If a column object is available, one may directly call the following function to get the data distribution, which is equivalent to the first version of ibis::part::getCumulativeDistribution.

    long ibis::column::getCumulativeDistribution
        (std::vector<double>& bounds,
         std::vector<size_t>& counts) const;
Note: calling getCumulativeDistribution will cause a bitmap index to be created if one does not exist already. This may take some time. However, it will reduce the time required for future operations.