ibis::dictionary Class Reference

Provide a dual-directional mapping between strings and integers. More...

#include <dict-0.h>

Public Member Functions

uint32_t appendOrdered (const char *str)
 Append a string to the dictionary. More...
 
uint32_t appendOrdered (const char *)
 
void clear ()
 Clear the allocated memory. Leave only the NULL entry. More...
 
void clear ()
 
void copy (const dictionary &rhs)
 Copy function. Use copy constructor and swap the content.
 
void copy (const dictionary &rhs)
 
 dictionary (const dictionary &dic)
 Copy constructor. Places all the string in one contiguous buffer.
 
 dictionary ()
 Default constructor. Generates one (NULL) entry. More...
 
 dictionary (const dictionary &dic)
 
bool equal_to (const ibis::dictionary &) const
 Compare whether this dictionary and the other are equal in content. More...
 
bool equal_to (const ibis::dictionary &) const
 
const char * find (const char *str) const
 Find the given string in the dictionary. More...
 
const char * find (const char *str) const
 
int fromASCII (std::istream &)
 Read the ASCII formatted dictionary. More...
 
uint32_t insert (const char *str)
 Insert a string to the dictionary. More...
 
uint32_t insert (const char *, uint32_t)
 Insert a string to the specified position. More...
 
uint32_t insert (const char *)
 
uint32_t insertRaw (char *str)
 Non-copying insert. More...
 
uint32_t insertRaw (char *)
 
int merge (const dictionary &)
 Merge the incoming dictionary with this one. More...
 
int merge (const dictionary &)
 
int morph (const dictionary &, array_t< uint32_t > &) const
 Produce an array that maps the integers in old dictionary to the new one. More...
 
int morph (const dictionary &, array_t< uint32_t > &) const
 
const char * operator[] (uint32_t i) const
 Return a string corresponding to the integer. More...
 
uint32_t operator[] (const char *str) const
 Convert a string to its integer code. More...
 
const char * operator[] (uint32_t i) const
 
uint32_t operator[] (const char *str) const
 
void patternSearch (const char *pat, array_t< uint32_t > &matches) const
 Find all codes that match the SQL LIKE pattern. More...
 
void patternSearch (const char *pat, array_t< uint32_t > &matches) const
 
int read (const char *name)
 Read the content of the named file. More...
 
int read (const char *)
 
uint32_t size () const
 Return the number of valid (not null) strings in the dictionary.
 
uint32_t size () const
 Return the number of entries in the dictionary. More...
 
void sort (array_t< uint32_t > &)
 Reassign the integer values to the strings. More...
 
void sort (array_t< uint32_t > &)
 
void swap (dictionary &)
 Swap the content of two dictionaries.
 
void swap (dictionary &)
 
void toASCII (std::ostream &) const
 Output the current content in ASCII format. More...
 
int write (const char *name) const
 Write the content of the dictionary to the named file. More...
 
int write (const char *) const
 

Protected Types

typedef std::unordered_map< const char *, uint32_t, std::hash< const char * >, std::equal_to< const char * > > MYMAP
 Member variable key_ contains the hash_map that connects a string value to an integer. More...
 

Protected Member Functions

void mergeBuffers () const
 Merge all buffers into a single one. More...
 
int readKeys (const char *, FILE *)
 Read the ordered strings. More...
 
int readKeys0 (const char *, FILE *)
 Read the string values. More...
 
int readKeys1 (const char *, FILE *)
 Read the string values. More...
 
int readKeys2 (const char *, FILE *)
 Read the string values. More...
 
int readRaw (const char *, FILE *)
 Read the raw strings. More...
 
int readRaw (const char *, FILE *)
 
int writeBuffer (FILE *, uint32_t, array_t< uint64_t > &, array_t< uint32_t > &) const
 Write the buffer out directly. More...
 
int writeKeys (FILE *, uint32_t, array_t< uint64_t > &, array_t< uint32_t > &) const
 Write the dictionary one keyword at a time. More...
 

Protected Attributes

array_t< char * > buffer_
 Member variable buffer_ contains a list of pointers to the memory that holds the strings. More...
 
array_t< uint32_t > code_
 Member variable code_ contains the integer code for each string in key_. More...
 
array_t< const char * > key_
 Member variable key_ contains the string values in alphabetic order.
 
MYMAP key_
 
array_t< const char * > raw_
 Member variable raw_ contains the string values in the order of the code assignment. More...
 

Detailed Description

Provide a dual-directional mapping between strings and integers.

A utility class used by ibis::category. Both the NULL string and the empty string are mapped to 0.

Note
If FASTBIT_CASE_SENSITIVE_COMPARE is defined to be 0, the values stored in a dictionary will be folded to the upper case. This will allow the words in the dictionary to be stored in a simple sorted order. By default, the dictionary is case sensitive.

A utility class used by ibis::category. The integer values are always treated as 32-bit unsigned integers. The NULL string is always mapped to 0xFFFFFFFF (-1U) and is NOT counted as an entry in a dictionary.

This version uses an in-memory hash_map to provide a mapping from a string to an integer.

Note
The integer returned from this class is an unsigned 32-bit integer (uint32_t). This limits the size of the dictionary to no more than 2^32 entries. The dictionary file is written with 64-bit internal pointers. However, since the dictionary has to be read into memory completely before any use, the size of a dictionary is generally limited by the size of the computer memory.
If FASTBIT_CASE_SENSITIVE_COMPARE is defined to be 0, the values stored in a dictionary will be folded to the upper case. This will allow the words in the dictionary to be stored in a simple sorted order. By default, the dictionary is case sensitive.
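
As a rough illustration of the dual-directional mapping described above, the sketch below pairs a hash map (string to code) with a vector (code to string). It is not the FastBit implementation: the class name MiniDictionary and all of its members are hypothetical, it stores std::string values rather than raw char buffers, and it assumes the convention of the hash_map version, where codes run from 0 to size()-1, the NULL string maps to 0xFFFFFFFF, and an unknown string maps to size().

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Code reserved for the NULL string (assumed convention, see lead-in).
const uint32_t kNullCode = 0xFFFFFFFFU;

// Hypothetical stand-in for ibis::dictionary; not the FastBit class.
class MiniDictionary {
public:
    // Insert a string and return its code; a repeated string keeps its code.
    uint32_t insert(const char* str) {
        if (str == nullptr) return kNullCode;
        auto it = codes_.find(str);
        if (it != codes_.end()) return it->second;
        assert(raw_.size() < kNullCode);  // codes must fit in 32 bits
        const uint32_t code = static_cast<uint32_t>(raw_.size());
        raw_.emplace_back(str);
        codes_.emplace(raw_.back(), code);
        return code;
    }

    // String to code: size() signals an unknown string.
    uint32_t operator[](const char* str) const {
        if (str == nullptr) return kNullCode;
        auto it = codes_.find(str);
        return it == codes_.end() ? size() : it->second;
    }

    // Code to string: a null pointer signals an out-of-range code.
    // The returned pointer is only valid until the next insert.
    const char* operator[](uint32_t i) const {
        return i < raw_.size() ? raw_[i].c_str() : nullptr;
    }

    uint32_t size() const { return static_cast<uint32_t>(raw_.size()); }

private:
    std::unordered_map<std::string, uint32_t> codes_;  // string -> code
    std::vector<std::string> raw_;                     // code -> string
};
```

Because both operator[] overloads accept a literal 0, callers of this sketch should index with a uint32_t variable rather than a bare integer literal to avoid an ambiguous call.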

Member Typedef Documentation

typedef std::unordered_map<const char*, uint32_t, std::hash<const char*>, std::equal_to<const char*> > ibis::dictionary::MYMAP
protected

Member variable key_ contains the hash_map that connects a string value to an integer.

Constructor & Destructor Documentation

ibis::dictionary::dictionary ( )
inline

Default constructor. Generates one (NULL) entry.

Default constructor.

Member Function Documentation

uint32_t ibis::dictionary::appendOrdered ( const char *  str)

Append a string to the dictionary.

Returns the integer value assigned to the string. A copy of the string is stored internally.

This function assumes the incoming string is ordered after all strings known to this dictionary object. In other words, this function expects the strings to be passed in sorted (ascending) order. It does not attempt to check that the incoming string is indeed ordered after all known strings. However, if this assumption is violated, the resulting dictionary will not work properly.

Note
The incoming string is copied to this object.

Returns the integer value assigned to the string. A copy of the string is stored internally.

This function assumes the incoming string is ordered after all strings known to this dictionary object. In other words, this function expects the strings to be given in sorted (ascending) order. It does not attempt to check that the incoming string is indeed ordered. What this function relies on is that the incoming string is not a repeat of any existing string.

References ibis::util::copy(), and ibis::util::strnewdup().

void ibis::dictionary::clear ( )

Clear the allocated memory. Leave only the NULL entry.

Clear the allocated memory.

bool ibis::dictionary::equal_to ( const ibis::dictionary &  other) const

Compare whether this dictionary and the other are equal in content.

The two dictionaries are considered the same only if they have the same keys and the same integer representations.

The two dictionaries are considered same only if they have the same keys in the same order.

References code_, key_, and ibis::array_t< T >::size().

Referenced by ibis::bord::bord(), and ibis::category::setDictionary().

const char * ibis::dictionary::find ( const char *  str) const
inline

Find the given string in the dictionary.

If the input string is found in the dictionary, it returns the string. Otherwise it returns a null pointer. This function makes it a little easier to determine whether a string is in a dictionary.

Referenced by insert().

int ibis::dictionary::fromASCII ( std::istream &  in)

Read the ASCII formatted dictionary.

This is meant to be the reverse of toASCII, where each line of the input stream contains a positive integer followed by a string value, with an optional ':' (plus white space) as separators.

The new entries read from the incoming I/O stream are merged with the existing dictionary. If the string has already been assigned a code, the existing code will be used. If the given code has been used for another string, the incoming string will be assigned a new code. Warning messages will be printed to the logging channel when such a conflict is encountered.
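
The line format described above can be illustrated with a small parser. This is only a sketch under stated assumptions: parseDictionaryLine is a hypothetical helper, not part of FastBit, and it ignores any quoting or escaping that the real reader (which relies on ibis::util::readUInt and ibis::util::getString) may support.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helpers; not part of the FastBit API.
inline bool isSpaceAt(const std::string& s, std::size_t i) {
    return i < s.size() && std::isspace(static_cast<unsigned char>(s[i])) != 0;
}
inline bool isDigitAt(const std::string& s, std::size_t i) {
    return i < s.size() && std::isdigit(static_cast<unsigned char>(s[i])) != 0;
}

// Parse one line of the form "number: string" (the ':' is optional).
// Returns true and fills code/value on success; leaves them untouched otherwise.
inline bool parseDictionaryLine(const std::string& line,
                                uint32_t& code, std::string& value) {
    std::size_t i = 0;
    while (isSpaceAt(line, i)) ++i;
    if (!isDigitAt(line, i)) return false;       // the line must start with a number
    uint64_t n = 0;
    while (isDigitAt(line, i)) {
        n = n * 10 + static_cast<uint64_t>(line[i] - '0');
        if (n > 0xFFFFFFFFULL) return false;     // does not fit in uint32_t
        ++i;
    }
    // Skip the optional ':' and the white space around it.
    while (isSpaceAt(line, i)) ++i;
    if (i < line.size() && line[i] == ':') ++i;
    while (isSpaceAt(line, i)) ++i;
    if (i >= line.size()) return false;          // no string value on this line
    code = static_cast<uint32_t>(n);
    value = line.substr(i);
    assert(!value.empty());
    return true;
}
```

A caller would read the stream line by line (for example with std::getline), pass each line to this helper, and then decide how to resolve conflicts between the parsed code and the codes already in use.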

References ibis::fileManager::buffer< T >::address(), ibis::util::getString(), ibis::util::readUInt(), ibis::fileManager::buffer< T >::resize(), and ibis::fileManager::buffer< T >::size().

Referenced by ibis::tafel::writeMetaData().

uint32_t ibis::dictionary::insert ( const char *  str)

Insert a string to the dictionary.

Returns the integer value assigned to the string. A copy of the string is stored internally.

References ibis::util::copy(), and ibis::util::strnewdup().

Referenced by ibis::category::category(), ibis::bord::column::selectUInts(), and ibis::column::string2int().

uint32_t ibis::dictionary::insert ( const char *  str,
uint32_t  pos 
)

Insert a string to the specified position.

Returns the integer value assigned to the string. A copy of the string is stored in the dictionary object.

If the incoming string value is already in the dictionary, the existing entry is erased and a new entry is inserted. If the specified position is already occupied, the existing entry is erased and a new entry is inserted. This is meant for users to update a dictionary; however, it may cause two existing entries to be erased. These erased entries could invalidate dependent data structures such as indexes and .int files.

Warning
Use this function only to build a new dictionary.

References ibis::util::copy(), find(), and ibis::util::strnewdup().

uint32_t ibis::dictionary::insertRaw ( char *  str)

Non-copying insert.

Non-copying insertion.

Do not make a copy of the input string. Transfers the ownership of str to the dictionary. Caller needs to check whether it is a new word in the dictionary. If it is not a new word in the dictionary, the dictionary does not take ownership of the string argument.

int ibis::dictionary::merge ( const dictionary &  rhs)

Merge the incoming dictionary with this one.

It produces a dictionary that combines the words in both dictionaries and keeps the words in ascending order.

Upon successful completion of this function, the return value will be the new size of the dictionary, i.e., the number of non-empty words. It returns a negative value to indicate error.

It produces a dictionary that combines the words in both dictionaries. Existing words in the current dictionary will keep their current assignment.

Upon successful completion of this function, the return value will be the new size of the dictionary.
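
The second behaviour described above (existing words keep their current assignment, new words receive fresh codes) can be sketched as follows. WordList, addWord and mergeInto are hypothetical names used only for illustration; the sketch assumes codes are assigned consecutively from 0 and does not reproduce the buffer handling of the real class.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical container; not ibis::dictionary.
struct WordList {
    std::vector<std::string> raw;                     // code -> string
    std::unordered_map<std::string, uint32_t> code;   // string -> code
};

// Add a word if it is new and return its code.
inline uint32_t addWord(WordList& d, const std::string& w) {
    auto it = d.code.find(w);
    if (it != d.code.end()) return it->second;        // known word keeps its code
    const uint32_t c = static_cast<uint32_t>(d.raw.size());
    d.raw.push_back(w);
    d.code.emplace(w, c);
    return c;
}

// Merge rhs into lhs and return the new size of lhs.
inline int mergeInto(WordList& lhs, const WordList& rhs) {
    if (&lhs != &rhs) {                               // merging with itself is a no-op
        for (const std::string& w : rhs.raw) addWord(lhs, w);
    }
    assert(lhs.raw.size() == lhs.code.size());
    return static_cast<int>(lhs.raw.size());
}
```

Because existing codes never change in this variant, data already encoded with the left-hand dictionary stays valid after the merge; the sorted variant instead requires a remapping step such as the one produced by morph or sort.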

References key_, ibis::array_t< T >::push_back(), ibis::array_t< T >::reserve(), ibis::array_t< T >::size(), ibis::util::strnewdup(), and ibis::array_t< T >::swap().

Referenced by ibis::mensa::mergeCategories().

void ibis::dictionary::mergeBuffers ( ) const
protected

Merge all buffers into a single one.

New memory is allocated to store the string values together if they are stored in different locations currently.

Note
Logically, this function does not change the content of the dictionary, but it actually needs to change a number of pointers. The implementation of the function uses the copy-swap idiom to take advantage of the copy constructor.

int ibis::dictionary::morph ( const dictionary &  old,
ibis::array_t< uint32_t > &  o2n 
) const

Produce an array that maps the integers in old dictionary to the new one.

The incoming dictionary represents the old dictionary; this dictionary represents the new one.

Upon successful completion of this function, the array o2n will have (old.size()+1) elements, where the new value for the old code i is stored as o2n[i].
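
To make the shape of o2n concrete, here is a minimal sketch of building such an old-to-new array. It assumes the convention in which code 0 is reserved for the NULL string and real words use codes 1 through size(), which is why the array has one more element than the old dictionary has words. The function name morphSketch and the choice to return -1 for a word missing from the new dictionary are assumptions, not documented FastBit behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// oldWords[k] is the word that had old code k+1 (code 0 is the NULL string).
// newCodes maps each word to its code in the new dictionary.
// Returns the number of elements in o2n, or -1 if an old word is missing
// from the new dictionary (an assumed error convention).
inline int morphSketch(const std::vector<std::string>& oldWords,
                       const std::unordered_map<std::string, uint32_t>& newCodes,
                       std::vector<uint32_t>& o2n) {
    o2n.assign(oldWords.size() + 1, 0);   // o2n[0] == 0: NULL stays NULL
    for (std::size_t k = 0; k < oldWords.size(); ++k) {
        auto it = newCodes.find(oldWords[k]);
        if (it == newCodes.end()) return -1;
        o2n[k + 1] = it->second;          // new value for old code k+1
    }
    assert(o2n.size() == oldWords.size() + 1);
    return static_cast<int>(o2n.size());
}
```

A column of integer codes written against the old dictionary can then be updated in one pass by replacing every stored value v with o2n[v].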

References code_, key_, ibis::array_t< T >::resize(), and ibis::array_t< T >::size().

Referenced by ibis::category::setDictionary().

const char * ibis::dictionary::operator[] ( uint32_t  i) const
inline

Return a string corresponding to the integer.

If the index is beyond the valid range, i.e., i > size(), then a null pointer will be returned.

uint32_t ibis::dictionary::operator[] ( const char *  str) const

Convert a string to its integer code.

Returns 0 for empty (null) strings, 1:size() for strings in the dictionary, and dictionary::size()+1 for unknown values.

Returns 0xFFFFFFFFU for null strings, 0:size()-1 for strings in the dictionary, and dictionary::size() for unknown values.

void ibis::dictionary::patternSearch ( const char *  pat,
array_t< uint32_t > &  matches 
) const

Find all codes that match the SQL LIKE pattern.

If the pattern is null or empty, matches is not changed.
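
For readers unfamiliar with LIKE patterns, the sketch below shows one way such a search could work. It is a simplified stand-in, not the FastBit code: likeMatch and patternSearchSketch are hypothetical names, only the wildcards '%' (any run of characters) and '_' (exactly one character) are handled, and escape characters or case folding that ibis::util::strMatch may support are ignored. Codes are assumed to be the positions of the words in the vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Match s against a LIKE pattern p, backtracking to the most recent '%'.
inline bool likeMatch(const char* s, const char* p) {
    assert(s != nullptr && p != nullptr);
    const char* star = nullptr;   // pattern position just after the last '%'
    const char* mark = nullptr;   // string position where that '%' started
    while (*s != '\0') {
        if (*p == '%') {
            star = ++p;           // remember where to resume in the pattern
            mark = s;
        } else if (*p == '_' || *p == *s) {
            ++p;
            ++s;
        } else if (star != nullptr) {
            p = star;             // let the last '%' absorb one more character
            s = ++mark;
        } else {
            return false;
        }
    }
    while (*p == '%') ++p;        // trailing '%' may match the empty string
    return *p == '\0';
}

// Append the code of every matching word; leave matches untouched when
// the pattern is null or empty.
inline void patternSearchSketch(const std::vector<std::string>& words,
                                const char* pat,
                                std::vector<uint32_t>& matches) {
    if (pat == nullptr || *pat == '\0') return;
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (likeMatch(words[i].c_str(), pat)) {
            matches.push_back(static_cast<uint32_t>(i));
        }
    }
}
```

Every word is tested in turn here, so the cost grows linearly with the dictionary size; whether the real implementation narrows the candidate range first is not stated in this page.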

References ibis::array_t< T >::push_back(), and ibis::util::strMatch().

int ibis::dictionary::read ( const char *  name)

Read the content of the named file.

The file content is read into the buffer in one-shot and then digested.

This function determines the version of the dictionary and invokes the necessary reading function to perform the actual reading operations. Currently there are three possible versions of dictionaries with a header, plus an unmarked one:

  • 0x02000000 - the version produced by the current write function,
  • 0x01000000 - the version with 64-bit offsets and consecutive keys, where strings are stored in key order,
  • 0x00000000 - the version with 32-bit offsets that stores strings in sorted order,
  • unmarked - the version without a header, which only has the bare strings in the code order.

int ibis::dictionary::readKeys ( const char *  evt,
FILE *  fptr 
)
protected

Read the ordered strings.

This function processes the data produced by the write function. On successful completion, it returns 0.

References ibis::util::clear().

int ibis::dictionary::readKeys0 ( const char *  evt,
FILE *  fptr 
)
protected

Read the string values.

This function processes the data produced by version 0x00000000 of the write function. On successful completion, it returns 0.

Note that this function assumes the 20-byte header has been read already.

References ibis::util::clear().

int ibis::dictionary::readKeys1 ( const char *  evt,
FILE *  fptr 
)
protected

Read the string values.

This function processes the data produced by version 0x01000000 of the write function. On successful completion, it returns 0.

References ibis::util::clear().

int ibis::dictionary::readKeys2 ( const char *  evt,
FILE *  fptr 
)
protected

Read the string values.

This function processes the data produced by version 0x02000000 of the write function. On successful completion, it returns 0.

References ibis::util::clear(), and ibis::array_t< T >::resize().

int ibis::dictionary::readRaw ( const char *  evt,
FILE *  fptr 
)
protected

Read the raw strings.

This is the older style dictionary that contains the raw strings. On successful completion, this function returns 1.

This is for the oldest style dictionary that contains the raw strings. There is no header in the dictionary file, therefore this function has to rewind back to the beginning of the file. On successful completion, this function returns 0.

References ibis::util::clear(), and ibis::util::sortStrings().

uint32_t ibis::dictionary::size ( ) const
inline

Return the number of entries in the dictionary.

May have undefined entries.

void ibis::dictionary::sort ( ibis::array_t< uint32_t > &  o2n)

Reassign the integer values to the strings.

Upon successful completion of this function, the integer values assigned to the strings will be in ascending order. In other words, string values that are lexicographically smaller will have smaller integer representations.

The argument to this function carries the permutation information needed to turn the previous integer assignments into the new ones. If the previous assignment was k, the new assignment will be o2n[k]. Note that the name o2n is shorthand for old-to-new.
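
The relationship between the sorted order and o2n can be sketched with standard containers. This is an illustration only: sortSketch is a hypothetical function, it assumes codes are simply positions 0 to n-1 in a vector of strings, and it compares strings with the default std::string ordering rather than whatever comparison FastBit is configured to use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Reorder raw so that smaller strings get smaller codes, and fill o2n so
// that a word with old code k ends up with new code o2n[k].
inline void sortSketch(std::vector<std::string>& raw,
                       std::vector<uint32_t>& o2n) {
    const std::size_t n = raw.size();
    std::vector<uint32_t> order(n);            // order[newCode] == oldCode
    std::iota(order.begin(), order.end(), 0U);
    std::sort(order.begin(), order.end(),
              [&raw](uint32_t a, uint32_t b) { return raw[a] < raw[b]; });

    o2n.resize(n);
    std::vector<std::string> sorted(n);
    for (std::size_t j = 0; j < n; ++j) {
        o2n[order[j]] = static_cast<uint32_t>(j);   // invert the permutation
        sorted[j] = raw[order[j]];
    }
    raw.swap(sorted);
    assert(o2n.size() == raw.size());
}
```

Sorting an index array instead of the strings themselves is what makes the permutation available afterwards; o2n is simply the inverse of that index array.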

References ibis::array_t< T >::resize().

Referenced by ibis::mensa::mergeCategories().

void ibis::dictionary::toASCII ( std::ostream &  out) const

Output the current content in ASCII format.

Each non-empty entry is printed in the format of "number: string".

int ibis::dictionary::write ( const char *  name) const

Write the content of the dictionary to the named file.

The existing content in the named file is overwritten. The content of the dictionary file is laid out as follows.

  • Signature "#IBIS Dictionary " and version number (currently 0). (20 bytes)
  • N = Number of strings in the file. (4 bytes)
  • uint32_t[N]: the integer values assigned to the strings.
  • uint32_t[N+1]: the starting positions of the strings in this file.
  • the string values packed one after the other with nil terminators.

The existing content in the named file is overwritten. The content of the dictionary file is laid out as follows.

  • Signature "#IBIS Dictionary " and version number (currently 0x02000000). (20 bytes)
  • N = Number of strings in the file. (4 bytes)
  • uint64_t[N+1]: the starting positions of the strings in this file.
  • uint32_t[N]: The integer code corresponding to each string value.
  • the string values packed one after the other with their nil terminators.
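
The second layout above implies that the position of each string can be computed before any string is written. The sketch below does that arithmetic. It rests on assumptions that this page does not confirm: that the positions are absolute byte offsets from the start of the file, that the sections appear back to back with no padding, and that the extra (N+1)-th position marks the end of the last string. The function name stringOffsets is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Compute the N+1 starting positions for N strings, assuming the layout:
// 20-byte header, 4-byte count, uint64_t[N+1] positions, uint32_t[N] codes,
// then the strings, each followed by one nil terminator.
inline std::vector<uint64_t> stringOffsets(const std::vector<std::string>& words) {
    const uint64_t n = words.size();
    uint64_t pos = 20 + 4 + 8 * (n + 1) + 4 * n;   // first byte after the tables
    std::vector<uint64_t> offsets;
    offsets.reserve(words.size() + 1);
    for (const std::string& w : words) {
        offsets.push_back(pos);
        pos += w.size() + 1;                        // string bytes plus terminator
    }
    offsets.push_back(pos);                         // assumed: end of the last string
    assert(offsets.size() == n + 1);
    return offsets;
}
```

Under these assumptions the length of string i is offsets[i+1] - offsets[i] - 1, which is one reason for storing N+1 positions rather than N.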

Referenced by ibis::bord::backup(), ibis::category::category(), and ibis::tafel::writeMetaData().

int ibis::dictionary::writeBuffer ( FILE *  fptr,
uint32_t  nkeys,
array_t< uint64_t > &  pos,
array_t< uint32_t > &  qos 
) const
protected

Write the buffer out directly.

This function is intended to be used by dictionary::write, and the following conditions must be satisfied: there must be only one buffer, and the raw_ must be ordered in that buffer. Under these conditions, we can write the buffer using a single sequential write operation, which should reduce the I/O time. The easiest way to satisfy these conditions is to invoke mergeBuffers.

int ibis::dictionary::writeKeys ( FILE *  fptr,
uint32_t  nkeys,
array_t< uint64_t > &  pos,
array_t< uint32_t > &  qos 
) const
protected

Write the dictionary one keyword at a time.

This version requires one write call for each keyword, which can be time consuming when there are many keywords.

References ibis::array_t< T >::clear(), and ibis::array_t< T >::push_back().

Member Data Documentation

array_t< char * > ibis::dictionary::buffer_
protected

Member variable buffer_ contains a list of pointers to the memory that holds the strings.

Referenced by dictionary(), and swap().

array_t<uint32_t> ibis::dictionary::code_
protected

Member variable code_ contains the integer code for each string in key_.

Referenced by dictionary(), equal_to(), morph(), and swap().

array_t< const char * > ibis::dictionary::raw_
protected

Member variable raw_ contains the string values in the order of the code assignment.

Referenced by dictionary(), and swap().

