Provide a dual-directional mapping between strings and integers. More...
#include <dict-0.h>
Public Member Functions | |
uint32_t | appendOrdered (const char *str) |
Append a string to the dictionary. More... | |
uint32_t | appendOrdered (const char *) |
void | clear () |
Clear the allocated memory. Leave only the NULL entry. More... | |
void | copy (const dictionary &rhs) |
Copy function. Use copy constructor and swap the content. | |
dictionary (const dictionary &dic) | |
Copy constructor. Places all the strings in one contiguous buffer. | |
dictionary () | |
Default constructor. Generates one (NULL) entry. More... | |
dictionary (const dictionary &dic) | |
bool | equal_to (const ibis::dictionary &) const |
Compare whether this dictionary and the other are equal in content. More... | |
const char * | find (const char *str) const |
Find the given string in the dictionary. More... | |
int | fromASCII (std::istream &) |
Read the ASCII formatted dictionary. More... | |
uint32_t | insert (const char *str) |
Insert a string to the dictionary. More... | |
uint32_t | insert (const char *, uint32_t) |
Insert a string to the specified position. More... | |
uint32_t | insert (const char *) |
uint32_t | insertRaw (char *str) |
Non-copying insert. More... | |
uint32_t | insertRaw (char *) |
int | merge (const dictionary &) |
Merge the incoming dictionary with this one. More... | |
int | morph (const dictionary &, array_t< uint32_t > &) const |
Produce an array that maps the integers in old dictionary to the new one. More... | |
const char * | operator[] (uint32_t i) const |
Return a string corresponding to the integer. More... | |
uint32_t | operator[] (const char *str) const |
Convert a string to its integer code. More... | |
void | patternSearch (const char *pat, array_t< uint32_t > &matches) const |
Find all codes that match the SQL LIKE pattern. More... | |
int | read (const char *name) |
Read the content of the named file. More... | |
int | read (const char *) |
uint32_t | size () const |
Return the number of valid (not null) strings in the dictionary. | |
uint32_t | size () const |
Return the number of entries in the dictionary. More... | |
void | sort (array_t< uint32_t > &) |
Reassign the integer values to the strings. More... | |
void | swap (dictionary &) |
Swap the content of two dictionaries. | |
void | toASCII (std::ostream &) const |
Output the current content in ASCII format. More... | |
int | write (const char *name) const |
Write the content of the dictionary to the named file. More... | |
int | write (const char *) const |
Protected Types | |
typedef std::unordered_map< const char *, uint32_t, std::hash< const char * >, std::equal_to< const char * > > | MYMAP |
Member variable key_ contains the hash_map that connects a string value to an integer. More... | |
Protected Member Functions | |
void | mergeBuffers () const |
Merge all buffers into a single one. More... | |
int | readKeys (const char *, FILE *) |
Read the ordered strings. More... | |
int | readKeys0 (const char *, FILE *) |
Read the string values. More... | |
int | readKeys1 (const char *, FILE *) |
Read the string values. More... | |
int | readKeys2 (const char *, FILE *) |
Read the string values. More... | |
int | readRaw (const char *, FILE *) |
Read the raw strings. More... | |
int | writeBuffer (FILE *, uint32_t, array_t< uint64_t > &, array_t< uint32_t > &) const |
Write the buffer out directly. More... | |
int | writeKeys (FILE *, uint32_t, array_t< uint64_t > &, array_t< uint32_t > &) const |
Write the dictionary one keyword at a time. More... | |
Protected Attributes | |
array_t< char * > | buffer_ |
Member variable buffer_ contains a list of pointers to the memory that holds the strings. More... | |
array_t< uint32_t > | code_ |
Member variable code_ contains the integer code for each string in key_. More... | |
array_t< const char * > | key_ |
Member variable key_ contains the string values in alphabetic order. | |
MYMAP | key_ |
array_t< const char * > | raw_ |
Member variable raw_ contains the string values in the order of the code assignment. More... | |
Provide a dual-directional mapping between strings and integers.
A utility class used by ibis::category. Both the NULL string and the empty string are mapped to 0.
A utility class used by ibis::category. The integer values are always treated as 32-bit unsigned integers. The NULL string is always mapped to 0xFFFFFFFF (-1U) and is NOT counted as an entry in a dictionary.
This version uses an in-memory hash_map to provide a mapping from a string to an integer.
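The two-way mapping described above can be illustrated with a minimal sketch. It is not the FastBit implementation: the class name `SketchDictionary`, its member names, and the constant `kNullCode` are hypothetical, and it assumes the convention of the hash_map version, where codes run from 0 to size()-1, the null string maps to 0xFFFFFFFF, and an unknown string maps to size().

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical constant standing in for the code of the NULL string.
constexpr uint32_t kNullCode = 0xFFFFFFFFU;

// Hypothetical stand-in for a hash-map based dictionary; not the FastBit class.
class SketchDictionary {
public:
    // Return the code for str, assigning the next free code if str is new.
    uint32_t insert(const char* str) {
        if (str == nullptr) return kNullCode;
        auto it = codes_.find(str);
        if (it != codes_.end()) return it->second;
        const uint32_t code = static_cast<uint32_t>(strings_.size());
        strings_.push_back(str);
        codes_.emplace(str, code);
        return code;
    }

    // String -> code. Unknown strings map to size().
    uint32_t code(const char* str) const {
        if (str == nullptr) return kNullCode;
        auto it = codes_.find(str);
        return it != codes_.end() ? it->second : size();
    }

    // Code -> string. Out-of-range codes yield a null pointer.
    // The pointer is only valid until the next insert.
    const char* str(uint32_t code) const {
        return code < strings_.size() ? strings_[code].c_str() : nullptr;
    }

    uint32_t size() const { return static_cast<uint32_t>(strings_.size()); }

private:
    std::vector<std::string> strings_;                 // code -> string
    std::unordered_map<std::string, uint32_t> codes_;  // string -> code
};

// Usage: repeated inserts return the same code; unknown strings map to size().
inline void demoSketchDictionary() {
    SketchDictionary d;
    assert(d.insert("apple") == 0);
    assert(d.insert("pear") == 1);
    assert(d.insert("apple") == 0);
    assert(d.code("kiwi") == d.size());
    assert(d.code(nullptr) == kNullCode);
}
```

Keeping a vector for the code-to-string direction and a hash map for the string-to-code direction gives constant-time lookups both ways, at the cost of storing each string twice in this simplified form.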
|
protected |
Member variable key_ contains the hash_map that connects a string value to an integer.
|
inline |
Default constructor. Generates one (NULL) entry.
Default constructor.
uint32_t ibis::dictionary::appendOrdered | ( | const char * | str | ) |
Append a string to the dictionary.
Returns the integer value assigned to the string. A copy of the string is stored internally.
This function assumes the incoming string is ordered after all strings known to this dictionary object. In other words, this function expects the strings to be passed in sorted (ascending) order. It does not attempt to check that the incoming string is indeed ordered after all known strings. However, if this assumption is violated, the resulting dictionary will not work properly.
This function assumes the incoming string is ordered after all strings known to this dictionary object. In other words, this function expects the strings to be given in sorted (ascending) order. It does not attempt to check that the incoming string is indeed ordered. What this function relies on is that the incoming string is not a repeat of any existing string.
References ibis::util::copy(), and ibis::util::strnewdup().
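The ordering contract of appendOrdered can be sketched as follows. This is an illustration only, assuming a dictionary that keeps its keys in a sorted array and looks them up by binary search; `OrderedSketch` and its members are hypothetical names, not part of FastBit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sorted-key dictionary; a string's code is its position.
struct OrderedSketch {
    std::vector<std::string> keys;  // assumed to stay in ascending order

    // Append without checking the order, mirroring the documented contract.
    uint32_t appendOrdered(const std::string& s) {
        keys.push_back(s);
        return static_cast<uint32_t>(keys.size() - 1);
    }

    // Binary search; only meaningful while keys is sorted.
    bool contains(const std::string& s) const {
        return std::binary_search(keys.begin(), keys.end(), s);
    }

    // Explicit check a caller could run, since appendOrdered does not.
    bool isSorted() const {
        return std::is_sorted(keys.begin(), keys.end());
    }
};

// Usage: appending in ascending order keeps lookups valid.
inline void demoOrderedSketch() {
    OrderedSketch d;
    d.appendOrdered("apple");
    d.appendOrdered("kiwi");
    d.appendOrdered("pear");
    assert(d.isSorted());
    assert(d.contains("kiwi"));
    d.appendOrdered("banana");  // violates the ordering assumption
    assert(!d.isSorted());      // binary-search lookups can no longer be trusted
}
```

Skipping the order check keeps each append cheap, which is why the responsibility for supplying sorted input falls on the caller.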
void ibis::dictionary::clear | ( | ) |
Clear the allocated memory. Leave only the NULL entry.
Clear the allocated memory.
bool ibis::dictionary::equal_to | ( | const ibis::dictionary & | other | ) | const |
Compare whether this dictionary and the other are equal in content.
The two dictionaries are considered the same only if they have the same keys and the same integer representations.
The two dictionaries are considered the same only if they have the same keys in the same order.
References code_, key_, and ibis::array_t< T >::size().
Referenced by ibis::bord::bord(), and ibis::category::setDictionary().
|
inline |
Find the given string in the dictionary.
If the input string is found in the dictionary, it returns the string. Otherwise it returns a null pointer. This function makes it a little easier to determine whether a string is in a dictionary.
Referenced by insert().
int ibis::dictionary::fromASCII | ( | std::istream & | in | ) |
Read the ASCII formatted dictionary.
This is meant to be the reverse of toASCII, where each line of the input stream contains a positive integer followed by a string value, with an optional ':' (plus white space) as separators.
The new entries read from the incoming I/O stream are merged with the existing dictionary. If the string has already been assigned a code, the existing code will be used. If the given code has been used for another string, the incoming string will be assigned a new code. Warning messages will be printed to the logging channel when such a conflict is encountered.
References ibis::fileManager::buffer< T >::address(), ibis::util::getString(), ibis::util::readUInt(), ibis::fileManager::buffer< T >::resize(), and ibis::fileManager::buffer< T >::size().
Referenced by ibis::tafel::writeMetaData().
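The line format accepted by fromASCII can be illustrated with a small parser. This is a sketch under stated assumptions: the function name `parseAsciiDictionary` is hypothetical, lines without a leading integer are simply skipped, quoting is not handled, and the merge and conflict-resolution step described above is left out.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Parse lines such as "3: pear" or "3 pear" into (code, string) pairs.
// Lines that do not start with an unsigned integer are skipped.
inline std::vector<std::pair<uint32_t, std::string>>
parseAsciiDictionary(std::istream& in) {
    std::vector<std::pair<uint32_t, std::string>> entries;
    std::string line;
    while (std::getline(in, line)) {
        std::size_t pos = 0;
        // Skip leading white space.
        while (pos < line.size() &&
               std::isspace(static_cast<unsigned char>(line[pos]))) ++pos;
        const std::size_t start = pos;
        uint64_t code = 0;
        while (pos < line.size() &&
               std::isdigit(static_cast<unsigned char>(line[pos]))) {
            code = code * 10 + static_cast<uint64_t>(line[pos] - '0');
            ++pos;
        }
        if (pos == start) continue;  // no leading integer on this line
        // Skip the optional ':' and the white space around it.
        while (pos < line.size() &&
               (line[pos] == ':' ||
                std::isspace(static_cast<unsigned char>(line[pos])))) ++pos;
        entries.emplace_back(static_cast<uint32_t>(code), line.substr(pos));
    }
    return entries;
}

// Usage: both "1: apple" and "2 pear" are accepted.
inline void demoParseAsciiDictionary() {
    std::istringstream in("1: apple\n2 pear\n");
    const auto entries = parseAsciiDictionary(in);
    assert(entries.size() == 2);
    assert(entries[0].first == 1 && entries[0].second == "apple");
    assert(entries[1].first == 2 && entries[1].second == "pear");
}
```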
uint32_t ibis::dictionary::insert | ( | const char * | str | ) |
Insert a string to the dictionary.
Returns the integer value assigned to the string. A copy of the string is stored internally.
References ibis::util::copy(), and ibis::util::strnewdup().
Referenced by ibis::category::category(), ibis::bord::column::selectUInts(), and ibis::column::string2int().
uint32_t ibis::dictionary::insert | ( | const char * | str, |
uint32_t | pos | ||
) |
Insert a string to the specified position.
Returns the integer value assigned to the string. A copy of the string is stored in the dictionary object.
If the incoming string value is already in the dictionary, the existing entry is erased and a new entry is inserted. If the specified position is already occupied, the existing entry is erased and a new entry is inserted. This is meant for users to update a dictionary; however, it may cause two existing entries to be erased. These erased entries could invalidate dependent data structures such as indexes and .int files.
References ibis::util::copy(), find(), and ibis::util::strnewdup().
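Why a positional insert can erase two entries is easier to see in a small model. The sketch below is hypothetical (`PositionalSketch` and `insertAt` are not FastBit names) and represents the dictionary as a simple code-to-string map; it returns the number of erased entries instead of the assigned code so the effect is visible.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical model: code -> string, with at most one code per string.
struct PositionalSketch {
    std::map<uint32_t, std::string> byCode;

    // Place s at pos and report how many existing entries were erased (0, 1, or 2).
    int insertAt(const std::string& s, uint32_t pos) {
        int erased = 0;
        // Erase any existing entry that holds the same string.
        for (auto it = byCode.begin(); it != byCode.end(); ++it) {
            if (it->second == s) {
                byCode.erase(it);
                ++erased;
                break;
            }
        }
        // Erase whatever currently occupies the requested position.
        erased += static_cast<int>(byCode.erase(pos));
        byCode[pos] = s;
        return erased;
    }
};

// Usage: moving "apple" onto the slot of "pear" erases both old entries.
inline void demoPositionalSketch() {
    PositionalSketch d;
    d.insertAt("apple", 1);
    d.insertAt("pear", 2);
    assert(d.insertAt("apple", 2) == 2);
    assert(d.byCode.size() == 1);
    assert(d.byCode.at(2) == "apple");
}
```

Any data encoded with the old codes (1 for "apple", 2 for "pear" in this example) no longer matches the dictionary after such a call, which is the invalidation risk the description warns about.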
uint32_t ibis::dictionary::insertRaw | ( | char * | str | ) |
Non-copying insert.
Non-copying insertion.
Do not make a copy of the input string. Transfers the ownership of str to the dictionary. The caller needs to check whether it is a new word in the dictionary. If it is not a new word in the dictionary, the dictionary does not take ownership of the string argument.
int ibis::dictionary::merge | ( | const dictionary & | rhs | ) |
Merge the incoming dictionary with this one.
It produces a dictionary that combines the words in both dictionaries and keeps the words in ascending order.
Upon successful completion of this function, the return value will be the new size of the dictionary, i.e., the number of non-empty words. It returns a negative value to indicate error.
It produces a dictionary that combines the words in both dictionaries. Existing words in the current dictionary will keep their current assignment.
Upon successful completion of this function, the return value will be the new size of the dictionary.
References key_, ibis::array_t< T >::push_back(), ibis::array_t< T >::reserve(), ibis::array_t< T >::size(), ibis::util::strnewdup(), and ibis::array_t< T >::swap().
Referenced by ibis::mensa::mergeCategories().
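A merge that preserves existing assignments can be sketched as below. This illustrates only the variant in which existing words keep their codes and new words are appended; the function name `mergeWords` and the use of plain vectors, with a word's code being its position, are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Merge rhs into mine. Words already in mine keep their position (code);
// words only in rhs are appended in the order they appear in rhs.
// Returns the new number of words.
inline int mergeWords(std::vector<std::string>& mine,
                      const std::vector<std::string>& rhs) {
    std::unordered_map<std::string, uint32_t> known;
    for (uint32_t i = 0; i < mine.size(); ++i) known.emplace(mine[i], i);
    for (const std::string& w : rhs) {
        // emplace reports whether w was new; only new words are appended.
        if (known.emplace(w, static_cast<uint32_t>(mine.size())).second) {
            mine.push_back(w);
        }
    }
    return static_cast<int>(mine.size());
}

// Usage: "b" and "d" keep codes 0 and 1; "a" and "c" are appended.
inline void demoMergeWords() {
    std::vector<std::string> mine = {"b", "d"};
    const std::vector<std::string> rhs = {"a", "d", "c"};
    assert(mergeWords(mine, rhs) == 4);
    assert(mine[0] == "b" && mine[1] == "d");
    assert(mine[2] == "a" && mine[3] == "c");
}
```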
|
protected |
Merge all buffers into a single one.
New memory is allocated to store the string values together if they are stored in different locations currently.
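The idea of gathering scattered strings into one allocation can be sketched as follows. `PackedStrings` and `packStrings` are hypothetical names, and the layout (nul-terminated strings stored back to back, plus a vector of start offsets) is an assumption chosen for the illustration, not a description of the actual buffer_ layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical packed layout: every string lives in one contiguous buffer.
struct PackedStrings {
    std::vector<char> buffer;          // strings back to back, each nul-terminated
    std::vector<std::size_t> offsets;  // start of each string within buffer

    const char* at(std::size_t i) const { return buffer.data() + offsets[i]; }
};

// Copy strings held at unrelated addresses into a single buffer.
inline PackedStrings packStrings(const std::vector<const char*>& scattered) {
    PackedStrings out;
    std::size_t total = 0;
    for (const char* s : scattered) total += std::strlen(s) + 1;
    out.buffer.reserve(total);  // one allocation for all strings
    for (const char* s : scattered) {
        out.offsets.push_back(out.buffer.size());
        out.buffer.insert(out.buffer.end(), s, s + std::strlen(s) + 1);
    }
    return out;
}

// Usage: "pear" occupies bytes 0..4, so "apple" starts at offset 5.
inline void demoPackStrings() {
    const PackedStrings p = packStrings({"pear", "apple"});
    assert(p.buffer.size() == 11);
    assert(p.offsets[1] == 5);
    assert(std::strcmp(p.at(1), "apple") == 0);
}
```

With all strings in one buffer, the whole dictionary body can be written with a single sequential write instead of one write per string.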
int ibis::dictionary::morph | ( | const dictionary & | old, |
ibis::array_t< uint32_t > & | o2n | ||
) | const |
Produce an array that maps the integers in old dictionary to the new one.
The incoming dictionary represents the old dictionary; this dictionary represents the new one.
Upon successful completion of this function, the array o2n will have (old.size()+1) elements, where the new value for the old code i is stored as o2n[i].
References code_, key_, ibis::array_t< T >::resize(), and ibis::array_t< T >::size().
Referenced by ibis::category::setDictionary().
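The old-to-new array produced by morph can be illustrated with a short sketch. It assumes the convention in which code 0 is reserved for the null string and valid codes run from 1 to size(), which is why the array has old.size()+1 elements. The function name `morphSketch`, the use of plain vectors, and the choice to map a word missing from the new dictionary to 0 are all assumptions for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Words are held in code order: code i (1-based) refers to words[i-1]; code 0 is null.
inline std::vector<uint32_t> morphSketch(const std::vector<std::string>& oldWords,
                                         const std::vector<std::string>& newWords) {
    std::unordered_map<std::string, uint32_t> newCode;
    for (uint32_t i = 0; i < newWords.size(); ++i) {
        newCode.emplace(newWords[i], i + 1);
    }
    // o2n[0] stays 0, so the null code maps to itself.
    std::vector<uint32_t> o2n(oldWords.size() + 1, 0);
    for (uint32_t i = 0; i < oldWords.size(); ++i) {
        auto it = newCode.find(oldWords[i]);
        // Assumption: a word missing from the new dictionary maps to 0.
        o2n[i + 1] = (it != newCode.end()) ? it->second : 0;
    }
    return o2n;
}

// Usage: old codes 1 ("b") and 2 ("d") become 2 and 4 in the new dictionary.
inline void demoMorphSketch() {
    const std::vector<uint32_t> o2n =
        morphSketch({"b", "d"}, {"a", "b", "c", "d"});
    assert(o2n.size() == 3);
    assert(o2n[0] == 0 && o2n[1] == 2 && o2n[2] == 4);
}
```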
|
inline |
Return a string corresponding to the integer.
If the index is beyond the valid range, i.e., i > size(), then a null pointer will be returned.
uint32_t ibis::dictionary::operator[] | ( | const char * | str | ) | const |
Convert a string to its integer code.
Returns 0 for empty (null) strings, 1:size() for strings in the dictionary, and dictionary::size()+1 for unknown values.
Returns 0xFFFFFFFFU for null strings, 0:size()-1 for strings in the dictionary, and dictionary::size() for unknown values.
void ibis::dictionary::patternSearch | ( | const char * | pat, |
array_t< uint32_t > & | matches | ||
) | const |
Find all codes that match the SQL LIKE pattern.
If the pattern is null or empty, matches is not changed.
References ibis::array_t< T >::push_back(), and ibis::util::strMatch().
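A pattern search of this kind can be sketched with a small matcher. The code below is not ibis::util::strMatch: `likeMatch` and `patternSearchSketch` are hypothetical names, only the two standard LIKE wildcards are handled ('%' for any run of characters, '_' for exactly one character), escape characters are ignored, and a word's code is assumed to be its position in the vector.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Return true if s matches the LIKE-style pattern p ('%' and '_' wildcards only).
inline bool likeMatch(const char* s, const char* p) {
    while (*p != '\0') {
        if (*p == '%') {
            while (*p == '%') ++p;        // consecutive '%' behave like one
            if (*p == '\0') return true;  // trailing '%' matches the rest
            for (; *s != '\0'; ++s) {
                if (likeMatch(s, p)) return true;
            }
            return false;
        }
        if (*s == '\0') return false;             // pattern needs more characters
        if (*p != '_' && *p != *s) return false;  // literal mismatch
        ++p;
        ++s;
    }
    return *s == '\0';
}

// Append the code of every matching word to matches.
// A null or empty pattern leaves matches unchanged.
inline void patternSearchSketch(const std::vector<std::string>& words,
                                const char* pat,
                                std::vector<uint32_t>& matches) {
    if (pat == nullptr || *pat == '\0') return;
    for (uint32_t i = 0; i < words.size(); ++i) {
        if (likeMatch(words[i].c_str(), pat)) matches.push_back(i);
    }
}

// Usage: "ap%" selects the words that start with "ap".
inline void demoPatternSearch() {
    const std::vector<std::string> words = {"apple", "apricot", "pear", "grape"};
    std::vector<uint32_t> matches;
    patternSearchSketch(words, "ap%", matches);
    assert(matches.size() == 2);
    assert(matches[0] == 0 && matches[1] == 1);
}
```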
int ibis::dictionary::read | ( | const char * | name | ) |
Read the content of the named file.
The file content is read into the buffer in one-shot and then digested.
This function determines the version of the dictionary and invokes the necessary reading function to perform the actual reading operations. Currently there are three possible versions of dictionaries with a header, plus an unmarked one:
0x02000000 - the version produced by the current write function,
0x01000000 - the version with 64-bit offsets and consecutive keys, where strings are stored in key order,
0x00000000 - the version with 32-bit offsets that stores strings in sorted order,
unmarked - the version without a header, which only has the bare strings in the code order.
|
protected |
Read the ordered strings.
This function processes the data produced by the write function. On successful completion, it returns 0.
References ibis::util::clear().
|
protected |
Read the string values.
This function processes the data produced by version 0x00000000 of the write function. On successful completion, it returns 0.
Note that this function assumes the 20-byte header has been read already.
References ibis::util::clear().
|
protected |
Read the string values.
This function processes the data produced by version 0x01000000 of the write function. On successful completion, it returns 0.
References ibis::util::clear().
|
protected |
Read the string values.
This function processes the data produced by version 0x02000000 of the write function. On successful completion, it returns 0.
References ibis::util::clear(), and ibis::array_t< T >::resize().
|
protected |
Read the raw strings.
This is the older style dictionary that contains the raw strings. On successful completion, this function returns 1.
This is for the oldest style dictionary that contains the raw strings. There is no header in the dictionary file; therefore, this function has to rewind back to the beginning of the file. On successful completion, this function returns 0.
References ibis::util::clear(), and ibis::util::sortStrings().
|
inline |
Return the number of entries in the dictionary.
May have undefined entries.
void ibis::dictionary::sort | ( | ibis::array_t< uint32_t > & | o2n | ) |
Reassign the integer values to the strings.
Upon successful completion of this function, the integer values assigned to the strings will be in ascending order. In other words, string values that are lexicographically smaller will have smaller integer representations.
The argument to this function carries the permutation information needed to turn the previous integer assignments into the new ones. If the previous assignment was k, the new assignment will be o2n[k]. Note that the name o2n is shorthand for old-to-new.
References ibis::array_t< T >::resize().
Referenced by ibis::mensa::mergeCategories().
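How the o2n permutation relates old and new codes can be shown with a small sketch. It assumes 0-based codes equal to a word's position in a vector, which is a simplification for the example; `sortWords` is a hypothetical name and not the FastBit function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Sort words in place and return o2n, where o2n[k] is the new code of old code k.
inline std::vector<uint32_t> sortWords(std::vector<std::string>& words) {
    // order[n] is the old code of the word that ends up at new code n.
    std::vector<uint32_t> order(words.size());
    std::iota(order.begin(), order.end(), 0U);
    std::sort(order.begin(), order.end(),
              [&words](uint32_t a, uint32_t b) { return words[a] < words[b]; });

    std::vector<uint32_t> o2n(words.size());
    std::vector<std::string> sorted(words.size());
    for (uint32_t n = 0; n < order.size(); ++n) {
        o2n[order[n]] = n;            // invert the permutation
        sorted[n] = words[order[n]];
    }
    words.swap(sorted);
    return o2n;
}

// Usage: previously stored codes are translated through o2n.
inline void demoSortWords() {
    std::vector<std::string> words = {"pear", "apple", "kiwi"};
    const std::vector<uint32_t> o2n = sortWords(words);
    assert(words[0] == "apple" && words[1] == "kiwi" && words[2] == "pear");
    assert(o2n[0] == 2 && o2n[1] == 0 && o2n[2] == 1);
    std::vector<uint32_t> column = {0, 0, 1, 2};  // data encoded with old codes
    for (uint32_t& c : column) c = o2n[c];
    assert(column[0] == 2 && column[2] == 0 && column[3] == 1);
}
```

Returning the permutation rather than only sorting lets callers rewrite any data that was encoded with the old codes.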
void ibis::dictionary::toASCII | ( | std::ostream & | out | ) | const |
Output the current content in ASCII format.
Each non-empty entry is printed in the format of "number: string".
int ibis::dictionary::write | ( | const char * | name | ) | const |
Write the content of the dictionary to the named file.
The existing content in the named file is overwritten. The content of the dictionary file is laid out as follows.
Referenced by ibis::bord::backup(), ibis::category::category(), and ibis::tafel::writeMetaData().
|
protected |
Write the buffer out directly.
This function is intended to be used by dictionary::write and must satisfy the following conditions. There must be only one buffer, and the raw_ must be ordered in that buffer. Under these conditions, we can write the buffer using a single sequential write operation, which should reduce the I/O time. The easiest way to satisfy these conditions is to invoke mergeBuffers.
|
protected |
Write the dictionary one keyword at a time.
This version requires one write call for each keyword, which can be time consuming when there are many keywords.
References ibis::array_t< T >::clear(), and ibis::array_t< T >::push_back().
|
protected |
Member variable buffer_ contains a list of pointers to the memory that holds the strings.
Referenced by dictionary(), and swap().
|
protected |
Member variable code_ contains the integer code for each string in key_.
Referenced by dictionary(), equal_to(), morph(), and swap().
|
protected |
Member variable raw_ contains the string values in the order of the code assignment.
Referenced by dictionary(), and swap().