-
Notifications
You must be signed in to change notification settings - Fork 11
C++ Library implementing Compressed String Dictionaries
License
migumar2/libCSD
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
libCSD is a C++ library providing some different techniques for managing string dictionaries in compressed space. These approaches are inspired on the paper: ========================================================================== "Compressed String Dictionaries" Nieves R. Brisaboa, Rodrigo Cánovas, Francisco Claude, Miguel A. Martínez-Prieto, and Gonzalo Navarro 10th Symposium on Experimental Algorithms (SEA'2011), p.136-147, 2011. ========================================================================== The current release is a beta version providing full functionality for string location and ID extraction. Besides, some techniques support queries by prefix and substrings. The set of techniques is described as follows: - "Hash": implements four different techniques which combines hashing and Huffman / Re-Pair compression. - "FC": implements two different techniques based on Front-Coding compression. PFC is a straightforward byte-oriented implementation, and HTFC performs Hu-Tucker compression over the headers while supports Huffman / Re-Pair compression for the internal strings. - "RPDAC": uses Re-Pair for compressing the strings and. - "FMI" : self-indexes the dictionary and provides all types of queries using the FM-Index facilities. - "XBW" : obtains a compressed trie of the dictionary and transforms it to support all types of queries in highly compressed space. The techniques support different parameterizations to achieve varied and competitive space/time tradeoffs. Compiling the library ==================== 1) For building the libcds library, type "make" within the libcds folder. 2) Once built libcds, type "make" within the root folder and libCSD is ready. Building a dictionary ===================== The library provides a simple command-line script for building dictionaries: ./Build <type> <parameters> <in> <out> - The first parameter chooses the <type> of dictionary to be built. - The set of <parameters> provide specific configuration values. - The <in> parameter locates the original dictionary file. - The <out> parameter tells the destination path. It is worth noting that "in" dictionary file must be lexicographically sorted and strings must be ended with the '\0' ASCII char. Examples: ========= ./Build 1 h 10 geonames dicts/geo.10 Builds a HASH dictionary for "geonames" and stores it as "dicts/geo.10". The dictionary uses a hash table which hash 10% more cells than strings in "geonames" and compresses them using Huffman (h). ./Build 4 r 16 geonames dicts/geo.16 Builds a HTFC (Hu-Tucker Front-Coding) dictionary for "geonames" and stores it as "dicts/geo.16". The dictionary uses buckets of 16 strings and compresses them using Re-Pair. Testing a dictionary ==================== The library also provides a command-line script for testing purposes: ./Test <mode> <opt> <in> <file> - This script supports four different <modes>: - 'r' is used for running the test chosen in <opt>: - 'l' (for testing locate), 'e' (extract). - 'pl' (prefix location), 'pe' (prefix extraction). - 'sl' (substring location), 'pe' (substring extraction). - 'g' is used for generating a basic testbed comprising <opt> valid strings for locate and <opt> valid IDs for extract. - 'p' is used for generating a prefix testbed comprising 100,000 query patterns of some different lengths. The <opt> parameter tells the a reference string length, and the testbed obtain different files with patterns longer and shorter than this length. - 's' is used for generating a substring testbed with the features described above. - The parameter <in> locates the file storing the compressed string dictionary. - The last <file> parameter locates the file comprising the testbed (for 'r' mode) or the destination path for saving the corresponding testbed (modes 'g', 'p', or 's'). Examples: ========= ./Test r l dicts/geo.10 tests/geo.strings Uses the testbed "tests/geo.strings" for analyzing how the dictionary stored at "dicts/geo.10" behaves for 'locate'. ./Test g 100000 dicts/geo.10 tests/geo Uses the dictionary stored at "dicts/geo.10" for generating a basic testbed of 100,000 strings and IDs which are stored as "tests/geo". ./Test p 16 dicts/geo.10 tests/geo Uses the dictionary stored at "dicts/geo.10" for generating a prefix testbed comprising patterns of lengths longer and shorter than 16 chars. The resulting pattern set are stored at "tests/geo". If you find bugs or have any issue with library, please ask us. Enjoy the library and if you find it useful for your research, please cite our paper: @article{MPBCCN16, author = "M.A. Mart{\'{\i}}nez-Prieto and N. Brisaboa and R. C{\'a}novas and F. Claude and G. Navarro", title = "Practical Compressed String Dictionaries", booktitle = "Information Systems", year = "2016", volume = {56}, pages = "73--108" }
About
C++ Library implementing Compressed String Dictionaries
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published