File Merger performs a k-way merge of all input files. It is a partial implementation of an external merge sort: it performs only the second (merge) phase and assumes that every input file is already sorted on the merge key.
It is written in Rust, as an initial foray into the language.
- Ability to generate, store and later utilize a cache of files to perform the sort on (this is useful for batch processing)
- Able to merge on any single column
- Supports any single-character delimiter you throw at it
- Low memory overhead as we only store the 'current' line of each merge file in memory
- Supports different specializations of the merge key, allowing faster merges
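The merge phase and the one-line-per-file memory profile described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `merge_key` and `kway_merge` are hypothetical names, keys are compared as plain strings, and the inputs are in-memory iterators standing in for already-sorted files.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical helper: the merge key is the `index`-th field of the line
/// when split on `delimiter` (empty if the column is missing).
fn merge_key(line: &str, delimiter: char, index: usize) -> String {
    line.split(delimiter).nth(index).unwrap_or("").to_string()
}

/// Merge already-sorted line sources, passing each output line to `emit`.
/// Only the current line of each source is held in memory at any time.
fn kway_merge<I, F>(mut sources: Vec<I>, delimiter: char, index: usize, mut emit: F)
where
    I: Iterator<Item = String>,
    F: FnMut(&str),
{
    // Min-heap of (key, source id, line); Reverse flips BinaryHeap's max-heap order.
    let mut heap = BinaryHeap::new();
    for (id, source) in sources.iter_mut().enumerate() {
        if let Some(line) = source.next() {
            heap.push(Reverse((merge_key(&line, delimiter, index), id, line)));
        }
    }
    // Pop the smallest key, emit its line, then refill from the same source.
    while let Some(Reverse((_, id, line))) = heap.pop() {
        emit(&line);
        if let Some(next) = sources[id].next() {
            heap.push(Reverse((merge_key(&next, delimiter, index), id, next)));
        }
    }
}

fn main() {
    // Two "files", each already sorted on column 0.
    let a = vec!["1,apple".to_string(), "4,date".to_string()];
    let b = vec!["2,banana".to_string(), "3,cherry".to_string()];
    let mut merged = Vec::new();
    kway_merge(vec![a.into_iter(), b.into_iter()], ',', 0, |line| {
        merged.push(line.to_string())
    });
    for line in &merged {
        println!("{line}");
    }
}
```

With k inputs the heap never holds more than k lines, which is where the low memory overhead comes from. A real run would presumably wrap each file in a buffered reader and write to an output stream instead of collecting into a `Vec`.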
- Clone the repository:
  git clone https://github.com/michael-robbins/filemerger.git
- Build it:
  cd filemerger; cargo build --release
  The binary will now be in ./target/release/filemerger
- Done! Test it out by generating a cache file or performing a direct merge!
Usage: ./file-merger [-h] [-v] -- See below for all options
Options:
    -h, --help          Print out this help.
    -v, --verbose       Prints out more info (able to be applied up to 3 times)
    --config-file /path/to/config.yaml
                        Configuration file in YAML that contains most other settings
    --delimiter ' ' || ',' || '|'
                        Raw character we split the line on
    --index 0 -> len(line) - 1
                        Column index we will use for the merge key (0 based)
    --glob /path/to/specific_*_files.*.gz
                        File glob that will provide all required files
    --cache-file /path/to/file.cache
                        Cache file containing files we could merge and their upper and lower merge keys
    --key-start 1       Lower bound (starting from and including) merge key
    --key-end 10        Upper bound (up to but not including) merge key
    --key-type 'Unsigned32Integer' || 'Signed32Integer' || 'String'
                        The data type of the key used for optimization
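To make the interaction between --key-type, --key-start and --key-end concrete, here is a small sketch of a typed key with a half-open range check (start included, end excluded). The `MergeKey` enum and the `parse_key` and `in_range` functions are hypothetical names chosen for illustration; how the tool itself parses and compares keys is not shown in this README.

```rust
/// Hypothetical typed key mirroring the three --key-type values.
/// Only keys of the same variant are meant to be compared with each other.
#[derive(Debug, Clone, PartialEq, PartialOrd)]
enum MergeKey {
    U32(u32),
    I32(i32),
    Text(String),
}

/// Parse a raw column value according to the requested key type.
/// Returns None for an unknown type name or an unparsable number.
fn parse_key(raw: &str, key_type: &str) -> Option<MergeKey> {
    match key_type {
        "Unsigned32Integer" => raw.parse::<u32>().ok().map(MergeKey::U32),
        "Signed32Integer" => raw.parse::<i32>().ok().map(MergeKey::I32),
        "String" => Some(MergeKey::Text(raw.to_string())),
        _ => None,
    }
}

/// Half-open range check: start is included, end is excluded.
fn in_range(key: &MergeKey, start: &MergeKey, end: &MergeKey) -> bool {
    start <= key && key < end
}

fn main() {
    let key_type = "Unsigned32Integer";
    let start = parse_key("1", key_type).expect("valid start key");
    let end = parse_key("10", key_type).expect("valid end key");
    for raw in ["0", "1", "9", "10"] {
        let key = parse_key(raw, key_type).expect("valid key");
        println!("{raw}: in range = {}", in_range(&key, &start, &end));
    }
}
```

Note that the key type changes ordering as well as parsing cost: as integers, 9 falls inside the range 1 to 10, but as strings "9" sorts after "10" and falls outside it. Under this sketch's assumptions, the chosen key type would need to match the order in which the input files were sorted.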