LineDeduplicator

LineDeduplicator is a fast, concurrent tool that removes duplicate lines from a text file. It uses multiple CPU cores while keeping its memory footprint low.

Install dependencies

Note: Go 1.9+ is required because of sync.Map.

go get github.com/OneOfOne/xxhash

Features

  • Unique lines will be written to disk
  • Optional: Duplicate lines will be written to disk
  • A non-cryptographic hash (xxhash) is used, for speed close to that of memory access
  • Low memory usage, because only a hash of each line is kept in a hash map for lookup
  • Uses all cores of a system and is optimized for 16 cores or more
  • Linear performance and RAM usage
  • Super fast (when you come from Python or JavaScript)
  • Producer-Consumer pattern used to implement concurrency

Memory usage is roughly 4 bytes per unique line in the file, i.e. (number of unique lines) × 4 bytes for the stored hashes.
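As a worked example of that formula, the sketch below estimates the hash storage for a hypothetical file. The function name `estimateBytes` and the figure of 100 million unique lines are illustrative assumptions, and the result covers only the 4-byte hashes, not any per-entry overhead of the map that holds them.

```go
package main

import "fmt"

// bytesPerHash is the size of one stored line hash, per the formula above.
const bytesPerHash = 4

// estimateBytes applies the formula: unique lines * 4 bytes.
// It deliberately ignores the map's own per-entry overhead.
func estimateBytes(uniqueLines uint64) uint64 {
	return uniqueLines * bytesPerHash
}

func main() {
	n := uint64(100000000) // hypothetical: 100 million unique lines
	b := estimateBytes(n)
	// prints "100000000 unique lines -> 400000000 bytes (381.5 MiB)"
	fmt.Printf("%d unique lines -> %d bytes (%.1f MiB)\n", n, b, float64(b)/(1<<20))
}
```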