Lazy minhash implementation for scalability. #653
+93
−3
This is an incremental fix for a scalability problem in MinHasher. The current MinHasher allocates a hash buffer for every element in the init() method, only to use it once in the plus() method before throwing it away. This is fine for a small number of hashes and a relatively small number of items, but it causes GC pressure and heap errors at scale. I ran into the problem when testing with 25K hashes on several million items. The problem persisted when I ran it on a much larger dataset on our Hadoop cluster using Scalding.
This pull request adds a new LazyMinHasher, which just holds the values until aggregation time without allocating any buffers. It allocates the buffer in the plus() method and discards it immediately afterwards. Because short-lived objects are more easily garbage collected, the burden on GC is significantly lower. Also, because only two hash buffers are kept in memory at any given time, rather than all of them being allocated up front, the memory footprint is much lower.
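To illustrate the idea, here is a minimal, self-contained sketch of lazy buffer allocation. It is not the code in this PR and does not use Algebird's API: `LazyMinHashSketch`, `Pending`, `Materialised` and `materialise` are hypothetical names, and `MurmurHash3.stringHash` with the hash index as seed is an assumed stand-in for the real hash functions.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical stand-ins for illustration only, not Algebird's classes.
sealed trait Sig
final case class Pending(value: String) extends Sig         // raw item, no buffer yet
final case class Materialised(mins: Array[Int]) extends Sig // computed signature

final class LazyMinHashSketch(numHashes: Int) {
  // init() only wraps the value; nothing of size numHashes is allocated here.
  def init(value: String): Sig = Pending(value)

  // Allocate the hash buffer only when a signature is actually needed.
  private def materialise(s: Sig): Array[Int] = s match {
    case Materialised(m) => m
    case Pending(v) =>
      Array.tabulate(numHashes)(seed => MurmurHash3.stringHash(v, seed))
  }

  // Element-wise minimum of the two signatures. Buffers created for
  // Pending inputs become garbage as soon as this method returns.
  def plus(a: Sig, b: Sig): Sig = {
    val x = materialise(a)
    val y = materialise(b)
    val out = new Array[Int](numHashes)
    var i = 0
    while (i < numHashes) { out(i) = math.min(x(i), y(i)); i += 1 }
    Materialised(out)
  }

  // Fraction of matching positions, an estimate of Jaccard similarity.
  def similarity(a: Sig, b: Sig): Double = {
    val x = materialise(a)
    val y = materialise(b)
    var same = 0
    var i = 0
    while (i < numHashes) { if (x(i) == y(i)) same += 1; i += 1 }
    same.toDouble / numHashes
  }
}
```

With this shape, `items.map(h.init).reduce(h.plus)` holds only the raw values until aggregation, and each `plus()` call works with a small number of buffers at a time.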
To reproduce the problem, increase the test sample size to ~1M in MinHasherTest.scala and set the new numBands val to ~25K. This should break the MinHasher32 test, while the LazyMinHasher test still passes.