Replies: 8 comments
-
Yeah this may be one drawback of the rust ecosystem: older C/C++ software gets ported a lot but probably without as much care as the original authors put in. I'll try building on 1e11 keys overnight for various 128bit hash functions and see how that goes. Do you use all entropy in the hashes for your construction? I wonder why xxh3 breaks for you but not for me. (But maybe a 1% sample simply isn't large enough to find collisions.) |
Beta Was this translation helpful? Give feedback.
-
Anyway, I tested not in isolation CityHash and SpookyHash, and SpookyHash is the clear winner. My feeling is that what I always predicate about PRNGs is true also for hash functions: test in the application, not in isolation. The stuff that the compiler can do in tight loops (e.g., different level of loop unrolling) makes benchmarks extremely unreliable. It might be possible that with some careful testing that avoid speculation, parallelism, unrolling etc. one can get a result closer to reality, but for the time being I'm relying only on testing in the application. |
Beta Was this translation helpful? Give feedback.
-
Yeah that's fair. I started wondering: do you think that many hash functions are actually injective on u64 keys? Or maybe even bijective on their low 64 bits? Since I always use all those bits that would mean I will never run into collision problems on such a dataset for such hashes, and I basically need to test on over 4e9 strings to hit problems. (But I don't have a 40GB dataset lying around at the moment, nor is my harddrive large enough to go much beyond 100GB of data :/.) |
Beta Was this translation helpful? Give feedback.
-
I don't know of any hash function with that property, and I think it is impossible to have unless you accept to fail many statistical tests. |
Beta Was this translation helpful? Give feedback.
-
Well, I believe murmur2 is invertible, as in: for every 64bit hash you can find a 64bit key that hashes to it. See this blogpost in particular. That means that murmur2 is a bijection on [0, 2^64) -> [0, 2^64), and hence integer keys can never give collisions. In particular it seems reasonable to me that most hash functions should satisfy a similar property to minimize the probability of collisions in general. But then again, murmur2 does fail some tests :) |
Beta Was this translation helpful? Give feedback.
-
Ah, I misread—you talked about the lower bits and this conversation is named "128-bit hash functions" so I thought you were looking for a 128-bit hash function whose lower 64 bits are in bijection with 64-bit keys. That's difficult of obtain, in my opinion, for a 128-bit function. But maybe you can just generate two different 64-bit value, each value in bijection with the first 8 bytes of the key, and just use those juxtaposed. Not sure about the statistical properties you'd get tho. If the output is 64 bit, yes, why not, you might make it invertible. |
Beta Was this translation helpful? Give feedback.
-
Just came across gxhash: Uses AES intrinsics, passes SMHasher and is apparently quite a bit faster than other methods. Should try it at some point |
Beta Was this translation helpful? Give feedback.
-
Just came here to post this! Rust subreddit? ;P |
Beta Was this translation helpful? Give feedback.
-
We can move the discussion from #4 here.
Running your tests and checking against SMHasher's, it is clear that:
Presently I'm trying CityHash, which turns out to be really, really fast, and in theory passes all tests—so maybe it works for me, too. Let's see. But shopping on SMHasher does not work—Rust implementations have a lot of latitude in quality.
BTW, on 64-bit integers SpookyHash is slightly faster, but frankly 64-bit keys have little use beside benchmarking.
Beta Was this translation helpful? Give feedback.
All reactions