Improve performance by increasing the buffer size to 20K #88
Reading a 5M file takes about 13 seconds. For reference: Python takes about 1.2s for the same file, Go 0.4s, and C++ about 0.25s. So there's a lot to gain!
One reason is that, when reading the file into memory, it allocates only 1000 bytes at a time; a simple measurement with this:
And some measurements for other values:
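The pattern under discussion can be sketched roughly like this; the function name, the `BUFSIZE` constant, and the grow-by-a-fixed-step strategy are my assumptions about the code, not the project's actual implementation:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical buffer step; the PR proposes 20K because larger values
   show diminishing returns. A 1000-byte step means a 5M file triggers
   ~5000 realloc calls, which is where the time goes. */
#define BUFSIZE (20 * 1024)

/* Read all of fp into a heap buffer, growing the buffer by BUFSIZE
   whenever it fills up. Returns the buffer (caller frees) and stores
   the total length in *len, or returns NULL on allocation failure. */
static char *read_all(FILE *fp, size_t *len) {
    size_t cap = BUFSIZE, n = 0;
    char *buf = malloc(cap);
    if (!buf)
        return NULL;

    size_t r;
    while ((r = fread(buf + n, 1, cap - n, fp)) > 0) {
        n += r;
        if (n == cap) {
            /* Buffer full: grow by one more BUFSIZE step. A larger
               step means fewer realloc calls for the same file. */
            cap += BUFSIZE;
            char *tmp = realloc(buf, cap);
            if (!tmp) {
                free(buf);
                return NULL;
            }
            buf = tmp;
        }
    }
    *len = n;
    return buf;
}
```

With a fixed growth step, the number of `realloc` calls scales linearly with file size, so raising the step from 1000 bytes to 20K cuts the call count by a factor of ~20 for the same file.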
I set it to 20K because the performance benefits drop off after that, at least on my system, and it's still quite little memory; but I can also use another value if you prefer – 5K or even 2K already makes a difference.
(The rest of the time is mostly spent in the strcmp() calls in check_key(), by the way, which loops over all the keys seen so far for every key it finds; that's why larger files get so drastically slower.)
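The check_key() behaviour described above amounts to something like the following sketch; the signature and the key-list representation are illustrative assumptions, not the actual code:

```c
#include <string.h>

/* Illustrative duplicate-key check: compare the new key against every
   key parsed so far. Each call does up to n strcmp() calls, so parsing
   n keys costs O(n^2) comparisons in total, which is why parse time
   grows much faster than file size. */
static int check_key(const char **keys, int n, const char *key) {
    for (int i = 0; i < n; i++)
        if (strcmp(keys[i], key) == 0)
            return 1;  /* duplicate found */
    return 0;          /* key is new */
}
```

Because every new key rescans all previous ones, doubling the number of keys roughly quadruples the comparison count, matching the "drastically slower" behaviour on larger files.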