perf: Add 8-bit lookup table to Huffman decoder #45
Conversation
Improved Huffman decoding performance by introducing a lookup table for the first 8 bits of each code, reducing bit-by-bit processing in the common case. This optimization achieves around a 30% speedup.

Optimization details:
- Added a 256-entry lookup table for the first 8 bits
- Fast path for codes ≤ 8 bits
- Fallback to bit-by-bit decoding for longer codes

Benchmark results:

Before:

```
goos: darwin
goarch: arm64
pkg: github.com/cosnicolaou/pbzip2/internal/bzip2
cpu: Apple M2 Max
BenchmarkDecodeDigits-12     363    3222846 ns/op    31.03 MB/s    3613296 B/op    51 allocs/op
BenchmarkDecodeDigits-12     366    3483579 ns/op    28.71 MB/s    3613312 B/op    51 allocs/op
BenchmarkDecodeDigits-12     361    3212324 ns/op    31.13 MB/s    3613296 B/op    51 allocs/op
BenchmarkDecodeNewton-12      92   13106987 ns/op    43.27 MB/s    3631140 B/op    51 allocs/op
BenchmarkDecodeNewton-12      94   12964625 ns/op    43.75 MB/s    3631149 B/op    51 allocs/op
BenchmarkDecodeNewton-12      92   13332504 ns/op    42.54 MB/s    3631141 B/op    51 allocs/op
BenchmarkDecodeRand-12       950    1178515 ns/op    13.90 MB/s    3644460 B/op    51 allocs/op
BenchmarkDecodeRand-12      1022    1180043 ns/op    13.88 MB/s    3644463 B/op    51 allocs/op
BenchmarkDecodeRand-12      1014    1174522 ns/op    13.95 MB/s    3644462 B/op    51 allocs/op
BenchmarkWiktionary-12         1   160628872250 ns/op   65.36 MB/s   367217144 B/op   542483 allocs/op
BenchmarkWiktionary-12         1   165487979792 ns/op   63.44 MB/s   367217160 B/op   542483 allocs/op
BenchmarkWiktionary-12         1   163573653500 ns/op   64.19 MB/s   367217160 B/op   542483 allocs/op
```

After:

```
goos: darwin
goarch: arm64
pkg: github.com/cosnicolaou/pbzip2/internal/bzip2
cpu: Apple M2 Max
BenchmarkDecodeDigits-12     586    1983453 ns/op    50.42 MB/s    3616560 B/op    51 allocs/op
BenchmarkDecodeDigits-12     600    2009215 ns/op    49.77 MB/s    3616572 B/op    51 allocs/op
BenchmarkDecodeDigits-12     602    1990462 ns/op    50.24 MB/s    3616565 B/op    51 allocs/op
BenchmarkDecodeNewton-12     134    8641542 ns/op    65.64 MB/s    3634410 B/op    51 allocs/op
BenchmarkDecodeNewton-12     136    8632566 ns/op    65.70 MB/s    3634412 B/op    51 allocs/op
BenchmarkDecodeNewton-12     138    8686795 ns/op    65.29 MB/s    3634412 B/op    51 allocs/op
BenchmarkDecodeRand-12      2049     580980 ns/op    28.20 MB/s    3647723 B/op    51 allocs/op
BenchmarkDecodeRand-12      2020     578564 ns/op    28.32 MB/s    3647725 B/op    51 allocs/op
BenchmarkDecodeRand-12      2050     581352 ns/op    28.18 MB/s    3647725 B/op    51 allocs/op
BenchmarkWiktionary-12         1   116550048542 ns/op   90.08 MB/s   404888416 B/op   542481 allocs/op
BenchmarkWiktionary-12         1   118146102375 ns/op   88.87 MB/s   404888432 B/op   542481 allocs/op
BenchmarkWiktionary-12         1   117903373792 ns/op   89.05 MB/s   404888432 B/op   542481 allocs/op
```
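For readers unfamiliar with the technique, here is a minimal Go sketch of an 8-bit Huffman lookup table. It is not the PR's actual code; all names (`buildFast`, `decodeAll`, `peek8`) and the toy code set are hypothetical. The key idea is that a code of length n ≤ 8 occupies 2^(8−n) consecutive table slots, so one table lookup on the next 8 bits replaces up to 8 bit-by-bit reads:

```go
package main

import "fmt"

// code describes one Huffman code: the symbol it decodes to,
// its bit pattern, and its length in bits (MSB-first).
type code struct {
	sym  byte
	bits uint32
	n    uint8
}

// entry is one slot of the 256-entry fast table: the decoded symbol and
// the number of bits to consume. A zero length would mark a code longer
// than 8 bits, forcing the bit-by-bit fallback (not needed in this toy).
type entry struct {
	sym byte
	n   uint8
}

// buildFast fills a 256-entry table indexed by the next 8 bits of input.
// Each code of length n <= 8 is replicated into 2^(8-n) consecutive
// slots, one per value of the trailing don't-care bits.
func buildFast(codes []code) [256]entry {
	var fast [256]entry
	for _, c := range codes {
		if c.n > 8 {
			continue // longer codes would go through the slow path
		}
		pad := 8 - c.n
		base := c.bits << pad
		for i := uint32(0); i < 1<<pad; i++ {
			fast[base|i] = entry{c.sym, c.n}
		}
	}
	return fast
}

// peek8 returns the next 8 bits starting at bit position pos (MSB-first),
// zero-padded past the end of the stream.
func peek8(data []byte, pos uint) uint8 {
	var v uint8
	for i := uint(0); i < 8; i++ {
		p := pos + i
		if p >= uint(len(data))*8 {
			break
		}
		if data[p/8]&(0x80>>(p%8)) != 0 {
			v |= 0x80 >> i
		}
	}
	return v
}

// decodeAll decodes nsyms symbols from data using the fast table.
func decodeAll(fast [256]entry, data []byte, nsyms int) string {
	out := make([]byte, 0, nsyms)
	var pos uint
	for len(out) < nsyms {
		e := fast[peek8(data, pos)] // one lookup instead of n bit reads
		out = append(out, e.sym)
		pos += uint(e.n)
	}
	return string(out)
}

func main() {
	// Toy code set: a=0, b=10, c=110, d=111.
	codes := []code{
		{'a', 0b0, 1}, {'b', 0b10, 2}, {'c', 0b110, 3}, {'d', 0b111, 3},
	}
	fast := buildFast(codes)
	// "abcd" encodes to the 9 bits 0 10 110 111 -> bytes 0x5B, 0x80.
	fmt.Println(decodeAll(fast, []byte{0x5B, 0x80}, 4)) // prints "abcd"
}
```

A real decoder also needs the fallback branch: when the table entry marks a code longer than 8 bits, it resumes conventional bit-by-bit tree traversal, which is why the speedup depends on how often short codes dominate the input.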
Very nice. I don't see any improvements in the hot code.
Looking forward to your next release once this PR gets merged.
@cosnicolaou Are you planning to release the next version?
I can; it wasn't clear to me from the last PR whether you had more PRs to come.
All the PRs I was planning to send have been sent and merged.
done!
Thanks for all of the perf improvements!
Cheers, Cos.
On Thu, Nov 7, 2024 at 7:59 AM Nao Yonashiro ***@***.***> wrote:
> All the PRs I was planning to send have been sent and merged.