perf: Add 8-bit lookup table to Huffman decoder #45

Merged
cosnicolaou merged 4 commits into cosnicolaou:main on Nov 4, 2024

orisano
Contributor

@orisano orisano commented Nov 3, 2024

Improved Huffman decoding performance by introducing a lookup table for the first 8 bits of each code, reducing bit-by-bit processing in the common case. In the benchmark results listed in this description, time per operation drops by roughly 28% to 51%, depending on the input.

Optimization details:

  • Added 256-entry lookup table for first 8 bits
  • Fast path for codes ≤ 8 bits
  • Fallback to bit-by-bit decoding for longer codes
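A minimal sketch of the idea behind the three points above, not the code in this PR. The types `code`, `lutEntry`, `decoder`, and `bitReader`, the `maxLen` bound, and the example code table are all hypothetical names chosen for illustration; the real decoder in `internal/bzip2/huffman.go` is structured differently.

```go
package main

import "fmt"

// code is one prefix codeword: the low `length` bits of `bits`, MSB first.
type code struct {
	symbol uint16
	bits   uint32
	length uint8
}

// lutEntry is one slot of the 256-entry table. length == 0 means the 8-bit
// prefix belongs to a code longer than 8 bits, so the slow path is needed.
type lutEntry struct {
	symbol uint16
	length uint8
}

type decoder struct {
	lut   [256]lutEntry
	codes []code // scanned by the slow path only
}

func newDecoder(codes []code) *decoder {
	d := &decoder{codes: codes}
	for _, c := range codes {
		if c.length == 0 || c.length > 8 {
			continue
		}
		// Every 8-bit value that begins with this codeword maps to its symbol.
		pad := 8 - c.length
		for i := uint32(0); i < 1<<pad; i++ {
			d.lut[c.bits<<pad|i] = lutEntry{c.symbol, c.length}
		}
	}
	return d
}

// bitReader reads MSB-first bits from data; pos is the current bit offset.
type bitReader struct {
	data []byte
	pos  int
}

// bit returns the bit at absolute offset p, or 0 past the end of data.
func (r *bitReader) bit(p int) uint32 {
	if p>>3 >= len(r.data) {
		return 0
	}
	return uint32(r.data[p>>3]>>(7-uint(p&7))) & 1
}

// peek8 returns the next 8 bits without consuming them, zero-padded at the end.
func (r *bitReader) peek8() uint8 {
	var w uint16
	if i := r.pos >> 3; i < len(r.data) {
		w = uint16(r.data[i]) << 8
		if i+1 < len(r.data) {
			w |= uint16(r.data[i+1])
		}
	}
	return uint8(w >> (8 - uint(r.pos&7)))
}

const maxLen = 20 // assumed upper bound on code length for this sketch

func (d *decoder) decode(r *bitReader) (uint16, bool) {
	end := 8 * len(r.data)
	// Fast path: one table lookup resolves any code of 8 bits or fewer.
	if e := d.lut[r.peek8()]; e.length != 0 {
		if r.pos+int(e.length) > end {
			return 0, false
		}
		r.pos += int(e.length)
		return e.symbol, true
	}
	// Slow path: grow the candidate one bit at a time and look for a match.
	var acc uint32
	for n := 1; n <= maxLen && r.pos+n <= end; n++ {
		acc = acc<<1 | r.bit(r.pos+n-1)
		for _, c := range d.codes {
			if int(c.length) == n && c.bits == acc {
				r.pos += n
				return c.symbol, true
			}
		}
	}
	return 0, false
}

// exampleCodes is a made-up prefix-free code; 'd' and 'e' need 9 bits.
var exampleCodes = []code{
	{'a', 0b0, 1}, {'b', 0b10, 2}, {'c', 0b110, 3},
	{'d', 0b111000000, 9}, {'e', 0b111000001, 9},
}

// decodeN decodes up to n symbols from data and returns them as a string.
func decodeN(d *decoder, data []byte, n int) string {
	r := &bitReader{data: data}
	out := make([]rune, 0, n)
	for i := 0; i < n; i++ {
		s, ok := d.decode(r)
		if !ok {
			break
		}
		out = append(out, rune(s))
	}
	return string(out)
}

func main() {
	d := newDecoder(exampleCodes)
	// Bit stream: a b c d a e, padded with zero bits to a whole byte.
	fmt.Println(decodeN(d, []byte{0x5B, 0x80, 0xE0, 0x80}, 6)) // abcdae
}
```

The table costs 256 small entries per code table, and a short code is resolved with a single indexed load. The linear scan in the slow path is only there to keep the sketch short; a real decoder would walk its existing tree or per-length tables instead.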

Benchmark results:
Before:

goos: darwin
goarch: arm64
pkg: github.com/cosnicolaou/pbzip2/internal/bzip2
cpu: Apple M2 Max
BenchmarkDecodeDigits-12    	     363	   3222846 ns/op	  31.03 MB/s	 3613296 B/op	      51 allocs/op
BenchmarkDecodeDigits-12    	     366	   3483579 ns/op	  28.71 MB/s	 3613312 B/op	      51 allocs/op
BenchmarkDecodeDigits-12    	     361	   3212324 ns/op	  31.13 MB/s	 3613296 B/op	      51 allocs/op
BenchmarkDecodeNewton-12    	      92	  13106987 ns/op	  43.27 MB/s	 3631140 B/op	      51 allocs/op
BenchmarkDecodeNewton-12    	      94	  12964625 ns/op	  43.75 MB/s	 3631149 B/op	      51 allocs/op
BenchmarkDecodeNewton-12    	      92	  13332504 ns/op	  42.54 MB/s	 3631141 B/op	      51 allocs/op
BenchmarkDecodeRand-12      	     950	   1178515 ns/op	  13.90 MB/s	 3644460 B/op	      51 allocs/op
BenchmarkDecodeRand-12      	    1022	   1180043 ns/op	  13.88 MB/s	 3644463 B/op	      51 allocs/op
BenchmarkDecodeRand-12      	    1014	   1174522 ns/op	  13.95 MB/s	 3644462 B/op	      51 allocs/op
BenchmarkWiktionary-12      	       1	160628872250 ns/op	  65.36 MB/s	367217144 B/op	  542483 allocs/op
BenchmarkWiktionary-12      	       1	165487979792 ns/op	  63.44 MB/s	367217160 B/op	  542483 allocs/op
BenchmarkWiktionary-12      	       1	163573653500 ns/op	  64.19 MB/s	367217160 B/op	  542483 allocs/op

After:

goos: darwin
goarch: arm64
pkg: github.com/cosnicolaou/pbzip2/internal/bzip2
cpu: Apple M2 Max
BenchmarkDecodeDigits-12    	     586	   1983453 ns/op	  50.42 MB/s	 3616560 B/op	      51 allocs/op
BenchmarkDecodeDigits-12    	     600	   2009215 ns/op	  49.77 MB/s	 3616572 B/op	      51 allocs/op
BenchmarkDecodeDigits-12    	     602	   1990462 ns/op	  50.24 MB/s	 3616565 B/op	      51 allocs/op
BenchmarkDecodeNewton-12    	     134	   8641542 ns/op	  65.64 MB/s	 3634410 B/op	      51 allocs/op
BenchmarkDecodeNewton-12    	     136	   8632566 ns/op	  65.70 MB/s	 3634412 B/op	      51 allocs/op
BenchmarkDecodeNewton-12    	     138	   8686795 ns/op	  65.29 MB/s	 3634412 B/op	      51 allocs/op
BenchmarkDecodeRand-12      	    2049	    580980 ns/op	  28.20 MB/s	 3647723 B/op	      51 allocs/op
BenchmarkDecodeRand-12      	    2020	    578564 ns/op	  28.32 MB/s	 3647725 B/op	      51 allocs/op
BenchmarkDecodeRand-12      	    2050	    581352 ns/op	  28.18 MB/s	 3647725 B/op	      51 allocs/op
BenchmarkWiktionary-12      	       1	116550048542 ns/op	  90.08 MB/s	404888416 B/op	  542481 allocs/op
BenchmarkWiktionary-12      	       1	118146102375 ns/op	  88.87 MB/s	404888432 B/op	  542481 allocs/op
BenchmarkWiktionary-12      	       1	117903373792 ns/op	  89.05 MB/s	404888432 B/op	  542481 allocs/op

Collaborator

@klauspost klauspost left a comment

Very nice. I don't see any further improvements to make in the hot code.

Review comment on internal/bzip2/huffman.go (outdated, resolved)
@orisano
Contributor Author

orisano commented Nov 4, 2024

Looking forward to your new version release once this PR gets merged.

@cosnicolaou cosnicolaou merged commit c91e1ca into cosnicolaou:main Nov 4, 2024
8 checks passed
@orisano
Contributor Author

orisano commented Nov 7, 2024

@cosnicolaou Are you planning to release the next version?

@cosnicolaou
Owner

I can. It wasn't clear to me from the last PR whether you had more PRs to come.

@orisano
Contributor Author

orisano commented Nov 7, 2024

All the PRs I was planning to send have been sent and merged.

@cosnicolaou
Owner

cosnicolaou commented Nov 7, 2024 via email
