perf: Use bit-reversed CRC32 computation with hash/crc32 package #43

Merged Oct 31, 2024 (3 commits)

Contributor

@orisano orisano commented Oct 30, 2024

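Improved BZIP2 CRC32 calculation by using bit-reversed values with the standard hash/crc32 package. bzip2's CRC32 uses the same polynomial as the IEEE CRC32 but processes bits MSB-first, so reversing the bits of the input bytes and of the result lets us reuse hash/crc32 and its hardware-accelerated implementation: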

  • ARM64: 10-12% speedup using RBIT instruction for bit reversal
  • AMD64: ~7% speedup using lookup table for bit reversal

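The benchmarks below show consistent improvements on large inputs (>1GB, BenchmarkWiktionary) and on BenchmarkDecodeNewton; on AMD64, BenchmarkDecodeDigits and BenchmarkDecodeRand are within run-to-run noise.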
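
The relationship this change relies on can be sketched as follows. This is a minimal illustration, not the code from this PR: the function names `bzip2CRCReference` and `bzip2CRCViaStdlib` are hypothetical, and the sketch copies the input into a reversed buffer for simplicity, which a real decoder would likely avoid. It assumes the usual CRC-32/BZIP2 parameters (polynomial 0x04C11DB7, MSB-first, initial value and final XOR of 0xFFFFFFFF).

```go
package main

import (
	"fmt"
	"hash/crc32"
	"math/bits"
)

// bzip2CRCReference is a bit-at-a-time, MSB-first CRC32 with the
// parameters assumed above. It is slow and serves only as a baseline.
func bzip2CRCReference(data []byte) uint32 {
	crc := uint32(0xFFFFFFFF)
	for _, b := range data {
		crc ^= uint32(b) << 24
		for i := 0; i < 8; i++ {
			if crc&0x80000000 != 0 {
				crc = (crc << 1) ^ 0x04C11DB7
			} else {
				crc <<= 1
			}
		}
	}
	return ^crc
}

// bzip2CRCViaStdlib (hypothetical name) computes the same value with
// hash/crc32: reverse the bits of every input byte, run the standard
// LSB-first IEEE CRC32, then reverse the bits of the 32-bit result.
func bzip2CRCViaStdlib(data []byte) uint32 {
	rev := make([]byte, len(data))
	for i, b := range data {
		rev[i] = bits.Reverse8(b)
	}
	return bits.Reverse32(crc32.ChecksumIEEE(rev))
}

func main() {
	msg := []byte("123456789")
	fmt.Printf("reference:  %#08x\n", bzip2CRCReference(msg))
	fmt.Printf("via stdlib: %#08x\n", bzip2CRCViaStdlib(msg))
}
```

Because the all-ones initial value and final XOR are unchanged by bit reversal, only the data bytes and the final checksum need reversing.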

Benchmark results (ARM64):
Before:

goos: darwin
goarch: arm64
pkg: github.com/cosnicolaou/pbzip2/internal/bzip2
cpu: Apple M2 Max
BenchmarkDecodeDigits-12    	     330	   3624395 ns/op	  27.59 MB/s	 3612915 B/op	      51 allocs/op
BenchmarkDecodeDigits-12    	     330	   3644010 ns/op	  27.44 MB/s	 3612915 B/op	      51 allocs/op
BenchmarkDecodeDigits-12    	     327	   3732479 ns/op	  26.79 MB/s	 3612947 B/op	      51 allocs/op
BenchmarkDecodeNewton-12    	      76	  14911637 ns/op	  38.04 MB/s	 3630765 B/op	      51 allocs/op
BenchmarkDecodeNewton-12    	      78	  14745067 ns/op	  38.47 MB/s	 3630768 B/op	      51 allocs/op
BenchmarkDecodeNewton-12    	      80	  14724860 ns/op	  38.52 MB/s	 3630758 B/op	      51 allocs/op
BenchmarkDecodeRand-12      	     966	   1359254 ns/op	  12.05 MB/s	 3644075 B/op	      51 allocs/op
BenchmarkDecodeRand-12      	     969	   1265783 ns/op	  12.94 MB/s	 3644089 B/op	      51 allocs/op
BenchmarkDecodeRand-12      	     960	   1255644 ns/op	  13.05 MB/s	 3644082 B/op	      51 allocs/op
BenchmarkWiktionary-12      	       1	205158310917 ns/op	  51.18 MB/s	367216776 B/op	  542483 allocs/op
BenchmarkWiktionary-12      	       1	207003968667 ns/op	  50.72 MB/s	367216776 B/op	  542483 allocs/op
BenchmarkWiktionary-12      	       1	208614686041 ns/op	  50.33 MB/s	367216760 B/op	  542483 allocs/op

After:

goos: darwin
goarch: arm64
pkg: github.com/cosnicolaou/pbzip2/internal/bzip2
cpu: Apple M2 Max
BenchmarkDecodeDigits-12    	     348	   3410393 ns/op	  29.32 MB/s	 3613298 B/op	      51 allocs/op
BenchmarkDecodeDigits-12    	     351	   3401614 ns/op	  29.40 MB/s	 3613299 B/op	      51 allocs/op
BenchmarkDecodeDigits-12    	     351	   3397780 ns/op	  29.43 MB/s	 3613294 B/op	      51 allocs/op
BenchmarkDecodeNewton-12    	      88	  13143153 ns/op	  43.16 MB/s	 3631144 B/op	      51 allocs/op
BenchmarkDecodeNewton-12    	      90	  13220578 ns/op	  42.90 MB/s	 3631145 B/op	      51 allocs/op
BenchmarkDecodeNewton-12    	      86	  13253067 ns/op	  42.80 MB/s	 3631149 B/op	      51 allocs/op
BenchmarkDecodeRand-12      	    1011	   1212203 ns/op	  13.52 MB/s	 3644467 B/op	      51 allocs/op
BenchmarkDecodeRand-12      	    1005	   1216967 ns/op	  13.46 MB/s	 3644460 B/op	      51 allocs/op
BenchmarkDecodeRand-12      	     979	   1227642 ns/op	  13.35 MB/s	 3644458 B/op	      51 allocs/op
BenchmarkWiktionary-12      	       1	182874540791 ns/op	  57.41 MB/s	367217160 B/op	  542483 allocs/op
BenchmarkWiktionary-12      	       1	183373722875 ns/op	  57.26 MB/s	367217160 B/op	  542483 allocs/op
BenchmarkWiktionary-12      	       1	182450789709 ns/op	  57.54 MB/s	367217160 B/op	  542483 allocs/op

Benchmark results (AMD64):
Before:

goos: linux
goarch: amd64
pkg: github.com/cosnicolaou/pbzip2/internal/bzip2
cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
BenchmarkDecodeDigits-8   	     225	   5091263 ns/op	  19.64 MB/s	 3612579 B/op	      51 allocs/op
BenchmarkDecodeDigits-8   	     213	   5065560 ns/op	  19.74 MB/s	 3612580 B/op	      51 allocs/op
BenchmarkDecodeDigits-8   	     236	   5303314 ns/op	  18.86 MB/s	 3612582 B/op	      51 allocs/op
BenchmarkDecodeDigits-8   	     225	   5297896 ns/op	  18.88 MB/s	 3612579 B/op	      51 allocs/op
BenchmarkDecodeDigits-8   	     226	   5209545 ns/op	  19.20 MB/s	 3612580 B/op	      51 allocs/op
BenchmarkDecodeDigits-8   	     192	   5342964 ns/op	  18.72 MB/s	 3612583 B/op	      51 allocs/op
BenchmarkDecodeNewton-8   	      57	  19633889 ns/op	  28.89 MB/s	 3630421 B/op	      51 allocs/op
BenchmarkDecodeNewton-8   	      58	  19672671 ns/op	  28.83 MB/s	 3630428 B/op	      51 allocs/op
BenchmarkDecodeNewton-8   	      61	  19758570 ns/op	  28.71 MB/s	 3630423 B/op	      51 allocs/op
BenchmarkDecodeNewton-8   	      61	  19739598 ns/op	  28.73 MB/s	 3630422 B/op	      51 allocs/op
BenchmarkDecodeNewton-8   	      61	  19763141 ns/op	  28.70 MB/s	 3630418 B/op	      50 allocs/op
BenchmarkDecodeNewton-8   	      62	  19736282 ns/op	  28.74 MB/s	 3630428 B/op	      51 allocs/op
BenchmarkDecodeRand-8     	     660	   2487447 ns/op	   6.59 MB/s	 3643743 B/op	      51 allocs/op
BenchmarkDecodeRand-8     	     519	   2424910 ns/op	   6.76 MB/s	 3643740 B/op	      51 allocs/op
BenchmarkDecodeRand-8     	     681	   2246711 ns/op	   7.29 MB/s	 3643742 B/op	      51 allocs/op
BenchmarkDecodeRand-8     	     519	   2818677 ns/op	   5.81 MB/s	 3643742 B/op	      51 allocs/op
BenchmarkDecodeRand-8     	     480	   3195923 ns/op	   5.13 MB/s	 3643744 B/op	      51 allocs/op
BenchmarkDecodeRand-8     	     426	   2545397 ns/op	   6.44 MB/s	 3643741 B/op	      51 allocs/op
BenchmarkWiktionary-8     	       1	255098611022 ns/op	  41.16 MB/s	367215496 B/op	  542483 allocs/op
BenchmarkWiktionary-8     	       1	250397169327 ns/op	  41.93 MB/s	367215480 B/op	  542483 allocs/op
BenchmarkWiktionary-8     	       1	238734724759 ns/op	  43.98 MB/s	367215464 B/op	  542483 allocs/op
BenchmarkWiktionary-8     	       1	244597109758 ns/op	  42.92 MB/s	367215480 B/op	  542483 allocs/op
BenchmarkWiktionary-8     	       1	252179664415 ns/op	  41.63 MB/s	367215464 B/op	  542483 allocs/op
BenchmarkWiktionary-8     	       1	257409744381 ns/op	  40.79 MB/s	367215480 B/op	  542483 allocs/op

After:

goos: linux
goarch: amd64
pkg: github.com/cosnicolaou/pbzip2/internal/bzip2
cpu: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
BenchmarkDecodeDigits-8   	     238	   5338924 ns/op	  18.73 MB/s	 3612966 B/op	      51 allocs/op
BenchmarkDecodeDigits-8   	     192	   5548144 ns/op	  18.02 MB/s	 3612965 B/op	      51 allocs/op
BenchmarkDecodeDigits-8   	     231	   5160329 ns/op	  19.38 MB/s	 3612964 B/op	      51 allocs/op
BenchmarkDecodeDigits-8   	     234	   5213871 ns/op	  19.18 MB/s	 3612967 B/op	      51 allocs/op
BenchmarkDecodeDigits-8   	     237	   5183392 ns/op	  19.29 MB/s	 3612964 B/op	      51 allocs/op
BenchmarkDecodeDigits-8   	     234	   5216613 ns/op	  19.17 MB/s	 3612966 B/op	      51 allocs/op
BenchmarkDecodeNewton-8   	      66	  18813414 ns/op	  30.15 MB/s	 3630812 B/op	      51 allocs/op
BenchmarkDecodeNewton-8   	      64	  18738811 ns/op	  30.27 MB/s	 3630815 B/op	      51 allocs/op
BenchmarkDecodeNewton-8   	      64	  18774511 ns/op	  30.21 MB/s	 3630812 B/op	      51 allocs/op
BenchmarkDecodeNewton-8   	      64	  18803172 ns/op	  30.17 MB/s	 3630817 B/op	      51 allocs/op
BenchmarkDecodeNewton-8   	      64	  18793181 ns/op	  30.18 MB/s	 3630815 B/op	      51 allocs/op
BenchmarkDecodeNewton-8   	      64	  18779499 ns/op	  30.20 MB/s	 3630812 B/op	      51 allocs/op
BenchmarkDecodeRand-8     	     710	   2445520 ns/op	   6.70 MB/s	 3644127 B/op	      51 allocs/op
BenchmarkDecodeRand-8     	     573	   2337578 ns/op	   7.01 MB/s	 3644126 B/op	      51 allocs/op
BenchmarkDecodeRand-8     	     544	   2681357 ns/op	   6.11 MB/s	 3644127 B/op	      51 allocs/op
BenchmarkDecodeRand-8     	     354	   2838394 ns/op	   5.77 MB/s	 3644129 B/op	      51 allocs/op
BenchmarkDecodeRand-8     	     439	   2403619 ns/op	   6.82 MB/s	 3644126 B/op	      51 allocs/op
BenchmarkDecodeRand-8     	     687	   2569978 ns/op	   6.38 MB/s	 3644126 B/op	      51 allocs/op
BenchmarkWiktionary-8     	       1	243108260459 ns/op	  43.19 MB/s	367215880 B/op	  542483 allocs/op
BenchmarkWiktionary-8     	       1	233205853611 ns/op	  45.02 MB/s	367215864 B/op	  542483 allocs/op
BenchmarkWiktionary-8     	       1	225249072544 ns/op	  46.61 MB/s	367215864 B/op	  542483 allocs/op
BenchmarkWiktionary-8     	       1	229793144010 ns/op	  45.69 MB/s	367215864 B/op	  542483 allocs/op
BenchmarkWiktionary-8     	       1	233835578682 ns/op	  44.90 MB/s	367215848 B/op	  542483 allocs/op
BenchmarkWiktionary-8     	       1	241809950266 ns/op	  43.42 MB/s	367215864 B/op	  542483 allocs/op
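
The headline percentages can be checked against the BenchmarkWiktionary ns/op figures above. The sketch below averages the listed runs and reports the reduction in time per operation; the helper names `mean` and `reductionPct` are illustrative only. A tool such as benchstat would give a more rigorous comparison.

```go
package main

import "fmt"

// mean returns the arithmetic mean of xs.
func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// reductionPct returns the percentage by which the mean of after
// is lower than the mean of before.
func reductionPct(before, after []float64) float64 {
	return (1 - mean(after)/mean(before)) * 100
}

func main() {
	// BenchmarkWiktionary ns/op values copied from the results above.
	arm64Before := []float64{205158310917, 207003968667, 208614686041}
	arm64After := []float64{182874540791, 183373722875, 182450789709}
	amd64Before := []float64{255098611022, 250397169327, 238734724759,
		244597109758, 252179664415, 257409744381}
	amd64After := []float64{243108260459, 233205853611, 225249072544,
		229793144010, 233835578682, 241809950266}

	fmt.Printf("ARM64 Wiktionary: %.1f%% less time per op\n",
		reductionPct(arm64Before, arm64After))
	fmt.Printf("AMD64 Wiktionary: %.1f%% less time per op\n",
		reductionPct(amd64Before, amd64After))
}
```

Note that a reduction in ns/op is not the same figure as the corresponding increase in MB/s; the sketch reports the former.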

Collaborator

@klauspost klauspost left a comment

lgtm. Lint stuff is fixed in #42

@orisano
Contributor Author

orisano commented Oct 31, 2024

I resolved the conflicts.

@cosnicolaou cosnicolaou merged commit d529180 into cosnicolaou:main Oct 31, 2024
8 checks passed