Complete 8-bit SATD aarch64 assembly implementation #3280
Codecov Report: All modified and coverable lines are covered by tests ✅ (1 file with indirect coverage changes).
* d8-d15: Callee-saved registers
* v16-v31: Temporary registers
Baseline with initial register assignment patch:
Change with next 2 patches to use an 8x8 kernel for all x8up sizes:
Change from next 2 patches to use 8x8 kernel for 8xH blocks and 16x8 kernel for wider blocks:
Measured on |
Total improvement compared to original scalar Rust implementation:
Seems very nice, I'd replace |
Noting here for future work. Perf report for 10-bit on Graviton2:
This roughly matches the architecture of the intrinsics: a central function that the rest fall into. Unlike the code generated from the intrinsics, this uses 8 temporary vector registers for width 8, rather than the minimum of 3. A width-16 kernel is also included, which leaves the width-8 kernel with only a simple loop.
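For readers unfamiliar with the "central kernel that every block size falls into" structure, here is a minimal scalar Rust sketch of the idea. It is not the assembly from this PR and not the project's actual Rust code: the names `hadamard8`, `satd8x8` and `satd_tiled` are hypothetical, the kernel covers 8-bit input only, and it returns the raw, unnormalized sum of absolute Hadamard coefficients (any scaling or rounding applied by the real implementation is left out).

```rust
/// In-place, unnormalized 8-point Walsh-Hadamard butterfly.
/// (Hypothetical helper for illustration only.)
fn hadamard8(v: &mut [i32; 8]) {
    let mut h = 1;
    while h < 8 {
        let mut i = 0;
        while i < 8 {
            for j in i..i + h {
                let (a, b) = (v[j], v[j + h]);
                v[j] = a + b;
                v[j + h] = a - b;
            }
            i += 2 * h;
        }
        h *= 2;
    }
}

/// Hypothetical central kernel: raw SATD of one 8x8 tile of 8-bit pixels.
/// Assumption: no normalization is applied to the result.
fn satd8x8(src: &[u8], src_stride: usize, dst: &[u8], dst_stride: usize) -> u32 {
    // Pixel differences for the tile.
    let mut d = [[0i32; 8]; 8];
    for y in 0..8 {
        for x in 0..8 {
            d[y][x] = src[y * src_stride + x] as i32 - dst[y * dst_stride + x] as i32;
        }
    }
    // Horizontal pass over each row.
    for row in d.iter_mut() {
        hadamard8(row);
    }
    // Vertical pass over each column, accumulating absolute coefficients.
    let mut sum = 0u32;
    for x in 0..8 {
        let mut col = [0i32; 8];
        for y in 0..8 {
            col[y] = d[y][x];
        }
        hadamard8(&mut col);
        sum += col.iter().map(|c| c.unsigned_abs()).sum::<u32>();
    }
    sum
}

/// Larger blocks fall into the 8x8 kernel by tiling.
/// Both `w` and `h` must be multiples of 8.
fn satd_tiled(
    src: &[u8], src_stride: usize,
    dst: &[u8], dst_stride: usize,
    w: usize, h: usize,
) -> u32 {
    assert!(w % 8 == 0 && h % 8 == 0);
    let mut total = 0u32;
    for y in (0..h).step_by(8) {
        for x in (0..w).step_by(8) {
            total += satd8x8(
                &src[y * src_stride + x..], src_stride,
                &dst[y * dst_stride + x..], dst_stride,
            );
        }
    }
    total
}
```

A wider kernel (such as the 16-wide one mentioned above) would replace the inner `x` loop for wide blocks, so the 8-wide kernel only has to loop over rows.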