Memory corruption caused by transformation ukernel #2414
Comments
+@oneapi-src/onednn-cpu-x64, @dzarukin
@ienkovich, a reproducer would be very helpful in the investigation.
Here is a reproducer:
Hi @ienkovich, thank you for reporting an issue.
Hello @dzarukin, could you please clarify this restriction?
Tensor B requires a special format for the computation to run on AMX. For bf16, this format takes every two consecutive rows of the original input and interleaves them: the 0th elements are put together, then the 1st elements, and so on up to the 15th elements (16 elements from a single row, 32 combined), forming a single row as the AMX FMA instruction requires. Now, this 32 is This special format requires more memory, thus, no out-of-bound writes should happen. There are no other limits on the transform arguments. An example that uses all the APIs, including the transform routine, is available here; please refer to it for the nuances that apply. Feel free to ask questions or provide feedback when needed.
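To make the layout described above concrete, here is a minimal scalar sketch of the row-pair interleaving for 16-bit elements. It is not the oneDNN transform routine: the function name `pack_vnni_bf16` is hypothetical, the input is assumed to be a dense row-major K x N buffer, and the output is assumed to be packed densely with no padding to a fixed row width.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference packing, for illustration only.
// src: dense row-major K x N buffer of 16-bit (bf16) values.
// Returns a dense ceil(K/2) x (2*N) buffer in which each output row
// interleaves two consecutive input rows element by element:
//   out[k/2][2*n + 0] = src[2*(k/2)    ][n]
//   out[k/2][2*n + 1] = src[2*(k/2) + 1][n]
// If K is odd, the missing second row is left as zeros.
std::vector<std::uint16_t> pack_vnni_bf16(
        const std::vector<std::uint16_t> &src, std::size_t K, std::size_t N) {
    const std::size_t k_pairs = (K + 1) / 2;
    std::vector<std::uint16_t> dst(k_pairs * 2 * N, 0);
    for (std::size_t k = 0; k < K; ++k) {
        for (std::size_t n = 0; n < N; ++n) {
            dst[(k / 2) * (2 * N) + 2 * n + (k % 2)] = src[k * N + n];
        }
    }
    return dst;
}
```

Under these assumptions, a 4x4 input holding the values 0 through 15 packs into two rows, `0 4 1 5 2 6 3 7` and `8 12 9 13 10 14 11 15`, which is 16 elements (32 bytes) in total.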
You describe it as a restriction on
The VNNI format required by AMX shouldn't require more memory, and it doesn't require any specific strides. And your computation of the required size assumes
Let's say I have an input BF16 buffer:
I use block size 4x4,
So I use something like:
As a result, I would get a complete mess, because each transform call would overwrite part of the previously written data. Some calls would also write past the end of the output buffer, even though ldb is 16 now. I know this is not a typical use of the transform kernel. I just wanted to demonstrate that the behavior of this kernel may be quite unexpected. If I pack some data, I expect that data in the output buffer, and I don't expect extra zero bytes to be written. It feels like the real restriction here is on the block sizes that the kernel can process.
Summary
The transformation ukernel for a 4x4 BF16 block writes 80 bytes instead of the expected 32 bytes
Version
Main branch, commit 281d20d
Environment
SPR CPU
Ubuntu 22.04
Steps to reproduce
Use ukernels for 4x4 BF16 input tensors on a machine with AMX-BF16 support.
Observed behavior
The transformation kernel writes 80 bytes to the output memory, while the data is only 32 bytes long. The generated code has the following encoding instructions:
Here, k7=0xf. The kernel correctly reads two rows, each with four elements, performs the permutation, and then writes 64 bytes instead of 16 bytes. Thus, it writes 48 extra zero bytes.
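The reported numbers can be reproduced with a little arithmetic. The sketch below is one reading of the report, not something the source confirms: it assumes the packed rows are laid out back to back (16 bytes apart for this block) and that every row is written with a full 64-byte store. The `WriteExtent` struct and `transform_extent` function are hypothetical names used only for this illustration.

```cpp
#include <cstddef>

// Hypothetical helper, for illustration only.
struct WriteExtent {
    std::size_t expected_bytes; // bytes of packed data the caller expects
    std::size_t touched_bytes;  // bytes from the first to the last byte written
};

// K x N block of elem_size-byte elements, packed in row pairs.
// Assumes K >= 1, densely packed output rows, and that each packed row
// is written with a store of store_bytes bytes (64 for a full ZMM store).
constexpr WriteExtent transform_extent(std::size_t K, std::size_t N,
        std::size_t elem_size, std::size_t store_bytes) {
    const std::size_t k_pairs = (K + 1) / 2;
    const std::size_t row_bytes = 2 * N * elem_size; // one packed row
    const std::size_t last_store
            = store_bytes > row_bytes ? store_bytes : row_bytes;
    return WriteExtent {k_pairs * row_bytes,
            (k_pairs - 1) * row_bytes + last_store};
}
```

For the 4x4 BF16 block, `transform_extent(4, 4, 2, 64)` gives 32 expected bytes and 80 touched bytes: the first store covers bytes 0 to 63, the second starts at offset 16 and covers bytes 16 to 79. Each store carries 16 bytes of data and 48 bytes of zeros, which matches the observation above.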
Expected behavior
The transformation kernel does not write extra bytes to the output memory.