Memory corruption caused by transformation ukernel #2414

Closed
ienkovich opened this issue Jan 15, 2025 · 7 comments · Fixed by #2436
@ienkovich

Summary

The transformation ukernel for a 4x4 BF16 block writes 80 bytes instead of the expected 32 bytes.

Version

Main branch, commit 281d20d

Environment

SPR CPU
Ubuntu 22.04

Steps to reproduce

Use ukernels for 4x4 BF16 input tensors on a machine with AMX-BF16 support.

Observed behavior

The transformation kernel writes 80 bytes to the output memory, while the data is only 32 bytes long. The generated code contains the following instructions:

   0x7ffff141e1dc:      vmovdqu16 (%rax),%zmm17{%k7}{z}
   0x7ffff141e1e2:      vmovdqu16 0x8(%rax),%zmm2{%k7}{z}
   0x7ffff141e1ec:      vinsertf64x4 $0x1,%ymm2,%zmm17,%zmm17
   0x7ffff141e1f3:      vpermw %zmm17,%zmm1,%zmm17
   0x7ffff141e1f9:      vmovups %zmm17,(%rbx)

Here, k7=0xf. The kernel correctly reads two rows, each with four elements, performs the permutation, and then writes 64 bytes instead of 16 bytes. Thus, it writes 48 extra zero bytes.
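
A scalar sketch of what this sequence appears to do for K = 4, N = 4, ldb = 4 (my reading of the disassembly above, not the library sources; the helper name is illustrative):

#include <cstdint>
#include <cstring>

// Each pair of input rows is interleaved into 8 bf16 elements, but the final
// store is a full, unmasked 64-byte vmovups, so 24 zero elements (coming from
// the zero-masked loads) are written after each packed row.
void observed_pack_4x4_bf16(const int16_t *src, int16_t *dst) {
  for (int p = 0; p < 2; ++p) {        // two row pairs for K = 4
    int16_t tmp[32] = {0};             // zero-filled, like the {k7}{z} loads
    for (int j = 0; j < 4; ++j) {
      tmp[2 * j]     = src[(2 * p) * 4 + j];      // row 2p
      tmp[2 * j + 1] = src[(2 * p + 1) * 4 + j];  // row 2p + 1
    }
    std::memcpy(dst + p * 8, tmp, sizeof(tmp));   // 64-byte store
  }
}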

Expected behavior

The transformation kernel doesn't write extra bytes to the output memory.

@ienkovich ienkovich added the sighting Suspicious library behavior. Should be promoted to a bug when confirmed label Jan 15, 2025
@vpirogov vpirogov added the platform:cpu-x64 Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64 label Jan 15, 2025
@vpirogov
Member

+@oneapi-src/onednn-cpu-x64, @dzarukin

@vpirogov
Member

@ienkovich, a reproducer would be very helpful in the investigation.

@ienkovich
Author

Here is a reproducer:

$ cat issue-2414.cpp
#include "oneapi/dnnl/dnnl_types.h"
#include "oneapi/dnnl/dnnl_ukernel.hpp"
#include "oneapi/dnnl/dnnl_ukernel_types.h"

#include <iostream>
#include <vector>

int main(int argc, const char **argv) {
  dnnl::ukernel::transform pack_kernel(
      4, 4,
      dnnl::ukernel::pack_type::no_trans,
      4, 4,
      dnnl::memory::data_type::bf16, dnnl::memory::data_type::bf16);
  pack_kernel.generate();

  std::vector<int16_t> src({
      1, 2, 3, 4,
      5, 6, 7, 8,
      9, 10, 11, 12,
      13, 14, 15, 16 });
  std::vector<int16_t> dst(40, -1);
  pack_kernel.execute(src.data(), dst.data());

  std::vector<int16_t> ref({
      1, 5, 2, 6, 3, 7, 4, 8,
      9, 13, 10, 14, 11, 15, 12, 16,
      -1, -1, -1, -1, -1, -1, -1, -1,
      -1, -1, -1, -1, -1, -1, -1, -1,
      -1, -1, -1, -1, -1, -1, -1, -1});
  for (size_t i = 0; i < ref.size(); ++i) {
    if (ref[i] != dst[i]) {
      std::cout << "FAILED\n  Mismatch at position " << i << ": expected " << ref[i] << ", got " << dst[i] << std::endl;
      return 1;
    }
  }
  std::cout << "PASS" << std::endl;
  return 0;
}
$ g++ issue-2414.cpp -ldnnl
$ ./a.out
FAILED
  Mismatch at position 16: expected -1, got 0

@dzarukin
Contributor

Hi @ienkovich, thank you for reporting this issue.
It comes from passing an invalid ldb to the transform routine. Running in debug mode would hit an assert that is, of course, not compiled into release builds.
The minimum allowed ldb is 16. Once that is used, everything works as expected.
I'll update the code to return an error when a disallowed ldb is passed.

@ienkovich
Author

Hello @dzarukin, could you please clarify this restriction?
ldb is just a stride, right? If I use a 4x4 input and a stride of 16, the kernel will still write past the output buffer.
Also, the documentation mentions no size/stride restrictions. Are there limits other than the one on ldb?

@dzarukin
Contributor

Hello @dzarukin, could you please clarify this restriction? ldb is just a stride, right? If I use a 4x4 input and a stride of 16, the kernel will still write past the output buffer. Also, the documentation mentions no size/stride restrictions. Are there limits other than the one on ldb?

Tensor B requires a special format so the computation can run on AMX. For bf16, this format takes every two consecutive rows of the original input and interleaves them: the 0th elements of both rows go together, then the 1st elements, and so on up to the 15th elements (16 elements from each row, 32 combined), forming a single row in the layout the AMX FMA instruction requires. Those 32 elements are 16 x 2 bf16 values, and that 16 is exactly the ldb size. The row can be made wider for performance reasons: ldb can be 16, 32, 48, or 64, which can bring benefits depending on various factors.
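
A scalar sketch of the interleave just described (illustrative only, not the library implementation), using ldb = 16 for bf16:

#include <cstdint>

// Illustrative helper (not library API): element j of row k and element j of
// row k + 1 are placed next to each other, forming one 32-element packed row
// as the AMX FMA instruction expects.
void pack_row_pair(const uint16_t *row_k, const uint16_t *row_k1,
                   uint16_t *packed_row /* 32 bf16 elements */) {
  for (int j = 0; j < 16; ++j) {
    packed_row[2 * j]     = row_k[j];
    packed_row[2 * j + 1] = row_k1[j];
  }
}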

This special format requires more memory, so no out-of-bound writes should happen.
The minimal size is ldb * div_up(K, 2) * sizeof(bf16), which is 64 bytes for K = 4 in your case, not 16 as you might anticipate.
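
In code, that formula reads as follows (div_up and min_dst_bytes are illustrative helpers restating the expression above, not library API):

#include <cstddef>
#include <cstdint>

constexpr std::size_t div_up(std::size_t a, std::size_t b) {
  return (a + b - 1) / b;
}
constexpr std::size_t min_dst_bytes(std::size_t K, std::size_t ldb) {
  return ldb * div_up(K, 2) * sizeof(uint16_t); // sizeof(bf16) == 2
}
static_assert(min_dst_bytes(4, 16) == 64, "K = 4, ldb = 16 -> 64 bytes");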

There are no other limits when it comes to transform arguments.

There's an example available here that covers all the APIs, including the transform routine; please refer to it for the nuances involved. Feel free to ask questions or provide feedback when needed.

@ienkovich
Author

You describe it as a restriction on ldb, which is a stride. But overall it sounds more like a restriction on a block size. If my block size is 4x4, then using ldb=16 makes no sense.

This special format requires more memory, so no out-of-bound writes should happen.

The VNNI format required by AMX shouldn't need more memory, and it doesn't require any specific strides. Your computation of the required size also assumes N == ldb; otherwise it is non-intuitive. Even if I use N=4 and ldb=16, why am I expected to provide an extra 48 bytes after the last row? What if I use ldb=1024? Am I supposed to allocate an extra kilobyte at the end of the output buffer?

Let's say I have an input BF16 buffer:

101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 
201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316
401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416

I use a 4x4 block size and ldb=16. I want to write into blocks of the output buffer to get:

101 201 102 202 103 203 104 204 105 205 106 206 107 207 108 208 109 209 110 210 111 211 112 212 113 213 114 214 115 215 116 216 
301 401 302 402 303 403 304 404 305 405 306 406 307 407 308 408 309 409 310 410 311 411 312 412 313 413 314 414 315 415 316 416

So I use something like:

for (int i = 0; i < 4; ++i)
  kernel.execute(in_ptr + 4 * i, out_ptr + 8 * i);

As a result, I would get a complete mess because each transform call would overwrite part of the previously written data. Also, some calls would write past the output buffer even though ldb is now 16.

I know this is not a typical use of the transform kernel; I just wanted to demonstrate that the behavior of this kernel may be quite unexpected. When I pack some data, I expect that data in the output buffer, not extra zero bytes written after it. It feels like the real restriction here is on the block sizes the kernel can process.
