Memory corruption caused by transformation ukernel #2414

Closed
ienkovich opened this issue Jan 15, 2025 · 7 comments · Fixed by #2436
@ienkovich

Summary

The transformation ukernel for a 4x4 BF16 block writes 80 bytes instead of the expected 32 bytes.

Version

Main branch, commit 281d20d

Environment

SPR CPU
Ubuntu 22.04

Steps to reproduce

Use ukernels for 4x4 BF16 input tensors on a machine with AMX-BF16 support.

Observed behavior

The transformation kernel writes 80 bytes to the output memory, while the data is only 32 bytes long. The generated code contains the following instructions:

   0x7ffff141e1dc:      vmovdqu16 (%rax),%zmm17{%k7}{z}
   0x7ffff141e1e2:      vmovdqu16 0x8(%rax),%zmm2{%k7}{z}
   0x7ffff141e1ec:      vinsertf64x4 $0x1,%ymm2,%zmm17,%zmm17
   0x7ffff141e1f3:      vpermw %zmm17,%zmm1,%zmm17
   0x7ffff141e1f9:      vmovups %zmm17,(%rbx)

Here, k7=0xf. The kernel correctly reads two rows, each with four elements, performs the permutation, and then writes 64 bytes instead of 16 bytes. Thus, it writes 48 extra zero bytes.
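
A scalar sketch of what this sequence appears to do for K = 4, N = 4, ldb = 4 (my reading of the disassembly above, not the library sources; the helper name is illustrative):

#include <cstdint>
#include <cstring>

// Each pair of input rows is interleaved into 8 bf16 elements, but the final
// store is a full, unmasked 64-byte vmovups, so 24 zero elements (coming from
// the zero-masked loads) are written after each packed row.
void observed_pack_4x4_bf16(const int16_t *src, int16_t *dst) {
  for (int p = 0; p < 2; ++p) {        // two row pairs for K = 4
    int16_t tmp[32] = {0};             // zero-filled, like the {k7}{z} loads
    for (int j = 0; j < 4; ++j) {
      tmp[2 * j]     = src[(2 * p) * 4 + j];      // row 2p
      tmp[2 * j + 1] = src[(2 * p + 1) * 4 + j];  // row 2p + 1
    }
    std::memcpy(dst + p * 8, tmp, sizeof(tmp));   // 64-byte store
  }
}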

Expected behavior

The transformation kernel doesn't write extra bytes to the output memory.

@ienkovich ienkovich added the sighting Suspicious library behavior. Should be promoted to a bug when confirmed label Jan 15, 2025
@vpirogov vpirogov added the platform:cpu-x64 Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64 label Jan 15, 2025
@vpirogov
Member

+@oneapi-src/onednn-cpu-x64, @dzarukin

@vpirogov
Member

@ienkovich, a reproducer would be very helpful in the investigation.

@ienkovich
Author

Here is a reproducer:

$ cat issue-2414.cpp
#include "oneapi/dnnl/dnnl_types.h"
#include "oneapi/dnnl/dnnl_ukernel.hpp"
#include "oneapi/dnnl/dnnl_ukernel_types.h"

#include <iostream>
#include <vector>

int main(int argc, const char **argv) {
  dnnl::ukernel::transform pack_kernel(
      4, 4,
      dnnl::ukernel::pack_type::no_trans,
      4, 4,
      dnnl::memory::data_type::bf16, dnnl::memory::data_type::bf16);
  pack_kernel.generate();

  std::vector<int16_t> src({
      1, 2, 3, 4,
      5, 6, 7, 8,
      9, 10, 11, 12,
      13, 14, 15, 16 });
  std::vector<int16_t> dst(40, -1);
  pack_kernel.execute(src.data(), dst.data());

  std::vector<int16_t> ref({
      1, 5, 2, 6, 3, 7, 4, 8,
      9, 13, 10, 14, 11, 15, 12, 16,
      -1, -1, -1, -1, -1, -1, -1, -1,
      -1, -1, -1, -1, -1, -1, -1, -1,
      -1, -1, -1, -1, -1, -1, -1, -1});
  for (size_t i = 0; i < ref.size(); ++i) {
    if (ref[i] != dst[i]) {
      std::cout << "FAILED\n  Mismatch at position " << i << ": expected " << ref[i] << ", got " << dst[i] << std::endl;
      return 1;
    }
  }
  std::cout << "PASS" << std::endl;
  return 0;
}
$ g++ issue-2414.cpp -ldnnl
$ ./a.out
FAILED
  Mismatch at position 16: expected -1, got 0

@dzarukin
Contributor

Hi @ienkovich, thank you for reporting this issue.
It comes from passing an invalid ldb to the transform routine. Running in debug mode would hit an assert that is, of course, not compiled into release builds.
The minimum allowed ldb is 16. Once that is used, everything works as expected.
I'll update the code to return an error when a disallowed ldb is passed.

@ienkovich
Author

Hello @dzarukin, could you please clarify this restriction?
ldb is just a stride, right? If I use a 4x4 input and a stride of 16, the kernel will still write past the output buffer.
Also, the documentation mentions no size/stride restrictions. Are there limits other than the one on ldb?

@dzarukin
Contributor

Hello @dzarukin, could you please clarify this restriction? ldb is just a stride, right? If I use a 4x4 input and a stride of 16, the kernel will still write past the output buffer. Also, the documentation mentions no size/stride restrictions. Are there limits other than the one on ldb?

Tensor B requires a special format so the computation can run on AMX. For bf16, this format takes every two consecutive rows of the original input and interleaves them: the 0th elements of both rows go together, then the 1st elements, and so on up to the 15th elements (16 elements from each row, 32 combined), forming a single row in the layout the AMX FMA instruction requires. Those 32 elements are 16 x 2 bf16 values, and that 16 is exactly the ldb size. The row can be made wider for performance reasons: ldb can be 16, 32, 48, or 64, which can bring benefits depending on various factors.
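
A scalar sketch of the interleave just described (illustrative only, not the library implementation), using ldb = 16 for bf16:

#include <cstdint>

// Illustrative helper (not library API): element j of row k and element j of
// row k + 1 are placed next to each other, forming one 32-element packed row
// as the AMX FMA instruction expects.
void pack_row_pair(const uint16_t *row_k, const uint16_t *row_k1,
                   uint16_t *packed_row /* 32 bf16 elements */) {
  for (int j = 0; j < 16; ++j) {
    packed_row[2 * j]     = row_k[j];
    packed_row[2 * j + 1] = row_k1[j];
  }
}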

This special format requires more memory, so no out-of-bound writes should happen.
The minimal size is ldb * div_up(K, 2) * sizeof(bf16), which is 64 bytes for K = 4 in your case, not 16 as you might anticipate.
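
In code, that formula reads as follows (div_up and min_dst_bytes are illustrative helpers restating the expression above, not library API):

#include <cstddef>
#include <cstdint>

constexpr std::size_t div_up(std::size_t a, std::size_t b) {
  return (a + b - 1) / b;
}
constexpr std::size_t min_dst_bytes(std::size_t K, std::size_t ldb) {
  return ldb * div_up(K, 2) * sizeof(uint16_t); // sizeof(bf16) == 2
}
static_assert(min_dst_bytes(4, 16) == 64, "K = 4, ldb = 16 -> 64 bytes");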

There are no other limits when it comes to transform arguments.

There's an example available here that covers all the APIs, including the transform routine; please refer to it for the nuances involved. Feel free to ask questions or provide feedback when needed.

@ienkovich
Author

You describe it as a restriction on ldb, which is a stride. But overall it sounds more like a restriction on a block size. If my block size is 4x4, then using ldb=16 makes no sense.

This special format requires more memory, so no out-of-bound writes should happen.

The VNNI format required by AMX shouldn't need more memory, and it doesn't require any specific strides. Your computation of the required size also assumes N == ldb; otherwise it is non-intuitive. Even if I use N=4 and ldb=16, why am I expected to provide an extra 48 bytes after the last row? What if I use ldb=1024? Am I supposed to allocate an extra kilobyte at the end of the output buffer?

Let's say I have an input BF16 buffer:

101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 
201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316
401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416

I use a 4x4 block size and ldb=16. I want to write into blocks of the output buffer to get:

101 201 102 202 103 203 104 204 105 205 106 206 107 207 108 208 109 209 110 210 111 211 112 212 113 213 114 214 115 215 116 216 
301 401 302 402 303 403 304 404 305 405 306 406 307 407 308 408 309 409 310 410 311 411 312 412 313 413 314 414 315 415 316 416

So I use something like:

for (int i = 0; i < 4; ++i)
  kernel.execute(in_ptr + 4 * i, out_ptr + 8 * i);

As a result, I would get a complete mess because each transform call would overwrite part of the previously written data. Also, some calls would write past the output buffer even though ldb is now 16.

I know this is not a typical use of the transform kernel; I just wanted to demonstrate that the behavior of this kernel may be quite unexpected. When I pack some data, I expect that data in the output buffer, not extra zero bytes written after it. It feels like the real restriction here is on the block sizes the kernel can process.
