Add optimized variant of CMN for HWC to HWC case (#4992)
This commit generalizes the optimized variants of the Hwc2Chw kernels by extracting the loading
(from gmem to smem) and the output writing (from smem to gmem) into separate functions
that are shared between the kernels.
As the input layout is the same, the same loading (and cropping) can be applied.
The output writing differs between CHW and HWC, but stays the same between
the cropping and no-cropping variants.
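
The split described above can be modeled on the CPU as a rough sketch. This is not the kernel code from this commit: `Roi`, `LoadTileHwc`, `StoreHwc` and `StoreChw` are hypothetical names, a `std::vector` stands in for the shared-memory tile, mirroring is left out, and the normalization is assumed to be `(x - mean) * inv_std` computed in float.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical crop window inside the input image (not a DALI type).
struct Roi { int y0, x0, h, w; };

// Shared loading stage: copy the cropped HWC region into a staging buffer.
// This stands in for the gmem -> smem load; passing a full-image Roi gives
// the no-cropping variant.
std::vector<float> LoadTileHwc(const std::vector<uint8_t>& in,
                               int H, int W, int C, Roi roi) {
  assert(roi.y0 >= 0 && roi.x0 >= 0);
  assert(roi.y0 + roi.h <= H && roi.x0 + roi.w <= W);
  std::vector<float> tile(static_cast<size_t>(roi.h) * roi.w * C);
  for (int y = 0; y < roi.h; y++)
    for (int x = 0; x < roi.w; x++)
      for (int c = 0; c < C; c++)
        tile[(static_cast<size_t>(y) * roi.w + x) * C + c] =
            in[(static_cast<size_t>(roi.y0 + y) * W + (roi.x0 + x)) * C + c];
  return tile;
}

// Output stage, HWC variant: normalize in float, keep the interleaved layout.
std::vector<float> StoreHwc(const std::vector<float>& tile, int C,
                            const std::vector<float>& mean,
                            const std::vector<float>& inv_std) {
  std::vector<float> out(tile.size());
  for (size_t i = 0; i < tile.size(); i++) {
    size_t c = i % C;
    out[i] = (tile[i] - mean[c]) * inv_std[c];
  }
  return out;
}

// Output stage, CHW variant: same arithmetic, but written as C planes.
std::vector<float> StoreChw(const std::vector<float>& tile, int C,
                            const std::vector<float>& mean,
                            const std::vector<float>& inv_std) {
  std::vector<float> out(tile.size());
  size_t plane = tile.size() / C;  // number of pixels in the tile
  for (size_t p = 0; p < plane; p++)
    for (int c = 0; c < C; c++)
      out[c * plane + p] = (tile[p * C + c] - mean[c]) * inv_std[c];
  return out;
}
```

In this model both output variants consume the tile produced by the same `LoadTileHwc`, and neither writer needs to know whether the tile came from a cropped or a full image.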

The sketch of the kernel is described in the docstring.

For HWC->HWC, planar storage of the tile in shared memory can be further evaluated.

This version provides up to a 2x speedup when running as the only operator within a pipeline
(for non-slicing cases).

The computations are done in float, as in the original kernel, because the benchmarks showed
no difference compared to using fp16.

Signed-off-by: Krzysztof Lecki <[email protected]>
klecki authored Aug 22, 2023
1 parent cce3c81 commit 301d1a6
Showing 4 changed files with 512 additions and 232 deletions.