Add optimized variant of CMN for HWC to HWC case (#4992) · NVIDIA/DALI@301d1a6

Add optimized variant of CMN for HWC to HWC case (#4992)

This commit generalizes the optimized variants of the Hwc2Chw kernels by extracting the load
(from gmem to smem) and the output write (from smem to gmem) into separate functions
that are shared between the kernels.
Because the input layout is the same, the same loading (and cropping) code can be reused.
The output writing differs between CHW and HWC, but for each layout it is identical in
the cropping and no-cropping variants.
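
The split described above can be illustrated with a CPU-side C++ sketch. This is not the
CUDA kernel from this commit: the names `Tile`, `LoadTile`, `StoreChw` and `StoreHwc` are
hypothetical, a plain vector stands in for shared memory, mirroring is left out, and the
normalization is assumed to have the form `in * mul + add`, computed in float.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tile description: a crop window inside an interleaved HWC image.
struct Tile {
  int in_w;    // input row width, in pixels
  int x0, y0;  // top-left corner of the crop window
  int w, h;    // crop window size, in pixels
  int c;       // number of channels
};

// Stage 1, shared by both output layouts: copy the cropped HWC window into a
// staging buffer (on the GPU this would be the gmem -> smem load).
std::vector<uint8_t> LoadTile(const std::vector<uint8_t>& in, const Tile& t) {
  std::vector<uint8_t> staged(static_cast<size_t>(t.h) * t.w * t.c);
  for (int y = 0; y < t.h; y++)
    for (int x = 0; x < t.w; x++)
      for (int ch = 0; ch < t.c; ch++)
        staged[(y * t.w + x) * t.c + ch] =
            in[((t.y0 + y) * t.in_w + (t.x0 + x)) * t.c + ch];
  return staged;
}

// Assumed normalization form, evaluated in float.
inline float Normalize(uint8_t v, float mul, float add) { return v * mul + add; }

// Stage 2a: write the staged tile as planar CHW.
std::vector<float> StoreChw(const std::vector<uint8_t>& staged, const Tile& t,
                            const std::vector<float>& mul,
                            const std::vector<float>& add) {
  assert(mul.size() == static_cast<size_t>(t.c) && add.size() == mul.size());
  std::vector<float> out(staged.size());
  for (int ch = 0; ch < t.c; ch++)
    for (int y = 0; y < t.h; y++)
      for (int x = 0; x < t.w; x++)
        out[(ch * t.h + y) * t.w + x] =
            Normalize(staged[(y * t.w + x) * t.c + ch], mul[ch], add[ch]);
  return out;
}

// Stage 2b: write the staged tile as interleaved HWC (same element order as staged).
std::vector<float> StoreHwc(const std::vector<uint8_t>& staged, const Tile& t,
                            const std::vector<float>& mul,
                            const std::vector<float>& add) {
  assert(mul.size() == static_cast<size_t>(t.c) && add.size() == mul.size());
  std::vector<float> out(staged.size());
  for (size_t i = 0; i < staged.size(); i++)
    out[i] = Normalize(staged[i], mul[i % t.c], add[i % t.c]);
  return out;
}
```

Only the store stage depends on the output layout. A no-cropping variant would differ only
in the load stage (a window covering the whole input), so each writer is reused unchanged.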

The sketch of the kernel is described in the docstring.

For HWC->HWC, planar storage of the tile in shared memory could be evaluated further.

This version provides up to a 2x speedup when running as the only operator in the pipeline
(for non-slicing cases).

The computations are done in float, as in the original kernel, because the benchmarks showed
no difference compared to using fp16.

Signed-off-by: Krzysztof Lecki <[email protected]>

klecki authored Aug 22, 2023

1 parent cce3c81 commit 301d1a6
