Commit
Add optimized variant of CMN for HWC to HWC case (#4992)
This commit generalizes the optimized variants of the Hwc2Chw kernels by extracting the loading (from gmem to smem) and the output writing (from smem to gmem) into separate functions that are shared between the kernels. Because the input layout is the same, the same loading (and cropping) code can be applied. The output writing differs between CHW and HWC, but it stays the same between the cropping and no-cropping variants.

The sketch of the kernel is described in the docstring. For HWC->HWC, planar storage of the tile in shared memory can be further evaluated.

This version provides up to 2x speedup when running as the only operator within a pipeline (for non-slicing cases). The computations are done in float, as in the original kernel, because the benchmarks showed no difference compared to using fp16.

Signed-off-by: Krzysztof Lecki <[email protected]>
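The split described above (one shared load step, two layout-specific write steps) can be illustrated with a small CPU sketch. This is not the code from the commit: the names `LoadTile`, `WriteHwc`, `WriteChw` and `Normalize` are hypothetical, a `std::vector` stands in for the shared-memory tile, and cropping and mirroring are left out. It only assumes 3-channel `uint8_t` input in HWC layout and float output.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU stand-in for the kernel structure; not the DALI code.
constexpr std::size_t kChannels = 3;

// Shared step: copy `count` interleaved pixels, starting at pixel `first`,
// from the HWC input into a staging tile (stands in for shared memory).
inline void LoadTile(std::vector<uint8_t> &tile, const uint8_t *in,
                     std::size_t first, std::size_t count) {
  tile.assign(in + first * kChannels, in + (first + count) * kChannels);
}

// HWC output step: normalize in float and keep the channels interleaved.
inline void WriteHwc(float *out, const std::vector<uint8_t> &tile,
                     std::size_t first, const float *mean,
                     const float *inv_std) {
  for (std::size_t i = 0; i < tile.size(); i++) {
    std::size_t c = i % kChannels;
    out[first * kChannels + i] = (tile[i] - mean[c]) * inv_std[c];
  }
}

// CHW output step: same arithmetic, but each channel goes to its own plane.
inline void WriteChw(float *out, const std::vector<uint8_t> &tile,
                     std::size_t first, std::size_t num_pixels,
                     const float *mean, const float *inv_std) {
  for (std::size_t i = 0; i < tile.size(); i++) {
    std::size_t c = i % kChannels;
    std::size_t p = first + i / kChannels;
    out[c * num_pixels + p] = (tile[i] - mean[c]) * inv_std[c];
  }
}

// Driver: both output layouts reuse LoadTile and differ only in the writer.
template <bool ToChw>
std::vector<float> Normalize(const std::vector<uint8_t> &in, const float *mean,
                             const float *inv_std,
                             std::size_t tile_pixels = 4) {
  std::size_t num_pixels = in.size() / kChannels;
  std::vector<float> out(in.size());
  std::vector<uint8_t> tile;
  for (std::size_t first = 0; first < num_pixels; first += tile_pixels) {
    std::size_t count = std::min(tile_pixels, num_pixels - first);
    LoadTile(tile, in.data(), first, count);
    if (ToChw) {
      WriteChw(out.data(), tile, first, num_pixels, mean, inv_std);
    } else {
      WriteHwc(out.data(), tile, first, mean, inv_std);
    }
  }
  return out;
}
```

Calling `Normalize<false>(...)` and `Normalize<true>(...)` on the same input runs the identical load path and differs only in how the staged tile is written out, which is the property the commit message relies on for sharing code between the HWC->HWC and HWC->CHW variants.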