Implement full feature of copy/gemm for PVC backend #174
base: sycl-develop
Conversation
Thanks for adding support for all the data layouts. I’ve left some comments.
Please don't force-push PRs when they are already being reviewed. It makes reviewing harder because one can't look at the commits added since the last review.
Implement Feature

1. Implement the full features of copy/MMA for the PVC backend.
Full copy/gemm functions were not implemented before this commit because the CUTLASS CuTe copy/MMA API is not fully compatible with the PVC backend. The register layout loaded by the PVC subgroup intrinsic does not satisfy the cute::gemm requirement, which leads to problems including but not limited to:
(1) GEMM can only support specific combinations of tile sizes and copy traits, and produces wrong results if you change the tile size configuration or the copy traits. For example, the case "examples/sycl/pvc/pvc_gemm.cpp" fails if you change sg_tile_k from 32 to 64. So we must retile the register data layout before cute::gemm.
(2) We have to hardcode changes to the register layout to satisfy the requirements of the CUTLASS CuTe APIs. For example, the data from "partition_fragment_B" needs to be hardcoded.

2. Support different GEMM layouts and data types.
(1) Support different combinations of RowMajor and ColumnMajor for matrices A and B. Refer to test/unit/cute/intel_xe/gemm_data_type.cpp.
(2) Add GEMM test cases for int8/uint8/fp16/bf16. Refer to test/unit/cute/intel_xe/gemm_layout.cpp.

This PR implements the features above while keeping performance unchanged.

Refine Code

1. Refine the layout convention for GEMM. For GEMM C = A x B, let A be (m, k, l), B be (n, k, l), and C be (m, n, l), and hide backend-related differences inside the implementation of the PVC copy traits (copy_traits_xe.hpp). This makes it easier for upper-level users to write code for Intel Xe GPUs following the usual CUTLASS conventions, without hardcoding for Intel Xe GPUs.

2. Refine the API "get_pvc_tensor". Before this PR, K-slicing and the coordinate tensor were mixed together, which made the interface parameters unclear and difficult to understand. Actually, K-slicing is for MMA use, while the coordinate tensor is only for copy. They are two different things and must be kept functionally independent, so we supply a helper function "append_pvc_tensor".
I pushed the wrong branch and have just restored it.
@@ -80,16 +80,49 @@ using PvcGemmBF16BF16FP32_RRR_5 = cutlass::gemm::device::GemmConfiguration<
    TiledMMA<MMAAtom, Layout<Shape<_1,_4,_1>>>,
    XE_2D_U16x8x32_LD_N, XE_2D_U16x32x32_LD_V>;

using PvcGemmBF16BF16FP32_RRR_6 = cutlass::gemm::device::GemmConfiguration<
RRR stands for the ABC data layout: R is row-major, C is column-major.
In this case it should be PvcGemmBF16BF16FP32_RCR.
@@ -497,4 +497,107 @@ gemm(MMA_Atom<MMA> const& mma,
  }
}

#if defined(SYCL_INTEL_TARGET)
Why is this needed?
@@ -2022,4 +2068,103 @@ namespace detail
  }
} // end namespace detail

template <class TiledCopy, class ThrIdx>
Why is all this needed?
  static constexpr Params
- to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) {
+ to_underlying_arguments(TensorMKL const & tensorA, TensorNKL const &tensorB, void* workspace) {
Why changing the API?
- for (int i = 0; i < SG_K / SubgroupSize; i++) {
-   cute::gemm(tiled_mma, accum, tCrA(_, _, i), tCrB(_, i, _), src_accum);
- }
+ cute::gemm(tiled_mma, gmem_tiled_copy_a, gmem_tiled_copy_b, tCrA, tCrB, accum);
Why does gemm need the copy operation?
What happens with src_accum?
static constexpr auto construct_mkl_tensor_A(Arguments const &args) {
  using LayoutA = cutlass::detail::StrideToLayoutTagA_t<StrideA>;

  auto [M, N, K, L] = args.problem_shape;

  if constexpr (std::is_same_v<LayoutA, cutlass::layout::RowMajor>) {
    return make_tensor(make_gmem_ptr(static_cast<ElementA const*>(args.mainloop.ptr_A)),
                       make_layout(make_shape(M,K,L), make_stride((int64_t)K, _1{}, (int64_t)M * K)));
  } else {
    return make_tensor(make_gmem_ptr(static_cast<ElementA const*>(args.mainloop.ptr_A)),
                       make_layout(make_shape(M,K,L), make_stride(_1{}, (int64_t)M, (int64_t)M * K)));
  }
}

static constexpr auto construct_nkl_tensor_B(Arguments const &args) {
  using LayoutB = cutlass::detail::StrideToLayoutTagB_t<StrideB>;

  auto [M, N, K, L] = args.problem_shape;

  if constexpr (std::is_same_v<LayoutB, cutlass::layout::RowMajor>) {
    return make_tensor(make_gmem_ptr(static_cast<ElementB const*>(args.mainloop.ptr_B)),
                       make_layout(make_shape(N,K,L), make_stride(_1{}, (int64_t)N, (int64_t)N * K)));
  } else {
    return make_tensor(make_gmem_ptr(static_cast<ElementB const*>(args.mainloop.ptr_B)),
                       make_layout(make_shape(N,K,L), make_stride((int64_t)K, _1{}, (int64_t)N * K)));
  }
}
This is not needed. StrideA and StrideB are already provided by the user in args.
There are many different changes in this PR. It would be much easier to review if you separated each change into a different PR.