
Add Flash Attention backward to benchmarks/triton_kernels_benchmark #3108

Open · wants to merge 7 commits into base: main
Conversation

@ESI-SYD (Contributor) commented Jan 7, 2025

No description provided.

@ESI-SYD marked this pull request as draft on January 7, 2025 09:11
run: |
cd benchmarks/triton_kernels_benchmark
FA_KERNEL_MODE="bwd" \
BENCHMARKING_METHOD="ELAPSED_TIME" python flash_attention_fwd_benchmark.py --reports $REPORTS
Contributor Author (ESI-SYD):

The default UPSTREAM_PYTORCH_PROFILER method returns zero values for all providers in bwd mode, so this specifies the ELAPSED_TIME method instead (the one the 06 tutorial uses).
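
For context, a minimal sketch of what an elapsed-time style measurement looks like; this is an illustration only, not the benchmark suite's actual implementation, and it assumes an XPU-enabled PyTorch build where torch.xpu.synchronize() is available:

import time

import torch


def elapsed_time_ms(fn, n_warmup=10, n_repeat=100):
    # Warm up, then time n_repeat launches between device synchronizations.
    # Assumes torch.xpu.synchronize() exists (XPU-enabled PyTorch build).
    for _ in range(n_warmup):
        fn()
    torch.xpu.synchronize()
    start = time.perf_counter()
    for _ in range(n_repeat):
        fn()
    torch.xpu.synchronize()
    return (time.perf_counter() - start) * 1000.0 / n_repeat

Because the measurement brackets the calls with device synchronizations, it captures the backward kernels no matter which thread autograd launches them from, which is why it does not depend on the profiler's event tree.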

Contributor:

Do you know why? FYI @anmyachev

Contributor:

Maybe the launcher hasn't been rebuilt with the injected PyTorch? @ESI-SYD could you clean ~/.triton/cache and rerun the benchmarks with UPSTREAM_PYTORCH_PROFILER?

Contributor:

Cleaning ~/.triton/cache doesn't help.

Contributor:

> Cleaning ~/.triton/cache doesn't help.

for both Triton and XeTLA?

Contributor:

> > Cleaning ~/.triton/cache doesn't help.
>
> for both Triton and XeTLA?

Only for Triton.

Contributor Author (ESI-SYD):

Yes, for the Triton FA bwd kernels, __profile_kernel_of_func does not include the kernel execution time; XeTLA works.

[screenshot of the profiler output]

Maybe reverting commit c83c0ed (kernel_name deprecated) would help.
Should we keep only the elapsed_time method in the end?

Contributor:

Thanks for the information. I think we can leave elapsed_time for now; in the meantime I'll look into why UPSTREAM_PYTORCH_PROFILER mode doesn't work.

Contributor:

I think I figured out the reason: forward and backward run in different threads, so the kernels from backward are not included in the cpu_children of __profile_kernel_of_func. For example:

(Pdb) functions[0]
<FunctionEvent id=3 name=__profile_kernel_of_func device_type=DeviceType.CPU node_id=-1 cpu_time=88.778ms start_us=12366.193 end_us=101144.535 cpu_children=[] xpu_time=0.000us name=__profile_kernel_of_func thread=1 input_shapes=[] cpu_memory_usage=0 xpu_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>
(Pdb) functions[1]
<FunctionEvent id=524 name=__profile_kernel_of_func2 device_type=DeviceType.CPU node_id=-1 cpu_time=497.294us start_us=13000.303 end_us=13497.597 cpu_children=[525, 526] xpu_time=87.021ms name=__profile_kernel_of_func2 thread=2 input_shapes=[] cpu_memory_usage=0 xpu_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>

To take into account the kernels from the other thread (from the backward function), I added the following:

        from torch.profiler import record_function

        # Wrap the backward kernel launches in their own named profiler region
        # so their device time is still recorded even though autograd runs
        # them on a separate thread.
        with record_function("__profile_kernel_of_func2"):
            _attn_bwd_preprocess[pre_grid](
                o, do,  #
                delta,  #
                BATCH, N_HEAD, N_CTX,  #
                BLOCK_M=PRE_BLOCK, HEAD_DIM=ctx.HEAD_DIM  #
            )
            grid = (N_CTX // BLOCK_N1, 1, BATCH * N_HEAD)
            _attn_bwd[grid](
                q, arg_k, v, ctx.sm_scale, do, dq, dk, dv,  #
                M, delta,  #
                q.stride(0), q.stride(1), q.stride(2), q.stride(3),  #
                N_HEAD, N_CTX,  #
                BLOCK_M1=BLOCK_M1, BLOCK_N1=BLOCK_N1,  #
                BLOCK_M2=BLOCK_M2, BLOCK_N2=BLOCK_N2,  #
                BLK_SLICE_FACTOR=BLK_SLICE_FACTOR,  #
                HEAD_DIM=ctx.HEAD_DIM,  #
                num_warps=NUM_WARPS,  #
                num_stages=NUM_STAGES  #
            )

This problem can be solved by adding extra record_function calls, but maybe we should just avoid inheriting from torch.autograd.Function?
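
As an illustration of the thread issue (this is not code from the PR), here is a minimal sketch that aggregates time across all profiler events whose names start with the record_function label, regardless of which thread emitted them; the toy workload and the name prefix are assumptions for the example:

import torch
from torch.profiler import ProfilerActivity, profile, record_function


def run_step():
    x = torch.randn(512, 512, requires_grad=True)
    with record_function("__profile_kernel_of_func"):
        y = (x @ x).sum()
    # Backward kernels may be launched from an autograd worker thread,
    # so they end up under a separate top-level event.
    with record_function("__profile_kernel_of_func2"):
        y.backward()


with profile(activities=[ProfilerActivity.CPU]) as prof:
    run_step()

# Sum over every matching event instead of walking cpu_children of a single
# event; use device_time_total instead of cpu_time_total when profiling on a device.
total_us = sum(
    evt.cpu_time_total
    for evt in prof.events()
    if evt.name.startswith("__profile_kernel_of_func")
)
print(f"total measured time: {total_us:.1f} us")

With name-based aggregation, the per-thread split shown in the Pdb output above no longer hides the backward kernels' time.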

@ESI-SYD marked this pull request as ready for review January 8, 2025 02:53
@ESI-SYD linked an issue Jan 8, 2025 that may be closed by this pull request
Outdated review threads (resolved): scripts/test-triton.sh (×3), .github/workflows/triton-benchmarks.yml (×2)
Successfully merging this pull request may close this issue: Add Flash Attention backward to benchmarks/triton_kernels_benchmark
4 participants