Take token count quantization of fused attention into consideration for CP results correction #1396

xrennvidia · 2025-01-09T03:13:15Z

Description

Fused attention quantizes the token count. As a result, the T dimension of softmax_lse can differ from that of Q and O. For example, T=1000 and T=500 both produce T=1024 in softmax_lse, so we cannot assume that halving T also halves the softmax_lse shape. This PR fixes the CP results correction by taking this quantization into account.
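The mismatch can be illustrated with a small sketch. The rounding rule below is hypothetical: `quantize_token_count` and its `floor` parameter are made-up names, chosen only so that the result agrees with the example above (T=1000 and T=500 both map to 1024). It is not the rule the fused attention backend actually uses.

```python
def quantize_token_count(t: int, floor: int = 1024) -> int:
    """Hypothetical quantization: round t up to a power-of-two multiple
    of `floor`, never going below `floor`. Assumed for illustration only."""
    q = floor
    while q < t:
        q *= 2
    return q


full_t = 1000          # real token count of Q and O for the full sequence
half_t = full_t // 2   # real token count of the second half

# T dimension of softmax_lse for the full sequence under the assumed rule.
lse_full_t = quantize_token_count(full_t)

# Naive assumption: half the tokens means half the softmax_lse length.
naive_half_lse_t = lse_full_t // 2

# What the assumed quantization produces for the half sequence.
actual_half_lse_t = quantize_token_count(half_t)

print(lse_full_t, naive_half_lse_t, actual_half_lse_t)
```

Under this assumed rule the naive and actual lengths differ, so any correction step that derives the second-half shape from the full softmax_lse shape would index the wrong range. Working from the real token count rather than the padded dimension avoids the issue.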

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactor

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Xiaowei Ren <[email protected]>

for more information, see https://pre-commit.ci

xrennvidia · 2025-01-09T03:26:09Z

/te-ci pytorch L1

cyanguwa

LGTM

xrennvidia and others added 3 commits January 8, 2025 18:35

fix second half lse shape

f8a1394

Signed-off-by: Xiaowei Ren <[email protected]>

bug fixes

0e59706

Signed-off-by: Xiaowei Ren <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

e0a8920

for more information, see https://pre-commit.ci

xrennvidia changed the title ~~Take sequence quantization of fused attention into consideration for CP results correction~~ Take token count quantization of fused attention into consideration for CP results correction Jan 9, 2025

xrennvidia requested a review from cyanguwa January 9, 2025 19:16

cyanguwa approved these changes Jan 10, 2025


xrennvidia merged commit 7b861e7 into NVIDIA:main Jan 10, 2025
29 checks passed

xrennvidia deleted the xren/cp_lse branch January 10, 2025 19:37
