Make fp8 compatible with tensor parallelism #65
base: main
Conversation
Stack from ghstack (oldest at bottom):
ghstack-source-id: db07e928f48cb886a86e017755ec4372c0f7ec3e
ghstack-comment-id: 2566319697
Pull Request resolved: #65
def mul_tiled(a, *bs):
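For context, a tiled-multiplication helper along these lines might look roughly as follows. This is only a sketch, assuming that each column of a 2-D scale `b` applies to an equally sized tile of `a`'s columns; it is not necessarily the PR's exact implementation:

```python
import torch

def mul_tiled(a: torch.Tensor, *bs: torch.Tensor) -> torch.Tensor:
    # Sketch: multiply `a` by each scale `b`. If `b` is (m, n) while `a` is
    # (m, k) with 1 < n < k, view `a` as (m, n, k // n) so that each entry of
    # `b` scales one contiguous tile of `a`'s columns; otherwise plain
    # broadcasting covers the (m, 1), (1, k) and scalar cases.
    for b in bs:
        if b.ndim == 2 and b.shape[-1] not in (1, a.shape[-1]):
            m, n = b.shape
            a = (a.reshape(m, n, -1) * b.unsqueeze(-1)).reshape(m, -1)
        else:
            a = a * b
    return a
```

With sub-row-wise scaling, the global `b` would be (m, num_shards) and take the tiled branch, while each local shard of `b` is (m, 1) and only needs the broadcasting branch, which is what the comment below is getting at.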
I understand that you need to apply such a function in the test of pytorch/pytorch#143760 to "manually" do tiled multiplication/division to compute scaled results.

Here the "if b is m x n" case only appears with DTensor sub-row-wise scaling, in which case the local tensor of `b` always has shape m x 1. So is it correct that:

- on L38, with `local_map`, we can always assume no tiled multiplication is needed; and
- on L46, if you're willing to also use `local_map`, tiled multiplication can be avoided too (rough sketch below)?
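If I'm reading the suggestion right, a `local_map`-based version could look roughly like the following. This is only a sketch: it assumes `torch.distributed.tensor.experimental.local_map`, a 1-D mesh, row-wise (`Shard(0)`) placements, and an illustrative name (`scale_locally`), none of which are taken from the PR itself.

```python
from torch.distributed.tensor import Shard
from torch.distributed.tensor.experimental import local_map

# Apply the scaling on the local shards directly: with sub-row-wise scaling
# the local scale is always (m_local, 1), so a plain broadcasted multiply is
# enough and no tiled view is needed.
scale_locally = local_map(
    lambda t, s: t * s,
    out_placements=[Shard(0)],               # output stays row-wise sharded
    in_placements=([Shard(0)], [Shard(0)]),  # both inputs row-wise sharded
)

# scaled = scale_locally(tensor_dtensor, scale_dtensor)
```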
Thanks for taking a look!
I am learning to use DTensors and I thought it was more idiomatic to express the calculation on the "global" distributed tensor, rather than on the local shard. In order to do so, however, we need to know how many shards there are and reshape accordingly, which arguably isn't that pretty either.
I believe the "fundamental" reason for it is that we're stacking the different components in the wrong order. Here we first replace the matmuls with our custom function, and then we propagate DTensors through it (which means our function needs to know how to handle DTensors). However, I believe the ideal solution would be to first propagate DTensors through some regular matmuls, then take the resulting graph and swap the local matmuls with our function. The issue is that I don't really know how to achieve that, and our code was already written this way before we started supporting DTensors.
(There's also another open question: how to integrate this with async-TP.)
As for `local_map`, this is currently an unfortunate implementation detail. Ideally the scaling is supposed to be done internally by the `_scaled_mm` operator, and that is indeed what it does! However, because the row-wise scaled-mm is slow (when using slow accum), we use the tensor-wise (un)scaled-mm and do the scaling ourselves (rough sketch of the two paths below). If we were able to make the row-wise scaled-mm faster we could avoid `local_map` altogether.
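For illustration, the two paths might look roughly like this. This is a sketch only: it assumes the PyTorch 2.4+ signature of the private `torch._scaled_mm` op and an fp8-capable GPU, and all names and shapes are made up rather than taken from this PR.

```python
import torch

M, K, N = 128, 256, 64
device = "cuda"
a = torch.randn(M, K, device=device).to(torch.float8_e4m3fn)
b = torch.randn(N, K, device=device).to(torch.float8_e4m3fn)
a_scale = torch.rand(M, 1, device=device) + 0.5  # row-wise scale for a
b_scale = torch.rand(1, N, device=device) + 0.5  # column-wise scale for b.t()

# Path 1: row-wise scaled-mm, scales applied inside the kernel
# (slow today when use_fast_accum=False, as discussed above).
out_rowwise = torch._scaled_mm(a, b.t(), a_scale, b_scale,
                               out_dtype=torch.bfloat16, use_fast_accum=False)

# Path 2 (roughly what this PR does): a tensor-wise scaled-mm with "no-op"
# scalar scales, followed by applying the row-wise scales manually.
one = torch.ones(1, device=device)
out = torch._scaled_mm(a, b.t(), one, one,
                       out_dtype=torch.float32, use_fast_accum=False)
out_manual = (out * a_scale * b_scale).to(torch.bfloat16)
```

In the DTensor sub-row-wise case, that manual scaling step is where `mul_tiled` (or a `local_map`-wrapped multiply) comes in.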
Thanks for the analysis! Makes a lot of sense to me!