
Enabled running Pallas Flash Attention on CPU. #922

Open · wants to merge 1 commit into main from flsh_cpu

Conversation

ds-hwang (Contributor)

Enabled running Pallas Flash Attention on CPU.

Pallas supports CPU simulation (interpret=True), so we can use the same
TPU Pallas kernel on CPU — making code debugging easier.
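For reference, here is a minimal sketch (not code from this PR) of what interpret mode looks like with pl.pallas_call; the kernel is just an illustrative element-wise add:

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl


def add_kernel(x_ref, y_ref, o_ref):
    # Element-wise add on the block handed to this kernel invocation.
    o_ref[...] = x_ref[...] + y_ref[...]


def add(x, y):
    # interpret=True runs the kernel through the Pallas interpreter, so the
    # same kernel source can be stepped through on CPU without a TPU/GPU.
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
        interpret=(jax.default_backend() == "cpu"),
    )(x, y)


x = jnp.arange(8, dtype=jnp.float32)
print(add(x, x))  # [ 0.  2.  4.  6.  8. 10. 12. 14.]
```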

This change lets the following unit tests run on CPU as if they were on TPU,
enabling easier testing and debugging:

  • axlearn/common/flash_attention/tpu_attention_test.py

Similarly, gpu_attention_test.py can also be run on CPU as if it were on GPU.

  • axlearn/common/flash_attention/gpu_attention_test.py

CI now covers these tests on CPU as well.
On an M3 Max MacBook Pro, test counts and run times are as follows:

  • axlearn/common/flash_attention/gpu_attention_test.py: 3024 passed, 1345 skipped in 200.38s (0:03:20)
  • axlearn/common/flash_attention/tpu_attention_test.py: 18 passed, 435 skipped in 34.82s

@ds-hwang ds-hwang requested review from ruomingp, markblee and a team as code owners January 14, 2025 04:59
ds-hwang (Contributor, Author)

@ruomingp Could you take a look? (From #975.)

ruomingp (Contributor) left a comment:


A few thoughts missed in earlier reviews...

@@ -152,6 +153,8 @@ def test_decode_against_ref(
kv_head_factor: int,
window_len: int,
):
if jax.default_backend() != "gpu" and seq_len > 1024:
ruomingp (Contributor):

Nit: can we check it against "cpu" directly instead of != "gpu"?

ds-hwang (Contributor, Author):

Yes, done.

@@ -346,6 +357,9 @@ def test_cudnn_against_triton_ref(
causal: bool,
dtype: jnp.dtype,
):
if jax.default_backend() == "cpu":
ruomingp (Contributor):

Likewise, let's avoid assuming that the backend is either gpu or cpu in multiple places.

Suggested change:
- if jax.default_backend() == "cpu":
+ if jax.default_backend() != "gpu":

ds-hwang (Contributor, Author):

I'll leave this code as-is, as you asked:

> Nit: can we check it against "cpu" directly instead of != "gpu"?

In addition, at the beginning of the file, only "gpu" and "cpu" are allowed, so == "cpu" is equivalent to != "gpu" in this code:

if jax.default_backend() not in ("gpu", "cpu"):

ruomingp (Contributor):

> In addition, at the beginning of the file, only "gpu" and "cpu" are allowed, so == "cpu" is equivalent to != "gpu" in this code.

I know you are making this assumption, but such a dependency is fragile: what if we extend the supported backends in the future?

In this case, requiring the backend to be "gpu" is both more robust and more readable. What's the downside?

Comment on lines +447 to +452
if jax.default_backend() == "cpu":
    pytest.skip(reason="cudnn function needs GPU.")
ruomingp (Contributor):

And here and elsewhere.

ds-hwang (Contributor, Author):

As mentioned above, I'll keep using jax.default_backend() == "cpu".

Comment on lines 100 to 95
- seq_len=[1024, 32768],
+ seq_len=[1024],
ruomingp (Contributor):

Since the sliding window size is 1024, it will be useful to keep a test case for seq_len > 1024. We can enable the test only on TPU if it's too slow on CPU. We can also use a seq_len such as 2048 for cpu if it's fast enough.

ds-hwang (Contributor, Author) · Jan 14, 2025:

Done. I changed it back to restore the first PR's code.

We had this thread in #975:

> @ruomingp Do we need to support seq_len up to 1024? If the block size is 128, supporting <= 256 should be enough?
>
> @ds-hwang Agreed. I removed the 32k test with this if-statement.
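For what it's worth, a hypothetical sketch of the pattern suggested above (keep a seq_len > 1024 case, but skip the very long one where interpret mode is too slow); the parameter values and test body are illustrative, not the actual test code:

```python
import jax
import pytest


@pytest.mark.parametrize("seq_len", [1024, 2048, 32768])
def test_sliding_window_attention(seq_len):
    # Keep a case above the 1024 sliding-window size, but skip the very long
    # sequence on CPU, where Pallas interpret mode would be too slow.
    if jax.default_backend() == "cpu" and seq_len > 2048:
        pytest.skip(reason="seq_len too large for interpret mode on CPU.")
    ...  # build inputs here and compare against a reference implementation
```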

softmax_scale=softmax_scale,
block_size=block_size,
interpret=(backend == "cpu"),
ruomingp (Contributor):

Given how often we do this across locations, I wonder if we can do the following:

  • Make interpret default to None (instead of False);
  • If it's None, assume interpret=True if the backend is "cpu";

WDYT?

ds-hwang (Contributor, Author):

Thank you for your suggestion. interpret=True applies only to the Pallas kernel. Therefore, having an interpret variable in the flash layer is not aligned with the appropriate level of abstraction: neither the JAX fallback nor the cudnn code path needs this variable.

Additionally, this line was added so contributors can easily debug the Pallas kernel on the CPU. For instance, changing the if statement to:

elif backend in ("cpu", "tpu"):

would allow debugging in layer_test.py.
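To illustrate the kind of dispatch being described, a hypothetical sketch follows; the function names are made up and do not match the actual axlearn code paths:

```python
import functools

import jax


def _cudnn_attention(q, k, v):
    ...  # GPU-only path; it has no notion of interpret mode


def _pallas_attention(q, k, v, *, interpret=False):
    ...  # TPU Pallas kernel; also runs on CPU when interpret=True


def select_attention_fn():
    backend = jax.default_backend()
    if backend == "gpu":
        return _cudnn_attention
    elif backend in ("cpu", "tpu"):
        # Same Pallas kernel on both backends; interpret only on CPU, which
        # is what makes kernel-level debugging possible there.
        return functools.partial(_pallas_attention, interpret=(backend == "cpu"))
    raise NotImplementedError(backend)
```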

ds-hwang (Contributor, Author) left a comment:

Thank you for the review. I have responded to all comments. Could you check it again?

@ds-hwang ds-hwang requested a review from ruomingp January 14, 2025 16:25
ruomingp (Contributor) left a comment:

It seems possible to support interpret=None to simplify the code, with the following behavior:

  • interpret=True/False: enable/disable interpret;
  • interpret=None: let the implementation choose whether to interpret, depending on whether pallas is used and running on cpu vs. accelerator;

But I don't want to block this PR, as we can simplify it later.
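A small sketch of the interpret=None behavior described above (not part of this PR); the helper name is illustrative:

```python
from typing import Optional

import jax


def resolve_interpret(interpret: Optional[bool] = None) -> bool:
    """Decides whether a Pallas kernel should run in interpret mode."""
    if interpret is not None:
        # An explicit True/False always wins.
        return interpret
    # None: let the implementation choose; interpret only when there is no
    # accelerator, i.e. the default backend is CPU.
    return jax.default_backend() == "cpu"
```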

@ds-hwang ds-hwang added this pull request to the merge queue Jan 16, 2025
ds-hwang (Contributor, Author):

Thank you for the review!

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 16, 2025
@ds-hwang ds-hwang added this pull request to the merge queue Jan 16, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 16, 2025
@ds-hwang ds-hwang enabled auto-merge January 16, 2025 20:01
@ds-hwang ds-hwang force-pushed the flsh_cpu branch 2 times, most recently from 0e39bdd to 3f4a177 Compare January 21, 2025 22:30