Metal: Improved reduce and softmax #1819

Open · wants to merge 33 commits into main
Conversation

@ivarflakstad ivarflakstad commented Mar 8, 2024

  • Separates strided and contiguous reduce impls
  • Online softmax impl
  • Precompilation of metallib via build.rs
  • Block and simdgroup reductions
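The "online softmax" in the second bullet can be sketched in scalar form. This is a hedged illustration, not the PR's actual Metal kernel: the kernel would apply the same recurrence per thread before the simdgroup/block reduction, and all names here are illustrative.

```rust
// Online softmax: a single pass keeps a running max `m` and a running sum `d`
// of exp(x - m), rescaling `d` whenever the max grows. This avoids the
// separate max pass of the classic three-pass softmax.
fn online_softmax(xs: &[f32]) -> Vec<f32> {
    let (mut m, mut d) = (f32::NEG_INFINITY, 0.0f32);
    for &x in xs {
        let m_new = m.max(x);
        // Rescale the accumulated sum to the new max, then add this element.
        d = d * (m - m_new).exp() + (x - m_new).exp();
        m = m_new;
    }
    // Final normalization pass using the global max and sum.
    xs.iter().map(|&x| (x - m).exp() / d).collect()
}

fn main() {
    let ys = online_softmax(&[1.0, 2.0, 3.0]);
    println!("{ys:?}");
}
```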

Improvements in throughput on my machine (150 GiB/s) using f32 ops:

| op | before | after |
| --- | --- | --- |
| softmax | 21.453 GiB/s | 92.261 GiB/s |
| reduce | 14.861 GiB/s | 132.95 GiB/s |
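The metallib precompilation bullet can be sketched roughly as a `build.rs` that shells out to `xcrun` (compile `.metal` sources to AIR, then link them into a `.metallib`). File names and layout here are illustrative assumptions, not the PR's actual build script, and actually running the tools requires macOS with the Xcode command line tools.

```rust
/// Argument list for `xcrun` to compile one .metal file to AIR.
fn compile_args(src: &str, air: &str) -> Vec<String> {
    ["-sdk", "macosx", "metal", "-c", src, "-o", air]
        .iter().map(|s| s.to_string()).collect()
}

/// Argument list for `xcrun` to link AIR files into a .metallib.
fn link_args(airs: &[&str], lib: &str) -> Vec<String> {
    let mut v: Vec<String> = ["-sdk", "macosx", "metallib"]
        .iter().map(|s| s.to_string()).collect();
    v.extend(airs.iter().map(|s| s.to_string()));
    v.push("-o".to_string());
    v.push(lib.to_string());
    v
}

fn main() {
    // In a real build.rs these would actually run, e.g.:
    //   Command::new("xcrun").args(compile_args("src/reduce.metal", "reduce.air")).status()?;
    //   Command::new("xcrun").args(link_args(&["reduce.air"], "candle.metallib")).status()?;
    println!("cargo:rerun-if-changed=src/reduce.metal");
    println!("{:?}", link_args(&["reduce.air"], "candle.metallib"));
}
```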

ivarflakstad and others added 27 commits January 22, 2024 10:51
Co-authored-by: Christopher Fleetwood <[email protected]>
@ivarflakstad ivarflakstad force-pushed the ivarflakstad/metal-reduce-3 branch from b6d251e to 8186cee Compare March 8, 2024 14:50
@Vaibhavs10 (Member) left a comment

cc: @EricLBuehler for vis

@FL33TW00D (Contributor)

(screenshots: token generation speed before and after)

Before and after: such a big improvement in decode speed is worth a look 👀

@LaurentMazare (Collaborator)

Looks pretty good indeed. @ivarflakstad, what is required before this can be merged, and is there any reason why it's still marked as draft? (Also, you may want to merge in the latest changes from main, as there are some conflicting files.)

@ivarflakstad (Member, Author)

Well, I guess it's ready now. If anyone wants to test it on some other models etc., feel free 🙇

@ivarflakstad ivarflakstad marked this pull request as ready for review January 13, 2025 15:43
@ivarflakstad ivarflakstad changed the title [WIP] Metal: Improved reduce and softmax Metal: Improved reduce and softmax Jan 13, 2025
[Collaborator]

If this is just a copy of the old reduce.metal file and is not used, let's just remove the file and we can always get it from the git history.

[Collaborator]

How related is this to the reduce/softmax changes? If it's somewhat orthogonal, maybe this should be in a separate PR?

[Member Author]

It is only related in the sense that I wanted to see if I could further improve performance.
With a previous version of this PR it was a noticeable improvement, but I just tested with the current kernels and it's only a ~1% difference.

[Member Author]

In other words, it's not needed at the moment.

@EricLBuehler (Member)

This looks amazing @ivarflakstad!

I tried Llama 3.2 3b on current main:

cargo run --release --features metal --example llama -- --which v32-3b
    Finished `release` profile [optimized] target(s) in 0.22s
     Running `target/release/examples/llama --which v32-3b`
loading the model weights from meta-llama/Llama-3.2-3B
starting the inference loop
My favorite theorem is 2+2=4 and it’s a damn good one. However, it seems like maybe I may need to change my mind now that I’ve seen this.
The theorem states: The sum of the angles in any triangle add up to 180 degrees.
This sounds great but wait! What about the ones with an odd number?
Great! So does the following:
Okay so we know that the angle sums don’t quite add up in triangles but what if you took all triangles and put them together? Then… drum roll… the angles would add up to exactly 360 degrees!
Let’s start by drawing our first triangle.
Now let’s draw another one next to it:
We’ll do the same with a third, and then a fourth…
Finally we will make one more. (now the third point is on the line created from the 2nd and 4th points)
Well, when I first saw this theorem, it was indeed my favorite until now.
As a teacher of geometry I love taking complex problems and getting students excited about them. Much easier said than done though.

220 tokens generated (20.058381492864637 token/s)

But on this branch (e8499c8), I'm getting an error which seems to indicate some numerical issues?

cargo run --release --features metal --example llama -- --which v32-3b
    Finished `release` profile [optimized] target(s) in 0.25s
     Running `target/release/examples/llama --which v32-3b`
loading the model weights from meta-llama/Llama-3.2-3B
starting the inference loop
My favorite theorem is 2+2=4 and it’s a Euclidean Division Theorem!
What kind of joking are you referring to? A mathematician might not
Error: A weight is invalid in distribution

Comment on lines +15 to +16
// NOTE: Should this be removed? Softmax impls live in candle-nn.
fn softmax(a: &Tensor) -> candle_core::Result<()> {
[Member Author]

@LaurentMazare
I'll have to remove this bench since it's Metal-specific (softmax lives in candle-nn ops, so it can't be called from candle-core), but I think softmax warrants a benchmark somehow. Thoughts?

[Collaborator]

I think moving the benchmark to candle-nn would be good (do it with git mv so as to preserve history).

@ivarflakstad (Member, Author) commented Jan 13, 2025

Error: A weight is invalid in distribution

Nice, thanks!
We should probably accumulate with float in softmax to preserve precision. It shouldn't affect performance at all.
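The precision point can be illustrated in scalar Rust. This is a hedged sketch: stable Rust has no f16, so it uses f32 inputs with an f64 accumulator to show the same effect that accumulating half inputs in float would address in the Metal kernels.

```rust
// Summing many small values in the narrow type loses low-order bits once the
// accumulator grows; a wider accumulator keeps the addends' contribution.
fn sum_narrow(xs: &[f32]) -> f32 {
    xs.iter().copied().sum()
}

fn sum_wide(xs: &[f32]) -> f64 {
    xs.iter().map(|&x| x as f64).sum()
}

fn main() {
    // A million values of ~1e-4: the exact sum is ~100, but each f32 addition
    // to a ~100-sized accumulator rounds away most of the addend's bits.
    let xs = vec![1e-4f32; 1_000_000];
    println!("f32 accumulator: {}", sum_narrow(&xs));
    println!("f64 accumulator: {}", sum_wide(&xs));
}
```

The same reasoning applies to the softmax denominator: the sum of exponentials should be carried in float even when inputs and outputs are half.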

5 participants