Metal: Improved reduce and softmax #1819
base: main
Conversation
Co-authored-by: Christopher Fleetwood <[email protected]>
…c which may be suboptimal.
Force-pushed from b6d251e to 8186cee.
cc: @EricLBuehler for vis
Looks pretty good indeed, @ivarflakstad. What is required before this can be merged, and is there any reason why it's still marked as draft? (Also, you may want to merge the latest changes from main, as there are some conflicting files.)
Well, I guess it's ready now. If anyone wants to test it on some other models etc., feel free 🙇
If this is just a copy of the old reduce.metal file and is not used, let's just remove the file; we can always get it from the git history.
How related is this to the reduce/softmax changes? If it's somewhat orthogonal, maybe this should be in a separate PR?
It is only related in the sense that I wanted to see if I could further improve performance.
With a previous version of this PR it was a noticeable improvement, but I just tested, and with the current kernels it's only a ~1% difference.
In other words, it's not needed at the moment.
This looks amazing, @ivarflakstad! I tried Llama 3.2 3b on current main, but on this branch (e8499c8) I'm getting an error which seems to indicate some numerical issues.
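As a reference point for debugging numerical issues like the one above, the usual stable formulation of softmax subtracts the row maximum before exponentiating, so large logits don't overflow to inf/NaN. A minimal CPU sketch in plain Rust (not the kernel in this PR, just a reference to compare outputs against):

```rust
// Numerically stable softmax over a 1-D slice: subtracting the max before
// exponentiating keeps every exp() argument <= 0, so nothing overflows.
fn softmax_stable(xs: &[f32]) -> Vec<f32> {
    let max = xs.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = xs.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    // A naive exp-then-normalize version would overflow on logits this large.
    let out = softmax_stable(&[1000.0, 1000.0, 1000.0]);
    println!("{:?}", out);
    assert!((out.iter().sum::<f32>() - 1.0).abs() < 1e-6);
}
```

Comparing a Metal kernel's output against a reference like this on inputs with large or degenerate logits is a quick way to tell a missing max-subtraction apart from an ordinary precision difference.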
// NOTE: Should this be removed? Softmax impls live in candle-nn.
fn softmax(a: &Tensor) -> candle_core::Result<()> {
@LaurentMazare
I'll have to remove this bench since it is Metal-specific (softmax lives in candle-nn ops, so it can't be called from candle-core), but I think softmax warrants a benchmark somehow. Thoughts?
I think moving the benchmark to candle-nn would be good (do it with git mv so as to preserve history).
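For reference, `git mv` stages a rename that `git log --follow` can track across the move. A self-contained demo in a throwaway repo (file and commit names are hypothetical, not the actual paths in this PR):

```shell
set -e
# Demonstrate that a file moved with `git mv` keeps its history.
repo=$(mktemp -d)
cd "$repo"
git init -q
echo 'fn bench() {}' > softmax_bench.rs
git add softmax_bench.rs
git -c user.email=a@b -c user.name=a commit -q -m "add bench"
mkdir candle-nn
git mv softmax_bench.rs candle-nn/softmax_bench.rs
git -c user.email=a@b -c user.name=a commit -q -m "move bench"
# --follow walks through the rename, so both commits show up:
git log --follow --oneline -- candle-nn/softmax_bench.rs
```

The final command lists both the "move bench" and the original "add bench" commits; without `--follow`, history for the new path would start at the move.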
Nice, thanks!
Improvements in throughput on my machine (150 GiB/s) using f32 ops.