Quantized GGUF style #1523
Conversation
Thanks for making this PR. It's really large and very hard to review, and it seems to contain a bunch of orthogonal things.
Branch force-pushed from d68e82e to b89c02e.
candle-core/src/tensor.rs (Outdated)
```diff
@@ -426,7 +426,9 @@ impl Tensor {
         if buffer_size != shape.elem_count() {
             return Err(Error::ShapeMismatch { buffer_size, shape }.bt());
         }
+        // println!("from vec {buffer_size}");
```
Probably want these gone
Thanks
```diff
@@ -1 +0,0 @@
-pub const LAYERNORM_KERNELS: &str = include_str!(concat!(env!("OUT_DIR"), "/layernorm_kernels.ptx"));
```
Intended?
Hmm, building without CUDA will do that, it seems.
candle-examples/examples/phi/main.rs (Outdated)
```diff
             &filenames[0],
             &device,
         )?;
+        println!("Loaded vb");
```
Missed
```diff
         let mut total_size_in_bytes = 0;
         for (_, tensor) in model.tensors.iter() {
             let elem_count = tensor.shape().elem_count();
             total_size_in_bytes +=
-                elem_count * tensor.dtype().type_size() / tensor.dtype().blck_size();
+                elem_count * tensor.dtype().type_size() / tensor.dtype().block_size();
```
Could make this a method?
It is a method, I don't get what you are implying here?
In this file we do `elem_count * tensor.dtype().type_size() / tensor.dtype().block_size()`. Could be a method: `tensor.size_in_bytes()`.
Maybe, but let's not do this in this PR, I think, no?
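For reference, a minimal sketch of the helper being suggested, assuming a `QTensor`-like type exposing `shape()` and `dtype()` accessors (the method name comes from the review comment, not from the crate):

```rust
impl QTensor {
    /// Storage size of the quantized data in bytes: quantized dtypes pack
    /// block_size() elements into type_size() bytes per block.
    pub fn size_in_bytes(&self) -> usize {
        let elem_count = self.shape().elem_count();
        elem_count * self.dtype().type_size() / self.dtype().block_size()
    }
}
```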
candle-metal-kernels/src/cast.metal (Outdated)
```diff
@@ -86,7 +86,6 @@ CAST(cast_i64_f32, cast_i64_f32_strided, int64_t, float)
 #endif

 #if defined(__HAVE_BFLOAT__)
-#if __METAL_VERSION__ >= 310
```
I think we still need this
No, because of `__HAVE_BFLOAT__` above.
candle-kernels/src/lib.rs (Outdated)
```rust
pub const INDEXING: &str = include_str!(concat!(env!("OUT_DIR"), "/indexing.ptx"));
pub const REDUCE: &str = include_str!(concat!(env!("OUT_DIR"), "/reduce.ptx"));
pub const TERNARY: &str = include_str!(concat!(env!("OUT_DIR"), "/ternary.ptx"));
pub const UNARY: &str = include_str!(concat!(env!("OUT_DIR"), "/unary.ptx"));
```
Intended?
Oops, nice catch, I must have messed up the build.
```rust
#[derive(Debug, Clone, Copy)]
pub enum GgmlDType {
    Q4_0,
```
https://github.com/rustformers/llm uses an FFI here to keep in line with ggml; something to consider for the future.
Why would we need FFI? This is purely for internal use, to call the correct kernel with the correct warps (each kernel is written with specific warps in mind).
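To illustrate the kind of internal dispatch being described, a hedged sketch; the kernel names and warp counts below are illustrative placeholders, not the actual table in `candle_metal_kernels`:

```rust
#[derive(Debug, Clone, Copy)]
pub enum GgmlDType {
    Q4_0,
    Q8_0,
    F16,
}

// Each ggml kernel is written for a specific threadgroup geometry, so the
// dtype selects both the kernel name and the warp configuration to launch.
fn ggml_gemm_kernel(dtype: GgmlDType) -> (&'static str, u64) {
    match dtype {
        GgmlDType::Q4_0 => ("kernel_mul_mv_q4_0_f32", 8),
        GgmlDType::Q8_0 => ("kernel_mul_mv_q8_0_f32", 8),
        GgmlDType::F16 => ("kernel_mul_mv_f16_f32", 4),
    }
}
```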
```rust
                r3
            )
        );
        encoder.set_threadgroup_memory_length(0, 8192);
```
8192 is fine; we could introduce a version check for this in the future: https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf
Yes, this is copy-pasted from ggml for now. And the kernels expect 8192.
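A minimal sketch of what that future check could look like, assuming the metal-rs crate exposes `max_threadgroup_memory_length()` on the device (treat the exact accessor name as an assumption):

```rust
use metal::{ComputeCommandEncoderRef, Device};

// The ggml kernels copied in this PR are written for an 8192-byte
// threadgroup allocation; check it against the device limit documented
// in the Metal Feature Set Tables before encoding.
fn set_ggml_threadgroup_mem(device: &Device, encoder: &ComputeCommandEncoderRef) {
    const GGML_TG_MEM: u64 = 8192;
    assert!(
        GGML_TG_MEM <= device.max_threadgroup_memory_length(),
        "device threadgroup memory limit is below what the ggml kernels expect"
    );
    encoder.set_threadgroup_memory_length(0, GGML_TG_MEM);
}
```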
```metal
    }
}

kernel void kernel_norm(
```
We can go faster ;)
This is not used :) If you're up for it, we could maybe discuss implementing something like this: https://github.com/huggingface/candle-rotary (but changing the implementation around so we could add a Metal backend to it).
Ping on the refactoring into a first smaller PR, as mentioned above. #1534 was actually doing something like this, but I'm not sure why it was closed.
@LaurentMazare have you looked at the PR? Aside from the gigantic file (copy-pasted from GGML without any modification), the PR is actually quite small. The only real logic is in `candle_metal_kernels::call_ggml_gemm`, and it's also relatively small: it just adapts the warps to the kernels as they are written in the GGML file.
I have looked at it indeed (and a bit in detail), hence my suggestion to split the new kernels out from a simple PR that would just add the device as a parameter for the quantized var store; it would make it a lot easier to read. Also, there seem to be some other unrelated changes that I would prefer to see in isolation.
- Add a device param wherever needed.
- Create a new QMetal storage type that implements QuantizedType.
- Update everywhere needed. Fix Python. Fix examples.
- Fix: fmt + clippy + stub. Move everything around; only missing the actual implems.
- Fix everything + add dequantize kernels. More work. Fix matmul. Fmt + clippy fixes. Working state.
- Q2K Metal -> bugged (also present in GGML).
- Q4K CPU -> bugged (present previously; a new test catches it).
- Q5K CPU -> bugged (present previously).
- Q8_1 both -> never really implemented, it seems.
- Q8K Metal -> never implemented in Metal.
- Fix Q2K bug (present in ggml).
Here it is: https://github.com/huggingface/candle/pull/1594/files. I closed the previous "small" PR because I had to switch a lot of things around while actually implementing. That also means moving away from the generics calls (I benched on my computer; it doesn't seem to make much of a difference).
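For context, a sketch of what the device-aware loading path looks like from the caller's side (assuming candle's `gguf_file` module; exact signatures may differ from the final PR):

```rust
use candle_core::quantized::gguf_file;
use candle_core::Device;

fn load_one(path: &str, name: &str) -> anyhow::Result<()> {
    // The device is now an explicit parameter, so the same loading code
    // can target CPU, CUDA, or Metal quantized storage.
    let device = Device::new_metal(0)?;
    let mut file = std::fs::File::open(path)?;
    let content = gguf_file::Content::read(&mut file)?;
    let qtensor = content.tensor(&mut file, name, &device)?;
    println!("{name}: {:?}", qtensor.shape());
    Ok(())
}
```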
Okay, I'm going to go ahead and merge this. It has been reviewed by @FL33TW00D. @LaurentMazare, all your comments on this PR and the smaller one are about splitting PRs, not really some conceptual problem with it in any way, so I'll assume everything is actually mostly OK, and we can fix whatever is wrong in other PRs if needed.
Please don't merge things like this without addressing my review comments first.
Working quantized state for candle.

High level overview:
- I think we should keep the surface for on-metal quantize/dequantize so we can easily implement them later. They are part of the ggml API imho.
- Tests use `test_device!` in order to get testing behavior similar to the regular `Tensor` tests (see the sketch at the end of this description).
- `quantized.metal` is a direct copy of ggml's `ggml-metal.metal`. This choice was made so further development can go faster and the bugs mentioned below can be imported more easily. All the glue logic is in `candle_metal_kernels`.
- Ggml uses different kernels based on the size of the matmul and hardware capacity. This wasn't implemented here, but could be with the current API.

Worthy bugs already discovered (not fixed in this PR since they do not belong here): Q2K Metal (also present in GGML), Q4K CPU (present previously; a new test catches it), Q5K CPU (present previously), Q8_1 (never really implemented, it seems), and Q8K Metal (never implemented in Metal).
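As an aside, the `test_device!` pattern mentioned in the overview looks roughly like this (a sketch only; the macro lives in candle's test utilities and its exact shape may differ):

```rust
use candle_core::{test_device, Device, Result};

// One test body written against &Device...
fn quantized_matmul(device: &Device) -> Result<()> {
    // build inputs on `device`, run the quantized op, assert outputs
    Ok(())
}

// ...expanded by the macro into one #[test] per backend.
test_device!(
    quantized_matmul,
    quantized_matmul_cpu,
    quantized_matmul_cuda,
    quantized_matmul_metal
);
```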