Only random noise is generated with Flux + LoRA with optimum-quanto >= 0.2.5 #343
Comments
I encountered a similar issue. When using optimum-quanto==0.2.6 to quantize FLUX.1-schnell, the output also turned into random noise. After investigating, I found that the issue was caused by MarlinF8QBytesTensor. To fix it, you can modify optimum/quanto/tensor/weights/qbytes.py. Simply change the line:
But I don't know why MarlinF8QBytesTensor doesn't work. @dacorvo
@tyyff thank you for investigating this. See also #332. There might be a general issue with Marlin kernels when the size of the tensors involved in the matmul increases (could be an overflow, could be some overlaps in the intermediate result buffers, I really don't know). I will disable the FP8 kernel for now.
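Until a release with the FP8 Marlin kernel disabled is installed, a script can at least warn when it runs on one of the versions reported in this thread as producing noise (0.2.5 and 0.2.6). The following is a minimal sketch under stated assumptions: `parse_version`, `is_reported_affected`, and `installed_quanto_version` are hypothetical helpers, not part of optimum-quanto, and the version list reflects only what is reported here, so it may be incomplete.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported in this thread as producing random noise.
# Assumption: this set is taken from the thread only and may be incomplete.
REPORTED_AFFECTED = {(0, 2, 5), (0, 2, 6)}


def parse_version(text):
    """Parse a plain 'X.Y.Z' string into a tuple of ints.

    Assumption: no pre-release suffix such as 'rc1' in the first three parts.
    """
    return tuple(int(part) for part in text.split(".")[:3])


def is_reported_affected(text):
    """Return True if the version is one of those reported as affected."""
    return parse_version(text) in REPORTED_AFFECTED


def installed_quanto_version():
    """Return the installed optimum-quanto version string, or None."""
    try:
        return version("optimum-quanto")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    installed = installed_quanto_version()
    if installed is None:
        print("optimum-quanto is not installed")
    elif is_reported_affected(installed):
        print(f"optimum-quanto {installed} is reported to produce noise with FP8 weights")
    else:
        print(f"optimum-quanto {installed} is not in the reported list")
```

The check is deliberately an exact match on reported versions rather than a `>=` comparison, because the thread does not say which release contains the fix.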
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Do you know why the Hyper-FLUX.1-dev-8steps-lora is not effective in Flux-dev-fp8?
A race condition in the GPTQ Marlin Kernel has been fixed: vllm-project/vllm#11493.
Hello,
I am facing an issue when generating images with FLUX.1[dev] plus a LoRA that I trained with SimpleTuner. I need to be able to load LoRAs dynamically, so I want to quantize FLUX first and then load the LoRA into the already quantized model. With optimum-quanto version 0.2.4 and lower, I got the following error:
KeyError: 'time_text_embed.timestep_embedder.linear_1.weight._data'
After bumping the version to 0.2.5 or 0.2.6, no error is thrown, but the results look like this:
My code:
Is there a way to solve this? A workaround could be to load the LoRA into the model before quantization, save the quantized merged model, and work with that, but then I lose the benefit of working with the LoRA alone, which is much faster and uses less memory.
Thanks!
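For reference, the trade-off between the two approaches described above can be written out with the standard LoRA formulation. This is a sketch for illustration only: the symbols ($W$ for the base weight, $A$ and $B$ for the low-rank factors, $\alpha$ and $r$ for the LoRA scale and rank, $Q$ for quantization) are assumptions and do not come from the issue or from the optimum-quanto code.

```latex
% Dynamic LoRA on a quantized base: the base weight is quantized once,
% and the low-rank branch stays separate, so adapters can be swapped.
y = Q(W)\,x + \frac{\alpha}{r}\, B (A x)

% Merge-then-quantize workaround: the adapter is folded into the weight
% before quantization, so each adapter needs its own quantized model.
W' = W + \frac{\alpha}{r}\, B A, \qquad y = Q(W')\,x
```

In the first form only the small matrices $A$ and $B$ change per adapter. In the second, every adapter requires quantizing and storing a full copy of the weights, which is the cost the workaround incurs.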