Fix IDEFICS dtype #1214
Conversation
Hey, I checked it out. Idefics works perfectly fine in both f16 and bf16.
@Narsil thanks for looking into it. Could you provide the exact steps that you used to check? I tested it without quantization using:

```
docker run --gpus all --shm-size 1g -p 9080:80 -v /home/user/.cache/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:latest --model-id HuggingFaceM4/idefics-80b-instruct --sharded true --num-shard 8
```

That gave the same result that I provided in the description.

To compare, I also tried with:

```
docker run --gpus all --shm-size 1g -p 9080:80 -v /home/user/.cache/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:latest --model-id HuggingFaceM4/idefics-80b-instruct --sharded true --num-shard 8 --dtype bfloat16
```

That gave the same result that I provided in the description.

Also, to check another quantization method, I tried:

```
docker run --gpus all --shm-size 1g -p 9080:80 -v /home/user/.cache/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:latest --model-id HuggingFaceM4/idefics-80b-instruct --sharded true --num-shard 4 --quantize eetq
```

That gave the same result that I provided in the description.

I also made sure that it's the latest Docker image.

Regarding "I mean users shouldn't have to guess if it's float16 or bfloat16 that they should use (and some models are better in float16 for stability, probably idefics too)": I agree, it's quite inconvenient, but I'm not sure that's true. The current implementation explicitly states the opposite (see here):
I have also seen cases where … So, please let me know your testing procedure, in case I missed anything important.
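For context, the kind of default being referenced looks roughly like this (an illustrative sketch only; the function name and exact condition are assumptions, not the real TGI source):

```python
import torch

def default_idefics_dtype(dtype=None):
    # Illustrative: pick a dtype when the user did not pass --dtype.
    # The point under discussion is that the in-tree comments suggest
    # bfloat16 for the 80b IDEFICS checkpoint, not float16.
    if dtype is not None:
        return dtype
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float32
```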
@Narsil do you need more information to narrow down the issue?
This reverts commit b8952b2.
Sorry, I've been very busy elsewhere these last few weeks. I was indeed able to reproduce it today, although last time I'm sure I also tested it and couldn't find any issue. It could be some CUDA version difference or something. In any case, 80B does seem to suffer in float16 during inference, at least in some circumstances; happy to change the default.
@Narsil thanks for looking into it and verifying the issue. As I mentioned in the original post:
That is, the way you changed it back will still not work if any of the quantization options are used, so this issue is still not resolved. You can use the same reproduction commands that I included before to verify this.
Indeed, the default is overridden earlier: #1287
The problem is worse than I thought. 9B gives garbage with bfloat16 and works fine with float16. Quantization only works with float16 (definitely the case for all quantization methods aside from possibly bnb, though bnb is slow). It seems like the previous version of the code was the best. I'll run a few more experiments to try to find a good default that fits both cases without hard-coding the model name, if possible.
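One possible shape for such a default (purely illustrative; the function name and rule are assumptions, not the final TGI behaviour):

```python
import torch

def pick_default_dtype(dtype=None, quantize=None):
    """Illustrative default: quantized kernels need float16,
    otherwise fall back to bfloat16, which the 80b model needs."""
    if dtype is not None:
        return dtype          # an explicit user choice always wins
    if quantize is not None:
        return torch.float16  # quantization kernels generally expect fp16
    return torch.bfloat16
```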
Could you provide the exact steps that you used to test that? I cannot reproduce it; for me, 9B works the same way with:

```
docker run --gpus all --shm-size 1g -p 9080:80 -v /home/user/.cache/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:latest --model-id HuggingFaceM4/idefics-9b-instruct --dtype bfloat16
```

Then:

```
curl localhost:9080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":10}}' -H 'Content-Type: application/json'
```

Which returned:
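The same check from Python, assuming the container started with the command above is listening on localhost:9080 (the port, prompt, and parameters come from the commands in this thread):

```python
import requests

resp = requests.post(
    "http://localhost:9080/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {"max_new_tokens": 10},
    },
    timeout=60,
)
resp.raise_for_status()
# The /generate endpoint returns a JSON object with the generated text.
print(resp.json()["generated_text"])
```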
The #1287 PR helps, at least the default |
What does this PR do?
This forces the use of `bfloat16` for IDEFICS. The issue is that with `float16` the 80b model gives garbage output. Let me know if this solution is not appropriate and I'll adjust accordingly. For the details, see below.

The current behaviour:
On closer inspection with:
Prints:
With the change in this PR it prints:
Note: using the Transformers implementation (with `IdeficsForVisionText2Text.from_pretrained`) produces the latter (correct) output as well.

This only happens with the 80b model; the 9b model is not as sensitive to the dtype (as also mentioned in the code).
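For comparison, a minimal sketch of the Transformers path mentioned above (the checkpoint name comes from this thread; the prompt and generation settings are placeholders, and loading the 80b checkpoint with `device_map="auto"` assumes enough GPU memory and `accelerate` installed):

```python
import torch
from transformers import AutoProcessor, IdeficsForVisionText2Text

checkpoint = "HuggingFaceM4/idefics-80b-instruct"
processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,  # bfloat16 avoids the garbage output seen in float16
    device_map="auto",
)

inputs = processor(["What is Deep Learning?"], return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```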
The reason for "forcing" this in the IDEFICS init method is that if quantization is used, the dtype cannot be set explicitly. And since it's left as `None`, it's set to `float16` by default here. In other words, there's no other way to manually change the dtype if someone is using quantization:
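Putting the two pieces together, the interaction looks roughly like this (a condensed, illustrative sketch; the function and class names are stand-ins, not the exact TGI code):

```python
import torch

def resolve_dtype(dtype=None):
    # When --quantize is passed, the launcher leaves dtype unset, and the
    # server then falls back to float16 -- the default this PR works around.
    return torch.float16 if dtype is None else dtype

class IdeficsModel:
    def __init__(self, model_id, dtype=None, quantize=None):
        # What this PR effectively does: override the dtype inside the
        # IDEFICS init so the 80b checkpoint runs in bfloat16 even when
        # --dtype cannot be passed (i.e. when quantization is enabled).
        dtype = torch.bfloat16
        self.dtype = dtype
        self.quantize = quantize
```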
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@Narsil what do you think?