Fix IDEFICS dtype #1214

Merged: 3 commits into huggingface:main from fix-idefics-dtype on Nov 23, 2023

Conversation

@vakker (Contributor) commented Oct 31, 2023

What does this PR do?

This forces the use of bfloat16 for IDEFICS. The issue is that with float16 the 80b model produces garbage output. Let me know if this solution is not appropriate and I'll adjust accordingly. For the details, see below.

The current behaviour:

$ curl 127.0.0.1:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json'
{"generated_text":""}

On closer inspection with:

import requests

headers = {"Content-Type": "application/json"}

query = "What is Deep Learning?"
data = {
    "inputs": query,
    "parameters": {
        "max_new_tokens": 10,
        "return_full_text": True,
        "decoder_input_details": True,
        "do_sample": False,
    },
}

api_url = "http://127.0.0.1:8080"
response = requests.post(api_url + "/generate", headers=headers, json=data).json()

for i in ['prefill', 'tokens']:
    print(f'### {i}')
    print(repr(''.join([t['text'] for t in response['details'][i]])))

Prints:

### prefill
'<s>WhatisDeepLearning?'
### tokens
'<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>'
########

With the change in this PR it prints:

### prefill
'<s>WhatisDeepLearning?'
### tokens
'\n\nDeep Learning is a subset of machine'

Note that using the Transformers implementation (with IdeficsForVisionText2Text.from_pretrained) also produces the latter (correct) output.
This only happens with the 80b model; the 9b model is not as sensitive to the dtype (as also mentioned in the code).
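
For illustration, here is a minimal Python sketch of the kind of override this PR applies in the model init. The class and argument names below are illustrative only, not the exact ones in text_generation_server:

import torch

class IDEFICSSharded:
    def __init__(self, model_id, dtype=None, quantize=None):
        # Sketch of the PR's behaviour: whatever dtype was requested (or left
        # as None), pin it to bfloat16, because the 80b checkpoint saturates
        # in float16 and only emits <unk> tokens.
        dtype = torch.bfloat16
        self.model_id = model_id
        self.dtype = dtype
        self.quantize = quantize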

The reason for "forcing" this in the IDEFICS init method is that, if quantization is used, the dtype cannot be set explicitly. And since it's left as None, it's set to float16 by default here. In other words, there's no other way to manually change the dtype if someone is using quantization (a simplified sketch of this logic follows the traceback below):

$ docker run .... ghcr.io/huggingface/text-generation-inference:latest --model-id HuggingFaceM4/idefics-80b-instruct --dtype bfloat16 --quantize bitsandbytes-nf4
.....
2023-10-31T12:42:26.710401Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-10-31T12:42:30.315734Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 80, in serve
    raise RuntimeError(

RuntimeError: Only 1 can be set between `dtype` and `quantize`, as they both decide how goes the final model.
 rank=0
Error: ShardCannotStart
2023-10-31T12:42:30.414010Z ERROR text_generation_launcher: Shard 0 failed to start
2023-10-31T12:42:30.414044Z  INFO text_generation_launcher: Shutting down shards
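
To make the constraint concrete, here is a rough sketch of the logic described above: the launcher rejects setting both flags, and when dtype is left as None the loading path falls back to float16. The function and names are assumptions for illustration, not the actual TGI source:

from typing import Optional
import torch

def resolve_dtype(dtype: Optional[str], quantize: Optional[str]) -> torch.dtype:
    # Simplified sketch of the behaviour described above, not the real code.
    if dtype is not None and quantize is not None:
        # The CLI refuses the combination outright, as in the traceback above.
        raise RuntimeError("Only 1 can be set between `dtype` and `quantize`, ...")
    if dtype is None:
        # With quantization enabled, dtype must stay None, so the model is
        # loaded with the float16 default and there is no way to override it.
        return torch.float16
    return torch.bfloat16 if dtype == "bfloat16" else torch.float16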

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil what do you think?

@Narsil (Collaborator) commented Nov 3, 2023

Hey, I checked it out.

Idefics works perfectly fine in both f16 and bf16.
The issue only seems to arise with bitsandbytes-nf4, so I'm guessing the issue is upstream, don't you think? I mean, users shouldn't have to guess whether they should use float16 or bfloat16 (and some models are better in float16 for stability, probably idefics too).

@vakker (Contributor, Author) commented Nov 3, 2023

@Narsil thanks for looking into it. Could you provide the exact steps that you used to check?

I tested it without the quantization using:

docker run --gpus all --shm-size 1g -p 9080:80 -v /home/user/.cache/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:latest --model-id HuggingFaceM4/idefics-80b-instruct --sharded true --num-shard 8

That gave the same result that I provided in the description: {"generated_text":""}.

To compare, I also tried with bfloat16, like:

docker run --gpus all --shm-size 1g -p 9080:80 -v /home/user/.cache/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:latest --model-id HuggingFaceM4/idefics-80b-instruct --sharded true --num-shard 8 --dtype bfloat16

That gave the same result that I provided in the description: {"generated_text":"\nDeep Learning is a subset of Machine Learning"}.

Also, to check other quantization, I tried eetq with:

docker run --gpus all --shm-size 1g -p 9080:80 -v /home/user/.cache/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:latest --model-id HuggingFaceM4/idefics-80b-instruct --sharded true --num-shard 4 --quantize eetq

That gave the same result that I provided in the description: {"generated_text":""}.

I also made sure that it's the latest Docker image (sha256:298404d862962d6d497a1d1aebb433a37d4e6fae96f2356c2f38b64ee844f15c) and the latest model revision (a14d258b1be2a74a3604483de552c33121a98391).

Regarding "users shouldn't have to guess if it's float16 or bfloat16 that they should use (and some models are better in float16 for stability, probably idefics too)": I agree it's quite inconvenient, but I'm not sure that's true for IDEFICS. The current implementation explicitly states the opposite (see here):

9b seems to work correctly enough in float16, but 80b seems to be really saturating for f16.

I have also seen cases where bfloat16 certainly worked better than float16.

So, please let me know your testing procedure, in case I missed anything important.

@vakker (Contributor, Author) commented Nov 15, 2023

@Narsil do you need more information to narrow down the issue?

@Narsil (Collaborator) commented Nov 23, 2023

Sorry, very busy elsewhere these last few weeks.

I was indeed able to reproduce it today, although I'm sure I also tested it last time and couldn't find any issue. Could be some CUDA version difference or something.

In any case, 80B does seem to suffer in float16 at inference, at least in some circumstances, so I'm happy to change the default.
I modified your PR so that users can still choose which dtype they want, but the default is now bfloat16, so it shouldn't require any configuration to get correct behavior.
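
Roughly, the merged change amounts to something like the following (an illustrative sketch only; the actual code lives in the IDEFICS loading path and may differ in detail):

from typing import Optional
import torch

def idefics_dtype(dtype: Optional[torch.dtype]) -> torch.dtype:
    # An explicit --dtype still wins, but when nothing is specified
    # IDEFICS now defaults to bfloat16 instead of float16.
    return dtype if dtype is not None else torch.bfloat16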

@Narsil merged commit c6bb767 into huggingface:main on Nov 23, 2023 (2 of 6 checks passed)
@vakker (Contributor, Author) commented Nov 26, 2023

@Narsil thanks for looking into it and verifying the issue.

As I mentioned in the original post:

The reason for "forcing" this in the IDEFICS init method, is because if quantization is used, then the dtype cannot be set explicitly. And since it's left as None, it's set to float16 by default here.

In other words, the way you changed it will still not work if any of the quantization options are used, so this issue is still not resolved.

You can use the same reproduction commands that I included before to verify this.

@Narsil (Collaborator) commented Nov 27, 2023

Indeed, the default is overridden earlier: #1287

@Narsil (Collaborator) commented Nov 27, 2023

The problem is worse than I thought.

9B gives garbage with bfloat16 and works fine with float16
80B is the other way around.

Quantization only works with float16 (definitely the case for all quantization methods aside from possibly bnb, though bnb is slow). It seems like the previous version of the code was the best.

I'll run a few more experiments to find a good default that fits both cases, without hard-coding the model name if possible.

@vakker (Contributor, Author) commented Nov 29, 2023

9B gives garbage with bfloat16 and works fine with float16

Could you provide the exact steps that you used to test that? I cannot reproduce it; for me, 9B works as expected with bfloat16.
I used:

docker run --gpus all --shm-size 1g -p 9080:80 -v /home/user/.cache/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:latest --model-id HuggingFaceM4/idefics-9b-instruct --dtype bfloat16

Then

curl localhost:9080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":10}}' -H 'Content-Type: application/json'

Which returned:

{"generated_text":"\n\nDeep learning is a subset of machine"}

PR #1287 helps; at least the default dtype is no longer forced in cli.py.

@vakker deleted the fix-idefics-dtype branch on November 29, 2023 at 11:27