Fix IDEFICS dtype #1214

Merged: 3 commits into huggingface:main from fix-idefics-dtype on Nov 23, 2023

Conversation

@vakker (Contributor) commented Oct 31, 2023

What does this PR do?

This forces the use of bfloat16 for IDEFICS. The issue is that with float16 the 80b model produces garbage output. Let me know if this solution is not appropriate and I'll adjust accordingly. For the details, see below.

The current behaviour:

$ curl 127.0.0.1:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json'
{"generated_text":""}

On closer inspection with:

import requests

headers = {"Content-Type": "application/json"}

query = "What is Deep Learning?"
data = {
    "inputs": query,
    "parameters": {
        "max_new_tokens": 10,
        "return_full_text": True,
        "decoder_input_details": True,
        "do_sample": False,
    },
}

api_url = "http://127.0.0.1:8080"
response = requests.post(api_url + "/generate", headers=headers, json=data).json()

for i in ['prefill', 'tokens']:
    print(f'### {i}')
    print(repr(''.join([t['text'] for t in response['details'][i]])))

Prints:

### prefill
'<s>WhatisDeepLearning?'
### tokens
'<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>'
########

With the change in this PR it prints:

### prefill
'<s>WhatisDeepLearning?'
### tokens
'\n\nDeep Learning is a subset of machine'

Note that using the Transformers implementation (with IdeficsForVisionText2Text.from_pretrained) also produces the latter (correct) output.
This only happens with the 80b model; the 9b model is not as sensitive to the dtype (as also mentioned in the code).
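
For illustration, here is a minimal Python sketch of the kind of override this PR applies in the model init. The class and argument names below are illustrative only, not the exact ones in text_generation_server:

import torch

class IDEFICSSharded:
    def __init__(self, model_id, dtype=None, quantize=None):
        # Sketch of the PR's behaviour: whatever dtype was requested (or left
        # as None), pin it to bfloat16, because the 80b checkpoint saturates
        # in float16 and only emits <unk> tokens.
        dtype = torch.bfloat16
        self.model_id = model_id
        self.dtype = dtype
        self.quantize = quantize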

The reason for "forcing" this in the IDEFICS init method is that, if quantization is used, the dtype cannot be set explicitly. And since it's left as None, it's set to float16 by default here. In other words, there's no other way to manually change the dtype if someone is using quantization (a simplified sketch of this logic follows the traceback below):

$ docker run .... ghcr.io/huggingface/text-generation-inference:latest --model-id HuggingFaceM4/idefics-80b-instruct --dtype bfloat16 --quantize bitsandbytes-nf4
.....
2023-10-31T12:42:26.710401Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-10-31T12:42:30.315734Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 80, in serve
    raise RuntimeError(

RuntimeError: Only 1 can be set between `dtype` and `quantize`, as they both decide how goes the final model.
 rank=0
Error: ShardCannotStart
2023-10-31T12:42:30.414010Z ERROR text_generation_launcher: Shard 0 failed to start
2023-10-31T12:42:30.414044Z  INFO text_generation_launcher: Shutting down shards
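
To make the constraint concrete, here is a rough sketch of the logic described above: the launcher rejects setting both flags, and when dtype is left as None the loading path falls back to float16. The function and names are assumptions for illustration, not the actual TGI source:

from typing import Optional
import torch

def resolve_dtype(dtype: Optional[str], quantize: Optional[str]) -> torch.dtype:
    # Simplified sketch of the behaviour described above, not the real code.
    if dtype is not None and quantize is not None:
        # The CLI refuses the combination outright, as in the traceback above.
        raise RuntimeError("Only 1 can be set between `dtype` and `quantize`, ...")
    if dtype is None:
        # With quantization enabled, dtype must stay None, so the model is
        # loaded with the float16 default and there is no way to override it.
        return torch.float16
    return torch.bfloat16 if dtype == "bfloat16" else torch.float16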

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil what do you think?

@Narsil (Collaborator) commented Nov 3, 2023

Hey, I checked it out.

Idefics works perfectly fine in both f16 and bf16.
The issue only seems to arise with bitsandbytes-nf4, so I'm guessing the issue is upstream, don't you think? I mean, users shouldn't have to guess whether they should use float16 or bfloat16 (and some models are better in float16 for stability, probably idefics too).

@vakker (Contributor, Author) commented Nov 3, 2023

@Narsil thanks for looking into it. Could you provide the exact steps that you used to check?

I tested it without the quantization using:

docker run --gpus all --shm-size 1g -p 9080:80 -v /home/user/.cache/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:latest --model-id HuggingFaceM4/idefics-80b-instruct --sharded true --num-shard 8

That gave the same result that I provided in the description: {"generated_text":""}.

To compare, I also tried with bfloat16, like:

docker run --gpus all --shm-size 1g -p 9080:80 -v /home/user/.cache/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:latest --model-id HuggingFaceM4/idefics-80b-instruct --sharded true --num-shard 8 --dtype bfloat16

That gave the same result that I provided in the description: {"generated_text":"\nDeep Learning is a subset of Machine Learning"}.

Also, to check other quantization, I tried eetq with:

docker run --gpus all --shm-size 1g -p 9080:80 -v /home/user/.cache/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:latest --model-id HuggingFaceM4/idefics-80b-instruct --sharded true --num-shard 4 --quantize eetq

That gave the same result that I provided in the description: {"generated_text":""}.

I also made sure that it's the latest Docker image (sha256:298404d862962d6d497a1d1aebb433a37d4e6fae96f2356c2f38b64ee844f15c) and the latest model revision (a14d258b1be2a74a3604483de552c33121a98391).

Regarding "users shouldn't have to guess if it's float16 or bfloat16 that they should use (and some models are better in float16 for stability, probably idefics too)": I agree it's quite inconvenient, but I'm not sure that's true for IDEFICS. The current implementation explicitly states the opposite (see here):

9b seems to work correctly enough in float16, but 80b seems to be really saturating for f16.

I have also seen cases where bfloat16 certainly worked better than float16.

So, please let me know your testing procedure, in case I missed anything important.

@vakker (Contributor, Author) commented Nov 15, 2023

@Narsil do you need more information to narrow down the issue?

@Narsil (Collaborator) commented Nov 23, 2023

Sorry, very busy elsewhere these last few weeks.

I was indeed able to reproduce it today, although I'm sure I also tested it last time and couldn't find any issue. Could be some CUDA version difference or something.

In any case, 80B does seem to suffer in float16 at inference, at least in some circumstances, so I'm happy to change the default.
I modified your PR so that users can still choose which dtype they want, but the default is now bfloat16, so it shouldn't require any configuration to get correct behavior.
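
Roughly, the merged change amounts to something like the following (an illustrative sketch only; the actual code lives in the IDEFICS loading path and may differ in detail):

from typing import Optional
import torch

def idefics_dtype(dtype: Optional[torch.dtype]) -> torch.dtype:
    # An explicit --dtype still wins, but when nothing is specified
    # IDEFICS now defaults to bfloat16 instead of float16.
    return dtype if dtype is not None else torch.bfloat16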

@Narsil merged commit c6bb767 into huggingface:main on Nov 23, 2023 (2 of 6 checks passed)
@vakker (Contributor, Author) commented Nov 26, 2023

@Narsil thanks for looking into it and verifying the issue.

As I mentioned in the original post:

The reason for "forcing" this in the IDEFICS init method, is because if quantization is used, then the dtype cannot be set explicitly. And since it's left as None, it's set to float16 by default here.

In other words, the way you changed it will still not work if any of the quantization options are used, so this issue is still not resolved.

You can use the same reproduction commands that I included before to verify this.

@Narsil (Collaborator) commented Nov 27, 2023

Indeed, the default is overridden earlier: #1287

@Narsil (Collaborator) commented Nov 27, 2023

The problem is worse than I thought.

9B gives garbage with bfloat16 and works fine with float16
80B is the other way around.

Quantization only works with float16 (definitely the case for all quantization methods aside from possibly bnb, though bnb is slow). It seems like the previous version of the code was the best.

I'll run a few more experiments to find a good default that fits both cases, without hard-coding the model name if possible.

@vakker (Contributor, Author) commented Nov 29, 2023

9B gives garbage with bfloat16 and works fine with float16

Could you provide the exact steps that you used to test that? I cannot reproduce it; for me, 9B works as expected with bfloat16.
I used:

docker run --gpus all --shm-size 1g -p 9080:80 -v /home/user/.cache/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:latest --model-id HuggingFaceM4/idefics-9b-instruct --dtype bfloat16

Then

curl localhost:9080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":10}}' -H 'Content-Type: application/json'

Which returned:

{"generated_text":"\n\nDeep learning is a subset of machine"}

PR #1287 helps; at least the default dtype is no longer forced in cli.py.

@vakker deleted the fix-idefics-dtype branch on November 29, 2023 at 11:27