tokenizer.decode() and tokenizer.convert_ids_to_tokens() return different results #35641
Comments
This is an issue with the Qwen tokenizer and a discrepancy in the way it treats special characters. I have made a fix and pushed a PR in #35643. @thangld201, you can give it a look; maintainers @Rocketknight1 @ArthurZucker, please also take a look, as reviews are appreciated.
As of right now, you will observe the discrepancy only on other characters than The culprit is the following method of the

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        text = "".join(tokens)
        text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)
        return text

I'd say this could be fixed by following a similar approach to other tokenizers such as Llama's, but I'm not so sure about the approach. I'd be willing to help if someone knowledgeable could pinpoint whether that's correct or not!
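For anyone following along, here is a minimal standalone sketch of what that method does. It assumes `byte_decoder` is the inverse of the GPT-2 style byte-to-unicode table and that `errors` is `"replace"`; neither is confirmed in this thread, and the module-level names below are illustrative rather than the library's own code.

```python
def bytes_to_unicode():
    """GPT-2 style table: map each of the 256 byte values to a printable character."""
    # Bytes that are already printable keep their own code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte (control characters, space, ...) is shifted past U+00FF.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}


def convert_tokens_to_string(tokens, errors="replace"):
    """Same logic as the method quoted above, written as a free function (assumption)."""
    text = "".join(tokens)
    return bytearray(byte_decoder[c] for c in text).decode("utf-8", errors=errors)


# "é" is two UTF-8 bytes (0xC3 0xA9), so it is displayed as two stand-in characters.
e_acute_tokens = [byte_encoder[b] for b in "é".encode("utf-8")]
print(e_acute_tokens)                                    # ['Ã', '©']
print(convert_tokens_to_string(e_acute_tokens))          # é

# A token that holds only part of a character cannot be decoded on its own.
print(convert_tokens_to_string(["Ã"]) == "\ufffd")       # True
# The token string "À" stands for the raw byte 0xC0, which is never valid UTF-8 alone.
print(convert_tokens_to_string(["À"]) == "\ufffd")       # True
```

So the string shown for a token is a printable stand-in for raw bytes, and decoding only succeeds when the bytes together form a complete UTF-8 sequence.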
@sambhavnoobcoder @tdard This issue also happens with Llama 3.1; I updated the example code. It is probably also present in other models (but I haven't confirmed yet).
Hey, this is not an issue, it's expected.
These tokenizers use
@ArthurZucker Isn't the output misleading? Those token ids map to valid characters, but when decoded they all turn into weird replacement characters.
They are not valid characters; they are valid UTF-8 representations of characters. But the token is not actually À. I am not sure how misleading it is; we can add documentation about it if you want, but the fact of the matter is that those characters are the same as what
@ArthurZucker Thank you for clarifying! But why don't we convert them to UTF-8 in .decode()? Why display '�'?
Again, because decode outputs
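To illustrate the question about '�' with plain Python: the sketch below uses only the standard library and is not how `decode()` is implemented. It assumes the caller has raw byte pieces in hand (the helper names are hypothetical). Decoding each piece in isolation has to emit a replacement character for every incomplete sequence, while an incremental decoder can hold partial bytes back until the character is complete.

```python
import codecs


def decode_per_piece(pieces):
    """Decode every byte piece on its own; incomplete sequences become U+FFFD."""
    return [p.decode("utf-8", errors="replace") for p in pieces]


def decode_streaming(pieces):
    """Decode pieces in order, buffering incomplete UTF-8 sequences between calls."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    out = [decoder.decode(p) for p in pieces]
    # Flush: anything still buffered at the end is genuinely incomplete.
    out.append(decoder.decode(b"", final=True))
    return out


raw = "é".encode("utf-8")        # b'\xc3\xa9'
pieces = [raw[:1], raw[1:]]      # the character split across two pieces

print(decode_per_piece(pieces) == ["\ufffd", "\ufffd"])  # True
print("".join(decode_streaming(pieces)))                 # é
```

In practice this means that decoding all ids of a sequence together recovers the text, whereas decoding ids one at a time can show '�' for any token that carries only part of a multi-byte character.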