
tokenizer.decode() and tokenizer.convert_ids_to_tokens() return different results #35641

thangld201 opened this issue Jan 12, 2025 · 9 comments

thangld201 commented Jan 12, 2025

System Info

In [2]: tokenizers.__version__, transformers.__version__
Out[2]: ('0.21.0', '4.47.0')

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

# Token id 124 maps to the byte-level token 'À' in both vocabularies
token_id = 124
in_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
print([in_tokenizer.convert_ids_to_tokens([token_id])[0], in_tokenizer.decode(token_id)])

token_id = 124
in_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print([in_tokenizer.convert_ids_to_tokens(token_id), in_tokenizer.decode(token_id)])

Expected behavior

# Output
['À', '�']
['À', '�']

# Expected Output
['À', 'À']
['À', 'À']
@sambhavnoobcoder

This is an issue with the Qwen tokenizer and the discrepancy in the way it treats special characters. I have made a fix and pushed a PR in #35643. @thangld201 you can give it a look, and maintainers @Rocketknight1 @ArthurZucker please also take a look; reviews are appreciated.

@tdard

tdard commented Jan 13, 2025

As of right now, you will observe the discrepancy only on tokens whose underlying bytes are not valid UTF-8 on their own (here, 'À' is the byte 0xC0, a valid Latin-1 character but an incomplete UTF-8 sequence).

The culprit is the following method of the Qwen2Tokenizer that's being called in the decode method:

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        text = "".join(tokens)
        text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)
        return text

I'd say this could be fixed by following a similar approach to other tokenizers such as Llama's, but I'm not so sure about the approach. I'd be willing to help if someone knowledgeable could confirm whether that's correct!
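For reference, the behavior can be reproduced standalone, without loading the actual tokenizer. The sketch below is an assumption-laden re-implementation of the GPT-2-style `bytes_to_unicode` table (which these byte-level tokenizers are believed to share) and the `convert_tokens_to_string` logic quoted above; it is not the actual Qwen code path:

```python
# Sketch of the GPT-2-style byte-to-unicode mapping (assumption: Qwen
# and Llama 3 use the same scheme as transformers' bytes_to_unicode).
def bytes_to_unicode():
    # Visible bytes map to themselves; the remaining bytes are shifted
    # into the U+0100+ range so every byte has a printable stand-in.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

def convert_tokens_to_string(tokens, errors="replace"):
    # Mirrors the Qwen2Tokenizer method quoted above
    text = "".join(tokens)
    return bytearray(byte_decoder[c] for c in text).decode("utf-8", errors=errors)

# The UTF-8 encoding of 'À' is b'\xc3\x80': two byte-level tokens.
# Together they decode cleanly:
print(convert_tokens_to_string([byte_encoder[0xC3], byte_encoder[0x80]]))  # 'À'

# But the single token string 'À' stands for the lone byte 0xC0, an
# incomplete UTF-8 sequence, so decoding yields the replacement char:
print(convert_tokens_to_string(["À"]))  # '�'
```

This shows the token string 'À' and the character 'À' are different things: the former is a stand-in for one raw byte, which only becomes text when combined with neighboring tokens.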

@thangld201
Author

@sambhavnoobcoder @tdard This issue also happens with Llama 3.1; I updated the example code. It is probably also present in other models (but I haven't confirmed yet).

@ArthurZucker
Collaborator

Hey, this is not an issue; it's expected behavior.

@ArthurZucker
Collaborator

These tokenizers use ByteLevel encoding, meaning the tokens are byte representations that are not necessarily valid UTF-8 on their own. So when you call convert_ids_to_tokens you get the string representation of those bytes, which is not always possible to decode as UTF-8. That's why decode falls back to the replacement character 😉
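A minimal stdlib-only illustration of this point (assuming, as in the reproduction above, that token id 124 maps to the byte-level token 'À', i.e. the raw byte 0xC0):

```python
# On its own, 0xC0 is an invalid UTF-8 lead byte, so decoding must
# substitute the replacement character U+FFFD:
print(bytes([0xC0]).decode("utf-8", errors="replace"))  # '�'

# The actual character 'À' is the two-byte UTF-8 sequence 0xC3 0x80,
# so the byte behind this token is only half of a character:
print(bytes([0xC3, 0x80]).decode("utf-8"))  # 'À'
```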

@thangld201
Author

@ArthurZucker Isn't the output misleading? Those token ids map to valid characters, but when decoded they all turn into weird replacement characters.

@ArthurZucker
Collaborator

They are not valid characters; they are string representations of raw bytes. The token is not actually À. I am not sure how misleading it is; we can add documentation about it if you want, but the fact of the matter is that those characters are the same as what b'\x80\x81\x82'.decode(errors="replace") would give: '���'

@thangld201
Author

@ArthurZucker Thank you for clarifying! But why don't we convert them to UTF-8 in .decode()? Why display '�'?

@ArthurZucker
Collaborator

Again, because decode outputs text, not string representations of tokens. If you want to display the representation, then you use convert_ids_to_tokens.
