
tokenizer.decode() and tokenizer.convert_ids_to_tokens() return different results #35641

thangld201 opened this issue Jan 12, 2025 · 9 comments

thangld201 commented Jan 12, 2025

System Info

In [2]: tokenizers.__version__, transformers.__version__
Out[2]: ('0.21.0', '4.47.0')

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

# Token id 124 maps to the byte-level token 'À' in both vocabularies
token_id = 124
in_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
print([in_tokenizer.convert_ids_to_tokens([token_id])[0], in_tokenizer.decode(token_id)])

token_id = 124
in_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print([in_tokenizer.convert_ids_to_tokens(token_id), in_tokenizer.decode(token_id)])

Expected behavior

# Output
['À', '�']
['À', '�']

# Expected Output
['À', 'À']
['À', 'À']
@sambhavnoobcoder

This is an issue with the Qwen tokenizer and the discrepancy in the way it treats special characters. I have made a fix and pushed a PR in #35643. @thangld201 you can give it a look, and maintainers @Rocketknight1 @ArthurZucker please also take a look; reviews are appreciated.

@tdard

tdard commented Jan 13, 2025

As of right now, you will observe the discrepancy only on tokens whose underlying bytes are not valid UTF-8 on their own (here, 'À' is the byte 0xC0, a valid Latin-1 character but an incomplete UTF-8 sequence).

The culprit is the following method of the Qwen2Tokenizer that's being called in the decode method:

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        text = "".join(tokens)
        text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)
        return text

I'd say this could be fixed by following a similar approach to other tokenizers such as Llama's, but I'm not so sure about the approach. I'd be willing to help if someone knowledgeable could confirm whether that's correct!
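For reference, the behavior can be reproduced standalone, without loading the actual tokenizer. The sketch below is an assumption-laden re-implementation of the GPT-2-style `bytes_to_unicode` table (which these byte-level tokenizers are believed to share) and the `convert_tokens_to_string` logic quoted above; it is not the actual Qwen code path:

```python
# Sketch of the GPT-2-style byte-to-unicode mapping (assumption: Qwen
# and Llama 3 use the same scheme as transformers' bytes_to_unicode).
def bytes_to_unicode():
    # Visible bytes map to themselves; the remaining bytes are shifted
    # into the U+0100+ range so every byte has a printable stand-in.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

def convert_tokens_to_string(tokens, errors="replace"):
    # Mirrors the Qwen2Tokenizer method quoted above
    text = "".join(tokens)
    return bytearray(byte_decoder[c] for c in text).decode("utf-8", errors=errors)

# The UTF-8 encoding of 'À' is b'\xc3\x80': two byte-level tokens.
# Together they decode cleanly:
print(convert_tokens_to_string([byte_encoder[0xC3], byte_encoder[0x80]]))  # 'À'

# But the single token string 'À' stands for the lone byte 0xC0, an
# incomplete UTF-8 sequence, so decoding yields the replacement char:
print(convert_tokens_to_string(["À"]))  # '�'
```

This shows the token string 'À' and the character 'À' are different things: the former is a stand-in for one raw byte, which only becomes text when combined with neighboring tokens.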

@thangld201
Author

@sambhavnoobcoder @tdard This issue also happens with Llama 3.1; I updated the example code. It is probably also present in other models (but I haven't confirmed yet).

@ArthurZucker
Collaborator

Hey, this is not an issue; it's expected behavior.

@ArthurZucker
Collaborator

These tokenizers use ByteLevel encoding, meaning the tokens are byte representations that are not necessarily valid UTF-8 on their own. So when you call convert_ids_to_tokens you get the string representation of those bytes, which is not always possible to decode as UTF-8. That's why decode falls back to the replacement character 😉
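A minimal stdlib-only illustration of this point (assuming, as in the reproduction above, that token id 124 maps to the byte-level token 'À', i.e. the raw byte 0xC0):

```python
# On its own, 0xC0 is an invalid UTF-8 lead byte, so decoding must
# substitute the replacement character U+FFFD:
print(bytes([0xC0]).decode("utf-8", errors="replace"))  # '�'

# The actual character 'À' is the two-byte UTF-8 sequence 0xC3 0x80,
# so the byte behind this token is only half of a character:
print(bytes([0xC3, 0x80]).decode("utf-8"))  # 'À'
```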

@thangld201
Author

@ArthurZucker Isn't the output misleading? Those token ids map to valid characters, but when decoded they all turn into weird replacement characters.

@ArthurZucker
Collaborator

They are not valid characters; they are string representations of raw bytes. The token is not actually À. I am not sure how misleading it is; we can add documentation about it if you want, but the fact of the matter is that those characters are the same as what b'\x80\x81\x82'.decode(errors="replace") would give: '���'

@thangld201
Author

@ArthurZucker Thank you for clarifying! But why don't we convert them to UTF-8 in .decode()? Why display '�'?

@ArthurZucker
Collaborator

Again, because decode outputs text, not string representations of tokens. If you want to display the representation, then you use convert_ids_to_tokens.
