```python
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Normal string
newline_token = tokenizer("\n")["input_ids"]
print(newline_token)
# Single newline in normal string:
# [128000, 198]

newline_token = tokenizer("example\nexample")["input_ids"]
print(newline_token)
# Newline with context in normal string:
# [128000, 8858, 198, 8858]

newline_token = tokenizer("\n\n")["input_ids"]
print(newline_token)
# Two newlines in normal string:
# [128000, 271]

newline_token = tokenizer("\\n")["input_ids"]
print(newline_token)
# Escaped newline (backslash + n) in normal string:
# [128000, 1734]

newline_token = tokenizer("example\\nexample")["input_ids"]
print(newline_token)
# Escaped newline in normal string with context:
# [128000, 8858, 1734, 8858]

# Raw string
newline_token = tokenizer(r"\n")["input_ids"]
print(newline_token)
# Single \n in raw string:
# [128000, 1734]

newline_token = tokenizer(r"\n\n")["input_ids"]
print(newline_token)
# Double \n in raw string:
# [128000, 1734, 1734]

newline_token = tokenizer(r"example\nexample")["input_ids"]
print(newline_token)
# \n in raw string with context:
# [128000, 8858, 1734, 8858]
```
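One thing worth noting before interpreting the IDs: in Python, "\\n" and r"\n" are the same two-character string (a backslash followed by the letter n), not a newline at all, which is why both tokenize to 1734. A quick check with plain Python, no model needed:

```python
# What Python actually passes to the tokenizer in each case
s_newline = "\n"   # one character: the newline byte 0x0A
s_escaped = "\\n"  # two characters: backslash + 'n'
s_raw = r"\n"      # also two characters: backslash + 'n'

print(len(s_newline), len(s_escaped), len(s_raw))  # 1 2 2
print(s_escaped == s_raw)                          # True
print(hex(ord(s_newline)))                         # 0xa
```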
Now I'm really wondering what the correct way to represent newlines for a Llama tokenizer is.

Using a "normal" string, two consecutive newlines ("\n\n") are tokenized as 271, vs. 198 for a single "\n". The prompt template from the website contains double newlines, but it is not clear how they get tokenized.

The raw string r"\n" seems more consistent: it is tokenized as 1734 irrespective of context.

So is r"\n" the way the Llama 3 tokenizer expects newlines? Or do the double newlines in the prompt template correspond to token 271?

Any insights into this would be much appreciated. Thanks.
transformers version: 4.48.1

This issue relates to tokenizers: @ArthurZucker and @itazap
The Llama 3 documentation says that "Newlines (0x0A) are part of the prompt format", so I guess this is important when tokenizing.
Source:
https://www.llama.com/docs/model-cards-and-prompt-formats/meta-llama-3/#llama-3-instruct
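For context on why a newline byte corresponds to a dedicated token at all: byte-level BPE tokenizers (GPT-2 style, which the Hugging Face Llama 3 tokenizer also follows, as far as I can tell) remap every byte to a printable unicode character, so 0x0A appears in the vocabulary as 'Ċ', and two consecutive newlines can merge into 'ĊĊ'. A minimal sketch of that remapping (the helper name bytes_to_unicode is just illustrative):

```python
# Sketch of the GPT-2-style byte-to-unicode mapping used by byte-level BPE
def bytes_to_unicode():
    # Printable bytes keep their own codepoint...
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    # ...while control/whitespace bytes are shifted to 256 and above
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

mapping = bytes_to_unicode()
print(mapping[0x0A])  # 'Ċ' -- newline as it appears in the vocab
print(mapping[0x20])  # 'Ġ' -- space as it appears in the vocab
```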
Related (but closed) issue: #31030