Llama tokenizer newline character inconsistency #35923

ingo-m · 2025-01-28T03:36:35Z

transformers version: 4.48.1
Platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.39
Python version: 3.12.8
Huggingface_hub version: 0.27.1
Safetensors version: 0.5.2
Accelerate version: 1.3.0
Accelerate config: not found
PyTorch version (GPU?): 2.5.1+cu124 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?: no

This issue relates to tokenizers: @ArthurZucker and @itazap

The Llama 3 documentation says that "Newlines (0x0A) are part of the prompt format", so I guess this is important when tokenizing.

Source:
https://www.llama.com/docs/model-cards-and-prompt-formats/meta-llama-3/#llama-3-instruct

I observed the following:

from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Normal string

newline_token = tokenizer("\n")["input_ids"]
print(newline_token)
# Single newline in normal string:
# [128000, 198]

newline_token = tokenizer("example\nexample")["input_ids"]
print(newline_token)
# Newline with context in normal string:
# [128000, 8858, 198, 8858]

newline_token = tokenizer("\n\n")["input_ids"]
print(newline_token)
# Two newlines in normal string:
# [128000, 271]

newline_token = tokenizer("\\n")["input_ids"]
print(newline_token)
# Double-escaped newline in normal string:
# [128000, 1734]

newline_token = tokenizer("example\\nexample")["input_ids"]
print(newline_token)
# Double-escaped newline in normal string with context:
# [128000, 8858, 1734, 8858]

# Raw string

newline_token = tokenizer(r"\n")["input_ids"]
print(newline_token)
# Single newline in raw string:
# [128000, 1734]

newline_token = tokenizer(r"\n\n")["input_ids"]
print(newline_token)
# Double newline in raw string:
# [128000, 1734, 1734]

newline_token = tokenizer(r"example\nexample")["input_ids"]
print(newline_token)
# Newline in raw string with context:
# [128000, 8858, 1734, 8858]

Now I'm really wondering what's the correct way to represent newlines for a Llama tokenizer.

Using a "normal" string, two consecutive newlines ("\n\n") are represented as a single token, 271, whereas a single "\n" is token 198. The prompt template from the website contains double newlines, but it is not clear how they get tokenized.
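
One way to picture how "\n\n" can come out as one ID instead of 198 twice is that a vocabulary may hold multi-character entries. The sketch below is a toy greedy longest-match lookup, not the actual Llama 3 algorithm. The function name toy_encode and the four-entry vocabulary are made up for illustration; only the IDs are copied from the outputs above, and the leading 128000 is left out.

```python
# Toy vocabulary (hypothetical): IDs copied from the outputs observed above.
TOY_VOCAB = {
    "\n\n": 271,       # two line feeds stored as one entry
    "\n": 198,         # a single line feed
    "\\n": 1734,       # backslash followed by the letter n
    "example": 8858,
}


def toy_encode(text, vocab=TOY_VOCAB):
    """Greedy longest-match lookup; an illustration only, not the real tokenizer."""
    ids = []
    longest = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                ids.append(vocab[piece])
                i += size
                break
        else:
            raise ValueError(f"no toy vocabulary entry at position {i}")
    return ids


print(toy_encode("\n"))
print(toy_encode("\n\n"))
print(toy_encode("example\nexample"))
print(toy_encode(r"\n\n"))
```

With this toy vocabulary the four calls reproduce the ID patterns listed above (minus 128000), which at least shows that 198 vs. 271 need not be an inconsistency in itself.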

The raw string r"\n" seems more consistent: the newline is represented as token 1734 irrespective of context.
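
A quick check that needs no tokenizer at all is to look at which characters each Python literal actually contains. This is a minimal sketch using only built-ins; the helper name describe is made up for illustration. It shows that "\n" is the single character 0x0A, while r"\n" and "\\n" are the same two-character string (a backslash followed by the letter n), which would be consistent with both of them yielding 1734 above.

```python
normal = "\n"     # escape sequence: one line feed character
escaped = "\\n"   # escaped backslash: backslash, then the letter n
raw = r"\n"       # raw string: escapes are not processed


def describe(s):
    """Return the code point of every character in s as a hex string."""
    return [hex(ord(ch)) for ch in s]


print(describe(normal))   # ['0xa']
print(describe(escaped))  # ['0x5c', '0x6e']
print(describe(raw))      # ['0x5c', '0x6e']
print(raw == escaped)     # True
```

So the 0x0A mentioned in the Llama 3 documentation corresponds to the character in the "normal" string, not to the two characters in the raw string.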

So is r"\n" the way the Llama 3 tokenizer expects newlines? Or do the double newlines in the prompt template correspond to token 271?

Any insights into this would be much appreciated. Thanks.

Related (but closed) issue: #31030


ingo-m · 2025-01-28T12:09:19Z

Another related issue: huggingface/transformers.js#1019
