I’m working with a LLaMA-based model that has a LoRA (Low-Rank Adaptation) adapter applied, and I’m using beam search in Transformers. I’m trying to debug how the final beam scores are computed, because the step-by-step log probabilities I print out are far more negative than the final “sequence score” reported by Hugging Face.
Below is a sample of my debug output for 4 beams, each showing:
Generated Sequence (token IDs, excluding the prompt/input).
Generated Text (decoded).
Step-by-Step Analysis: Each newly generated token’s log probability.
HF Cumulative Sequence Score (final beam score from generation_output.sequences_scores).
Debug Info (lengths, how many log-prob steps were used vs. available).
Final Scores:
HF Cumulative Sequence Score: -1.464120
The Question
How does Hugging Face’s beam search compute the final scores (e.g., −0.247081, −0.323745, −1.447294, −1.464120) given the very negative individual log probabilities?
For example, for the first beam, I expected a score of (-0.741240 - 28.383789 - 32.667973) / 3 = -20.597667, i.e. the summed token log-probs divided by the generated length, since I did not set a length_penalty. However, the final sequences_scores from HF differ significantly from any straightforward summation of the listed token log-probs, even when accounting for a length_penalty.
Can someone help clarify how these scores are calculated?
Reproduction
GENERATION CODE :
------------------------------------------------------------------------------------------------------------------------
import torch
from peft import PeftModel
from transformers import AutoTokenizer, GenerationConfig, LlamaForCausalLM

model_name = "./Llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=False,
    torch_dtype=torch.float16,
    device_map='auto',
)

# Load the LoRA adapter on top of the base model
adaptor_path = './model_spec/checkpoints/checkpoint-200'
model = PeftModel.from_pretrained(
    model,
    adaptor_path,
    torch_dtype=torch.float16,
)
model.eval()

message = "Lady Sold Children's Clothes That She Don't Send!"
input_raw = "Message: {message}"
user_input = input_raw.format(message=message)
instruction = "Does this customer-reported message indicate an AUP violation from the following categories? \n[A, B, C]\nIf yes, respond 'AUP'; if not, respond 'Others'."
# Plain (non-f) string, so that .format() below is what fills the placeholders
prompt_template = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
prompt = prompt_template.format(instruction=instruction, input=user_input)

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to('cuda')

# temperature/top_p/top_k are sampling settings; do_sample is not set here,
# so the run is plain beam search
generation_config = GenerationConfig(
    temperature=0,
    top_p=1,
    top_k=-1,
    num_beams=4,  # Number of beams for beam search
    num_return_sequences=4,  # Return all beams
)

generate_params = {
    "input_ids": input_ids,
    "generation_config": generation_config,
    "return_dict_in_generate": True,
    "output_scores": True,
    "max_new_tokens": 128,
}

with torch.no_grad():
    generation_output = model.generate(**generate_params)

# Decode the top beam and keep only the assistant's reply
s = generation_output.sequences[0]
output = tokenizer.decode(s, skip_special_tokens=True)
result = output.split('assistant')[1].strip()
DECODE CODE :
import torch
import torch.nn.functional as F

def analyze_beams(
    generation_output,
    tokenizer,
    input_ids,
    end_of_text_id=128001,
    length_penalty=1.0,
    ignore_after_first_eos=False
):
    """
    Analyzes final beams from a Hugging Face generation output.
    1) Excludes the original input tokens, only focusing on "newly generated" tokens.
    2) Prints step-by-step tokens (ID & text) + log-probs.
    3) Applies optional length penalty for the final "calculated score."
    4) Optionally stops counting tokens after first <eos> if 'ignore_after_first_eos=True'.
    :param generation_output: Object with attributes:
        - sequences: final beam sequences (tensor shape [num_beams, total_seq_len])
        - sequences_scores: final HF beam scores
        - scores: list of per-step logits ([num_steps], each shape [num_beams, vocab_size])
    :param tokenizer: A Hugging Face tokenizer to decode tokens into text.
    :param input_ids: The original input_ids (so we can know how many tokens to skip).
    :param end_of_text_id: The <eos> or <end_of_text> token ID (default=128001).
    :param length_penalty: Exponent for length normalization.
    :param ignore_after_first_eos: If True, we ignore any tokens after the first <eos>.
    """
    # 1) Determine how many input tokens to skip
    input_length = len(input_ids[0])  # e.g. shape [batch_size, seq_len]
    print("\n=== HuggingFace Beam Analysis (Generated Tokens Only) ===")
    print(f"Input sequence length: {input_length}")

    # 2) Convert generation_output.scores into shape [num_beams, steps, vocab_size]
    logits = torch.stack(generation_output.scores, dim=1)  # shape [num_beams, steps, vocab_size]
    log_probs = F.log_softmax(logits, dim=-1)  # shape [num_beams, steps, vocab_size]
    beam_sequences = generation_output.sequences
    beam_scores = generation_output.sequences_scores
    num_beams = beam_sequences.shape[0]
    steps_available = log_probs.shape[1]
    vocab_size = log_probs.shape[2]

    # 3) Analyze each beam
    for beam_idx in range(num_beams):
        print(f"\n--- Beam {beam_idx + 1} ---")
        # Slice out only the newly generated portion (excluding input)
        full_sequence = beam_sequences[beam_idx]
        generated_sequence = full_sequence[input_length:]  # This is your "generated" part
        # Decode text
        generated_text = tokenizer.decode(generated_sequence, skip_special_tokens=True)
        print(f"Generated Sequence (IDs): {generated_sequence.tolist()}")
        print(f"Generated Text: {generated_text}")
        print("\nStep-by-Step Analysis:")
        beam_score_sum = 0.0
        used_step_count = 0

        # We'll iterate over each newly generated token
        for step_idx, token_id in enumerate(generated_sequence.tolist()):
            if step_idx >= steps_available:
                # We've run out of log_probs steps
                break
            # Retrieve distribution for this beam at this step, shape [vocab_size].
            # NOTE: this assumes row `beam_idx` of every step belongs to final beam
            # `beam_idx`, which is part of what I am unsure about.
            token_log_probs = log_probs[beam_idx, step_idx]
            # The log-prob for the chosen token_id
            token_logp = token_log_probs[token_id].item()
            # Accumulate beam score
            beam_score_sum += token_logp
            used_step_count += 1
            # Print step info
            token_text = tokenizer.decode([token_id], skip_special_tokens=True)
            print(
                f"Step {step_idx + 1}: "
                f"Token='{token_text}' (ID={token_id}), LogProb={token_logp:.6f}"
            )
            # If ignoring repeated <eos>, we break after the first <eos> token
            if ignore_after_first_eos and token_id == end_of_text_id:
                break

        # 4) Apply length penalty
        # If all tokens are used, used_step_count is the length; otherwise we truncated early
        final_len = used_step_count if used_step_count > 0 else 1
        calculated_score = beam_score_sum / (final_len ** length_penalty)

        # 5) Print results
        print("\nFinal Scores:")
        # Show Hugging Face's final beam score
        hf_score = beam_scores[beam_idx].item()
        print(f" HF Cumulative Sequence Score: {hf_score:.6f}")
        print(f" Calculated Score: {calculated_score:.6f}")
        print("\nDebug Info:")
        print(f" Full sequence length: {len(full_sequence)} (including input)")
        print(f" Generated sequence length: {len(generated_sequence)}")
        print(f" Steps of log_probs used: {used_step_count}")
        print(f" Steps of log_probs avail: {steps_available}")
        print(f" Vocab size: {vocab_size}")
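For comparison with the decode code above: Transformers provides `compute_transition_scores` on generation-capable models, which takes `sequences`, `scores` and `beam_indices` and returns the per-step score of each token that was actually selected. The sketch below is one possible way to rebuild the sequence scores from it, not a confirmed fix. It assumes the output was produced with `return_dict_in_generate=True` and `output_scores=True` so that `beam_indices` is populated, that the PEFT wrapper forwards the method to the base model, and that counting negative entries gives the generated length (the pattern used in the library documentation). The helper name `reconstruct_sequence_scores` is hypothetical.

```python
def reconstruct_sequence_scores(model, generation_output, length_penalty=1.0):
    """Rebuild one length-normalized score per returned beam (sketch).

    Assumes `generation_output` carries `sequences`, `scores` and
    `beam_indices`, i.e. it came from beam search with
    return_dict_in_generate=True and output_scores=True.
    """
    # Per-token scores of the selected tokens, following each hypothesis
    # through the beam slots it occupied at every step.
    transition_scores = model.compute_transition_scores(
        generation_output.sequences,
        generation_output.scores,
        generation_output.beam_indices,
        normalize_logits=False,
    )

    reconstructed = []
    for row in transition_scores.tolist():
        # Assumption: padded positions are 0 and real steps are negative,
        # so a token with a log-prob of exactly 0 would be miscounted.
        steps = [s for s in row if s < 0]
        length = max(len(steps), 1)
        reconstructed.append(sum(steps) / (length ** length_penalty))
    return reconstructed
```

Calling `reconstruct_sequence_scores(model, generation_output)` would return one value per returned beam, which can then be compared with `generation_output.sequences_scores`.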
Expected behavior
Expected a score of (-0.741240 - 28.383789 - 32.667973) / 3 = -20.597667 for the first beam (summed token log-probs divided by the generated length), since I did not set a length_penalty.
I’ve been analyzing the beam search decoding process and noticed an inconsistency. When I manually construct a sequence using the highest cumulative log probabilities from the top-k tokens at each step, it does not match the model’s final generated output. Additionally, some words in the generated output are not even present in the top-k tokens. Also, the cumulative log probability of the model’s output is lower than the manually computed one. Could other hidden factors be influencing this? Any insights would be appreciated.
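One factor that could explain this kind of mismatch is beam reordering: at each step the surviving hypotheses can move between beam slots, so row i of one step's scores need not belong to the same hypothesis as row i of the next step. The toy example below uses made-up numbers (2 beams, 2 steps, 3 tokens) to show how reading the scores by final row differs from following the recorded beam indices. All names and values are hypothetical and only illustrate the idea, not the library's internals.

```python
# step_log_probs[t][row][token]: per-step log-prob of `token`, given the
# hypothesis that occupied beam slot `row` at step t (toy values).
step_log_probs = [
    [[-0.5, -1.0, -3.0], [-0.5, -1.0, -3.0]],  # step 0: both rows see the same prompt
    [[-2.0, -2.5, -1.5], [-0.1, -3.0, -4.0]],  # step 1: row 0 follows token 0, row 1 follows token 1
]

# Final hypotheses, best first, and the beam slot each one was read from per step.
final_sequences = [[1, 0], [0, 2]]
beam_indices = [[0, 1], [0, 0]]

def score_by_final_row(seq_idx):
    """Naive: always read row `seq_idx`, ignoring beam reordering."""
    return sum(
        step_log_probs[t][seq_idx][tok]
        for t, tok in enumerate(final_sequences[seq_idx])
    )

def score_by_beam_indices(seq_idx):
    """Follow the slot the hypothesis actually occupied at each step."""
    return sum(
        step_log_probs[t][beam_indices[seq_idx][t]][tok]
        for t, tok in enumerate(final_sequences[seq_idx])
    )

naive = [score_by_final_row(i) for i in range(len(final_sequences))]
tracked = [score_by_beam_indices(i) for i in range(len(final_sequences))]
print("naive  :", naive)
print("tracked:", tracked)
```

In this toy setup the naive reading gives more negative totals than the tracked one, because it mixes rows that belong to different hypotheses.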
System Info
Hello Hugging Face community,
=== HuggingFace Beam Analysis (Generated Tokens Only) ===
Input sequence length: 148
--- Beam 1 ---
Generated Sequence (IDs): [32, 3202, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001]
Generated Text: AUP
Step-by-Step Analysis:
Step 1: Token='A' (ID=32), LogProb=-0.741240
Step 2: Token='UP' (ID=3202), LogProb=-28.383789
Step 3: Token='' (ID=128001), LogProb=-32.667973
Final Scores:
HF Cumulative Sequence Score: -0.247081
--- Beam 2 ---
Generated Sequence (IDs): [51154, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001]
Generated Text: Others
Step-by-Step Analysis:
Step 1: Token='Others' (ID=51154), LogProb=-0.647490
Step 2: Token='' (ID=128001), LogProb=-29.399292
Final Scores:
HF Cumulative Sequence Score: -0.323745
--- Beam 3 ---
Generated Sequence (IDs): [32, 3202, 320, 6546, 1428, 11, 10984, 49541, 13, 15388, 3298, 8, 128001]
Generated Text: AUP (CSAM, Encourg. Illegal Act)
Step-by-Step Analysis:
Step 1: Token='A' (ID=32), LogProb=-0.741240
Step 2: Token='UP' (ID=3202), LogProb=-20.869020
Step 3: Token=' (' (ID=320), LogProb=-9.416358
Step 4: Token='CS' (ID=6546), LogProb=-19.269587
Step 5: Token='AM' (ID=1428), LogProb=-23.486216
Step 6: Token=',' (ID=11), LogProb=-10.883574
Step 7: Token=' Enc' (ID=10984), LogProb=-0.144973
Step 8: Token='ourg' (ID=49541), LogProb=-0.001301
Step 9: Token='.' (ID=13), LogProb=-0.001659
Step 10: Token=' Illegal' (ID=15388), LogProb=-20.425816
Step 11: Token=' Act' (ID=3298), LogProb=-14.907486
Step 12: Token=')' (ID=8), LogProb=-0.150186
Step 13: Token='' (ID=128001), LogProb=-17.213655
Final Scores:
HF Cumulative Sequence Score: -1.447294
--- Beam 4 ---
Generated Sequence (IDs): [32, 3202, 320, 6546, 1428, 11, 10984, 49541, 13, 15388, 3298, 6266, 128001]
Generated Text: AUP (CSAM, Encourg. Illegal Act.)
Step-by-Step Analysis:
Step 1: Token='A' (ID=32), LogProb=-0.741240
Step 2: Token='UP' (ID=3202), LogProb=-28.162111
Step 3: Token=' (' (ID=320), LogProb=-10.757921
Step 4: Token='CS' (ID=6546), LogProb=-6.859391
Step 5: Token='AM' (ID=1428), LogProb=-20.384962
Step 6: Token=',' (ID=11), LogProb=-15.148496
Step 7: Token=' Enc' (ID=10984), LogProb=-0.298849
Step 8: Token='ourg' (ID=49541), LogProb=-18.535187
Step 9: Token='.' (ID=13), LogProb=-0.006747
Step 10: Token=' Illegal' (ID=15388), LogProb=-14.434349
Step 11: Token=' Act' (ID=3298), LogProb=-12.582914
Step 12: Token='.)' (ID=6266), LogProb=-12.790556
Step 13: Token='' (ID=128001), LogProb=-20.104782
Final Scores:
HF Cumulative Sequence Score: -1.464120
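A quick arithmetic check on the numbers above: for beams 1 and 2, dividing only the first token's log-prob by the number of generated tokens (counted up to and including the first 128001) reproduces the HF scores almost exactly. This is an observation from the printed values only. It would be consistent with the later tokens having true log-probs near zero and with the default `length_penalty=1.0` dividing the summed log-probs by the generated length, but neither is confirmed here.

```python
# Values copied from the debug output above.
first_token_logprob = {"beam1": -0.741240, "beam2": -0.647490}
generated_len = {"beam1": 3, "beam2": 2}  # tokens up to and including the first 128001
hf_score = {"beam1": -0.247081, "beam2": -0.323745}

def length_normalized(total_logprob, length, length_penalty=1.0):
    """Summed log-prob divided by length ** length_penalty."""
    return total_logprob / (length ** length_penalty)

# Absolute gap between the first-token-only estimate and the HF score.
gaps = {}
for beam in first_token_logprob:
    estimate = length_normalized(first_token_logprob[beam], generated_len[beam])
    gaps[beam] = abs(estimate - hf_score[beam])
    print(f"{beam}: estimate={estimate:.6f} hf={hf_score[beam]:.6f} gap={gaps[beam]:.2e}")
```

If that reading is right, the very negative step 2 and step 3 values in my printout would not be the log-probs beam search actually used for those tokens.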
Who can help?
@gante @ArthurZucker