4.47.1 Hugging Face Trainer loss accumulated by sum instead of mean #35556
Comments
Can you try with the main branch of transformers? We fixed a couple of issues with gradient accumulation, so your issue might be solved. Otherwise, can you share a minimal reproducer? Thanks!
A simple way to reproduce would be using llamafactory:
And test for it. However, I cannot test it.
I also encountered the same problem in this version: when adjusting gradient_accumulation_steps, the loss scales accordingly.
I also encountered this problem, but I found that for me, it doesn't happen with all the models. It happens with "facebook/opt-350m" and "facebook/opt-1.3b", but not with "TinyLlama/TinyLlama_v1.1". The difference may be that opt is in 16 bits while TinyLlama is in 32 bits. I'm using a Titan RTX. I updated to 4.48.0 and the issue seems to be corrected for opt, but not for other models like "projecte-aina/FLOR-760M".
I think the bug is going to be fixed after #35651 is merged.
System Info
Version:
transformers==4.47.1
When I train a model using this version of the Hugging Face Trainer, the loss is accumulated by sum instead of mean. Put more precisely, the
tr_loss_step
is not divided by the global batch size. That means the reported loss scales proportionally with the global batch size. When I change the transformers version to 4.44.0, this problem is gone and everything works fine.
My global batch size is set to 128 here. You can see from the above image that
tr_loss_step = original_loss / 128
, which is not the case when the transformers version is 4.47.1, where
tr_loss_step = original_loss
.
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
transformers==4.47.1
Expected behavior
The loss should be accumulated by mean instead of sum.