4.47.1 Hugging Face Trainer loss accumulated by sum instead of mean #35556

Open
4 tasks
jdf-prog opened this issue Jan 7, 2025 · 6 comments

@jdf-prog

jdf-prog commented Jan 7, 2025

System Info

Version: transformers==4.47.1

When I train a model with this version of the Hugging Face Trainer, the loss is accumulated by sum instead of mean. More precisely, tr_loss_step is not divided by the global batch size, so the reported loss scales proportionally with the global batch size.

image

When I change the transformers version to 4.44.0, the problem disappears and everything works as expected.
image

My global batch size is set to 128 here. You can see from the image above that with 4.44.0, tr_loss_step = original_loss / 128. This is not the case with transformers 4.47.1, where tr_loss_step = original_loss.
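To make the arithmetic concrete, here is a minimal sketch in plain Python of the difference between the two behaviours. It is not the Trainer's actual code: the function name `accumulate_loss`, the `normalize` flag, and the loss values are all hypothetical and chosen only for illustration.

```python
# Illustrative only: plain-Python arithmetic, not the Trainer's implementation.

def accumulate_loss(micro_batch_losses, normalize):
    """Combine per-micro-batch losses over one optimizer step.

    normalize=True divides each micro-batch loss by the number of
    accumulation steps (mean); normalize=False simply adds them up (sum).
    """
    steps = len(micro_batch_losses)
    total = 0.0
    for loss in micro_batch_losses:
        total += loss / steps if normalize else loss
    return total


# Hypothetical values: 4 accumulation steps, each micro-batch loss is 0.5.
losses = [0.5, 0.5, 0.5, 0.5]
mean_loss = accumulate_loss(losses, normalize=True)
sum_loss = accumulate_loss(losses, normalize=False)
print(mean_loss, sum_loss)
```

With summation, the reported value grows linearly with the number of accumulated micro-batches, which matches the symptom described above: the logged loss scales with the global batch size.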

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Any training with transformers==4.47.1

Expected behavior

The loss should be accumulated by mean, not by sum.

@jdf-prog jdf-prog added the bug label Jan 7, 2025
@Rocketknight1
Member

cc @SunMarc @muellerzr

@SunMarc
Member

SunMarc commented Jan 9, 2025

Can you try with the main branch of transformers? We fixed a couple of gradient accumulation issues, so your issue might be solved. Otherwise, can you share a minimal reproducer? Thanks!

@jdf-prog
Author

jdf-prog commented Jan 9, 2025

> Can you try with the main branch of transformers? We fixed a couple of gradient accumulation issues, so your issue might be solved. Otherwise, can you share a minimal reproducer? Thanks!

A simple reproduction would be to use LLaMA-Factory:

git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .
llamafactory-cli train examples/train_lora/llama3_lora_dpo.yaml

Then compare transformers==4.47.1 with transformers<=4.45.2; the former should report a higher loss than the latter.

However, I cannot test transformers==4.48.0 from the main branch, since there seem to be a lot of changes in the modeling_llama file.

@fan-yang1

I also encountered the same problem in this version. When I adjust gradient_accumulation_steps, the loss scales accordingly.

@paussus

paussus commented Jan 14, 2025

I also encountered this problem, but for me it doesn't happen with all models. It happens with "facebook/opt-350m" and "facebook/opt-1.3b", but not with "TinyLlama/TinyLlama_v1.1". The difference may be that OPT is in 16-bit precision while TinyLlama is in 32-bit. I'm using a Titan RTX.

I updated to 4.48.0 and the issue seems to be fixed for OPT, but not for other models such as "projecte-aina/FLOR-760M".

@techkang
Contributor

I think the bug will be fixed once #35651 is merged.
