4.47.1 Hugging Face Trainer loss accumulated by sum instead of mean #35556
Comments
Can you try with the main branch of transformers? We fixed a couple of issues with gradient accumulation, so your issue might be solved. Otherwise, can you share a minimal reproducer? Thanks!
A simple way to reproduce would be using llamafactory:
And test for it. However, I cannot test it.
I also encountered the same problem in this version: when adjusting gradient_accumulation_steps, the loss scales accordingly.
I also encountered this problem, but I found that for me, it doesn't happen with all the models. It happens with "facebook/opt-350m" and "facebook/opt-1.3b", but not with "TinyLlama/TinyLlama_v1.1". The difference may be that opt is in 16 bits while TinyLlama is in 32 bits. I'm using a Titan RTX. I updated to 4.48.0 and the issue seems to be corrected for opt, but not for other models like "projecte-aina/FLOR-760M".
I think the bug is going to be fixed after #35651 is merged.
System Info
Version:
transformers==4.47.1
When I train a model using this version of the Hugging Face Trainer, the loss is accumulated by sum instead of mean. Put more precisely, the
tr_loss_step
is not divided by the global batch size. That means the reported loss scales proportionally with the global batch size. When I change the transformers version to 4.44.0, this problem is gone and everything works fine.
My global batch size is set to 128 here. You can see from the above image that
tr_loss_step = original_loss / 128
, which is not the case when the transformers version is 4.47.1, where
tr_loss_step = original_loss
.
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
transformers==4.47.1
Expected behavior
The loss should be accumulated by mean instead of sum.