
PR #35438 introduced a new bug #35649

Open · 2 of 4 tasks
techkang opened this issue Jan 13, 2025 · 7 comments · May be fixed by #35651

@techkang (Contributor)

System Info

(base) MBP-HD6JD9Q599-2052 :: ~/code/transformers ‹main*› % transformers-cli env

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

  • transformers version: 4.49.0.dev0
  • Platform: macOS-14.6.1-arm64-arm-64bit
  • Python version: 3.12.4
  • Huggingface_hub version: 0.27.1
  • Safetensors version: 0.4.3
  • Accelerate version: 1.2.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:

Who can help?

@muellerzr @hiyouga @ArthurZucker @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

export RUN_SLOW=True
pytest tests/trainer/test_trainer.py::TrainerIntegrationPrerunTest::test_gradient_accumulation_loss_alignment_with_loss_func

======================================================= short test summary info =======================================================
FAILED tests/trainer/test_trainer.py::TrainerIntegrationPrerunTest::test_gradient_accumulation_loss_alignment_with_loss_func - AssertionError: 3.0949999999999998 not less than 0.01 : Difference 3.0949999999999998 is not within 0.01
=================================================== 1 failed, 2 warnings in 54.91s ====================================================

Expected behavior

The test passes.

techkang added the bug label on Jan 13, 2025
@techkang (Contributor, Author)

PR link: #35438

techkang changed the title from "PR https://github.com/huggingface/transformers/pull/35438 introduced a new bug" to "PR #35438 introduced a new bug" on Jan 13, 2025
@techkang (Contributor, Author) commented Jan 13, 2025

I think PR #35438 should be reverted; the proper way to fix the bug mentioned in that PR is as follows.

In the following code, the loss is scaled when num_items_in_batch is None:
https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L3712-L3713

But the true intent of this code is to scale the loss only when the gradient-accumulation (GA) fix is not applied, and after recent PRs that is no longer equivalent to num_items_in_batch being None. So the condition should be changed to

if not self.model_accepts_loss_kwargs and self.compute_loss_func is None
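
A minimal, standalone sketch (not the actual trainer.py code) that makes the two conditions explicit. The names model_accepts_loss_kwargs, compute_loss_func, and num_items_in_batch mirror the Trainer attributes referenced above; grad_accum_steps and the plain division stand in for the real scaling logic:

def scale_loss_current(loss, num_items_in_batch, grad_accum_steps):
    # Condition currently on main: scale whenever num_items_in_batch is None.
    if num_items_in_batch is None:
        loss = loss / grad_accum_steps
    return loss


def scale_loss_proposed(loss, model_accepts_loss_kwargs, compute_loss_func, grad_accum_steps):
    # Proposed condition: scale only when the GA fix is not in effect, i.e. the
    # model does not accept loss kwargs and no custom loss function is set.
    if not model_accepts_loss_kwargs and compute_loss_func is None:
        loss = loss / grad_accum_steps
    return loss


# The two predicates disagree whenever the None-ness of num_items_in_batch does
# not track the (model_accepts_loss_kwargs, compute_loss_func) pair, e.g. a
# custom loss function is set but num_items_in_batch comes through as None:
print(scale_loss_current(4.0, None, 4))                             # 1.0
print(scale_loss_proposed(4.0, False, lambda out, labels: None, 4)) # 4.0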

@muellerzr (Contributor)

Thanks! Would you like to make a PR for this? Otherwise I can do so today.

@techkang (Contributor, Author)

@muellerzr Thanks for the reply. I will open a PR today.

techkang linked pull request #35651 on Jan 13, 2025 that will close this issue
@hiyouga (Contributor) commented Jan 13, 2025

Hi @techkang, our rigorous experiments in #35438 showed evidence that #35121 introduced a bug that makes the loss of the Qwen2VL model incorrect. I think we should not only focus on models with a loss function but also pay attention to models without loss_kwargs. There should be a solution that lets both cases work instead of simply reverting our fix. cc @muellerzr
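
To make the two model families in this comment concrete, here is a small, hedged enumeration (again not Trainer code, just the predicate from the proposed condition) of which combinations of "model accepts loss kwargs" and "user-supplied compute_loss_func" would still receive the extra gradient-accumulation division; only the combination with neither mechanism gets scaled:

from itertools import product

# Enumerate the four combinations and show which ones the proposed condition
# would still divide by the gradient-accumulation steps.
for accepts_loss_kwargs, has_loss_func in product([True, False], repeat=2):
    scales = not accepts_loss_kwargs and not has_loss_func
    print(f"accepts_loss_kwargs={accepts_loss_kwargs!s:<5} "
          f"compute_loss_func set={has_loss_func!s:<5} -> scaled by GA steps: {scales}")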

@techkang (Contributor, Author)

Hi @hiyouga, #35121 indeed introduced a bug, but I don't think #35438 is the proper way to fix it. Could you try to verify the Qwen2VL loss with the new PR, #35651?

@hiyouga (Contributor) commented Jan 13, 2025

@techkang Yep, the new PR looks better to me; let's run some experiments on it.
