Fix how we compute the final non-padding token for ForSequenceClassification models #35911

Open · wants to merge 12 commits into main from fix_sequence_classification_padding_side
Conversation

@Rocketknight1 (Member) commented on Jan 27, 2025:

We have a lot of CLM models with ForSequenceClassification heads. These models are supposed to use the hidden state at the final non-padding token as the input to the classification head. However, the way they compute it is a bit weird - they get the index of the leftmost token that is equal to pad_token_id and subtract 1 from it. This has a few issues:

  • It breaks on left-padding
  • It creates index-arithmetic issues when pad_token_id is absent, which then require workarounds
  • It depends on an implementation detail of argmax(): when the maximum value occurs at multiple indices, it always returns the smallest one

This PR replaces that logic with a simpler approach that searches directly for what we actually want: the rightmost non-padding token, rather than the token next to the leftmost padding token. As a result, the same code works with left-padding, right-padding, no padding, or even padding on both sides (I don't think any models do that, but we're ready if they do!)
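
A minimal sketch of the new selection logic on a toy batch (the values and `pad_token_id = 0` are hypothetical, just for illustration):

```python
import torch

# Toy batch covering left-padding, right-padding, and no padding
input_ids = torch.tensor([
    [0, 0, 5, 6, 7],  # left-padded
    [3, 4, 9, 0, 0],  # right-padded
    [1, 2, 3, 4, 5],  # no padding
])
pad_token_id = 0  # hypothetical pad id for this sketch

# Rightmost non-padding token per row, following the logic in this PR
non_pad_mask = torch.ne(input_ids, pad_token_id).int()
token_indices = torch.arange(input_ids.shape[-1])
last_non_pad_token = (token_indices * non_pad_mask).max(-1).values
print(last_non_pad_token)  # tensor([4, 2, 4])
```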

Fixes #30004
Fixes #35352
Fixes #35909

@Rocketknight1 force-pushed the fix_sequence_classification_padding_side branch from 4620592 to 8ccde63 on January 27, 2025, 19:14
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1 (Member Author) commented:

cc @ArthurZucker for core maintainer review - but if you have too much to do, let me know and I'll find another reviewer!

@Rocketknight1 (Member Author) commented:

cc @Cyrilvallez actually, as core-maintainer-in-training

@Cyrilvallez (Member) left a comment:

Hey! Nicely done, indeed much simpler than before! Just added some small comments! Let me know what you think!

src/transformers/models/bloom/modeling_bloom.py (outdated)
# To handle both left- and right- padding, we take the rightmost token that is not equal to pad_token_id
non_pad_mask = torch.ne(input_ids, self.config.pad_token_id).int().to(logits.device)
token_indices = torch.arange(input_ids.shape[-1], device=logits.device)
last_non_pad_token = (token_indices * non_pad_mask).max(-1).values
Member:

Small nit as well, but maybe argmax() would be simpler than max().values

Member Author:

This isn't an argmax! It's actually a fake argmax where I make a masked index array and then compute the max. It just looks a lot like an argmax, lol

Member:

Unless I'm mistaken, each value of 1 in the mask takes its own index as its value when multiplied by torch.arange, so max and argmax are fully equivalent here. But it's a detail anyway!

Member Author:

True, yes! I could just use argmax instead
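
For reference, a quick toy check of the equivalence discussed above (hypothetical values, with `pad_token_id = 0`):

```python
import torch

input_ids = torch.tensor([[0, 0, 5, 6, 7],   # left-padded
                          [3, 4, 0, 0, 0]])  # right-padded
non_pad_mask = torch.ne(input_ids, 0).int()
token_indices = torch.arange(input_ids.shape[-1])
# Non-pad positions hold their own index; pad positions hold 0
masked_indices = token_indices * non_pad_mask

print(masked_indices.max(-1).values)  # tensor([4, 1])
print(masked_indices.argmax(-1))      # tensor([4, 1]) -- identical, as noted above
```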

Comment on lines -904 to +897
loss = self.hf_compute_loss(tf.reshape(labels, [-1, 1]), tf.reshape(in_logits, [-1, self.num_labels]))

pooled_logits = in_logits if in_logits is not None else logits
loss = self.hf_compute_loss(tf.reshape(labels, [-1]), tf.reshape(pooled_logits, [-1, self.num_labels]))
Member:

I see the reshape dim has changed, I suppose it's the exact same without the dim 1, but just checking?

Member Author:

Yes, this should be the same or more correct! The loss is tested by the test_pt_tf_equivalence tests, so we should see if they go out of alignment
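
A small sketch of the shape change being discussed (hypothetical values; this assumes the underlying loss behaves like Keras's `SparseCategoricalCrossentropy(from_logits=True)`, which accepts labels shaped either `(N,)` or `(N, 1)`):

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
labels = tf.constant([2, 0])
logits = tf.constant([[0.1, 0.2, 3.0],
                      [2.5, 0.3, 0.1]])

old_style = loss_fn(tf.reshape(labels, [-1, 1]), logits)  # labels shaped (N, 1)
new_style = loss_fn(tf.reshape(labels, [-1]), logits)     # labels shaped (N,)
print(old_style.numpy(), new_style.numpy())  # same loss value either way
```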

Comment on lines 959 to 963
last_non_pad_token = -1
logger.warning_once(
f"{self.__class__.__name__} will not detect padding tokens in `inputs_embeds`. Results may be "
"unexpected if using padding tokens in conjunction with `inputs_embeds.`"
)
Member:

Instead of adding the warning everywhere, maybe we should do it the other way around and actually remove it from the few classes where it is present? It is a fairly niche case, and people using it should be aware of the caveat IMO. It's nice to have fewer warnings in general, WDYT? (It would also let us simplify the branching.)

Member Author:

Hmm - a lot of classes actually support inputs_embeds as well as input_ids, so I think we still need this on most classes! It only fires when users actually pass inputs_embeds and not input_ids.

Member:

Be aware that it breaks torch.compile as well, should anyone want to use it! But once again, it is quite a niche case, and it's already here for some models, so I'll let you judge -- we can keep it if you think that removing it would cause confusion/errors for users 🤗

Member Author:

I think it's better to leave the warning, because the results will generally be wrong in that case. In the future, we might consider raising an error and asking users to supply an attention_mask in that situation!
