Added gaussian label smoother for number tokens #34

ad045 · 2025-01-06T21:06:16Z

Description:

This pull request introduces a label smoothing technique that applies Gaussian smoothing exclusively to number tokens. The key changes are outlined below:

Gaussian Label Smoother:
- Implemented the GaussianLabelSmoother class in src/ntl/utils/label_smoother.py.
- Controlled via two new ModelArguments: gaussian_label_smoother and label_smoother_sigma.

Number Token Selector:
- Extracted the token selection logic from NumberTokenLoss into a new NumberTokenSelector class located in src/ntl/utils/number_token_selector.py.
- Utilized by both GaussianLabelSmoother and NumberTokenLoss to consistently select number tokens.

Other Changes:
- Updated src/ntl/run_language_modeling.py and src/ntl/trainer.py to support the new label smoothing feature.
- Modified src/ntl/loss_functions/number_token_loss.py to integrate with the new NumberTokenSelector.
- Updated src/ntl/args.py to include new arguments for the Gaussian Label Smoother.
- Added unit tests to ensure the correctness of the GaussianLabelSmoother and NumberTokenSelector.
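
To illustrate the idea behind Gaussian smoothing restricted to number tokens, here is a minimal dependency-free sketch. It is not the PR's GaussianLabelSmoother, which operates on tensors and whose exact loss computation is not shown in this conversation. The names `SmootherArgs`, `gaussian_smooth`, and `smoothed_cross_entropy` are hypothetical. Only the two field names come from the description, and their types and defaults are assumptions. The one-hot fallback for sigma = 0 follows the case distinction mentioned in the commit list.

```python
import math
from dataclasses import dataclass


@dataclass
class SmootherArgs:
    # Field names mirror the two ModelArguments named in the description;
    # the types and defaults are assumptions made for this sketch.
    gaussian_label_smoother: bool = False
    label_smoother_sigma: float = 1.0


def gaussian_smooth(target_value, number_values, sigma):
    """Return a distribution over number tokens centred on target_value."""
    if sigma == 0:
        # Degenerate case: no smoothing, fall back to a one-hot target.
        return [1.0 if v == target_value else 0.0 for v in number_values]
    # Unnormalised Gaussian weight for each number token's numeric value.
    weights = [
        math.exp(-((v - target_value) ** 2) / (2 * sigma ** 2))
        for v in number_values
    ]
    total = sum(weights)
    return [w / total for w in weights]


def smoothed_cross_entropy(log_probs, target_value, number_values, args):
    """Cross-entropy between the smoothed target and finite log-probabilities."""
    sigma = args.label_smoother_sigma if args.gaussian_label_smoother else 0.0
    target = gaussian_smooth(target_value, number_values, sigma)
    return -sum(t * lp for t, lp in zip(target, log_probs))
```

With the digits 0 to 9 as number values and a target of 3, the smoothed target puts the most mass on 3 and equal mass on 2 and 4, so predicting a numerically close digit is penalised less than predicting a distant one.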

Types of Changes

New feature (non-breaking change which adds functionality)
Refactor (non-breaking change which improves the code structure)


* Refactored number token loss to get a number token selector
* Added first implementation of gaussian label smoother
* Added test cases for the label smoother
* Small change to args
* added changes to number_token_loss and label_smoother
* added number_token_selector.py
* added init
* changed trainer
* Added comments, and valid mask before one-hot encoding
* added a case distinction for sigma=0
* For label_smoother: Fixed gradient flow for no valid number tokens case. Added more extensive testing of the gaussian smoother.
* Fixed bug in test file
* nvocab fix
* Bigger commit, will probably fail the gce tests
* Cleaned up
* Cleaned up

Co-authored-by: ad045

jannisborn

Great work @ad045! Seems very flexible and easy to use. Super solid job on the test suite 👍🏼

I left some cosmetic comments where unnecessary things can be removed, but conceptually we are already ready to merge!

jannisborn · 2025-01-07T09:56:50Z

src/ntl/run_language_modeling.py

@@ -349,6 +364,8 @@ def run_language_modeling(model_args: ModelArguments, training_args: TrainingArg
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
+        # selector=selector, 


Remove comment

jannisborn · 2025-01-07T09:57:08Z

src/ntl/run_language_modeling.py

+            selector=selector    
+        )
+    else: 
+        selector = None


Remove selector since it is unused downstream

jannisborn · 2025-01-07T23:14:28Z

src/ntl/utils/number_token_selector.py

+                self.nvocab[id] = self.tokenizer.decode_number_token(token, ignore_order=True)
+
+
+    def select_number_tokens(self, logits: Tensor, labels: Tensor):


Labels are not used in here so I would remove them from the signature

and remove them from the return statement below then obviously
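
The signature change suggested above can be sketched as follows. This is a hypothetical stand-in (`NumberTokenSelectorSketch`) that uses nested lists instead of tensors to stay dependency-free. The NaN-based mask mirrors the `~torch.isnan(self.nvocab)` logic that this PR moves out of number_token_loss.py. The constructor and return shape are assumptions, not the merged implementation.

```python
import math


class NumberTokenSelectorSketch:
    """Hypothetical list-based stand-in for NumberTokenSelector."""

    def __init__(self, nvocab):
        # nvocab[i] holds the numeric value of token i, or NaN if token i
        # is not a number token (assumed layout, following the diff).
        self.nvocab = nvocab
        self.number_tokens = [not math.isnan(v) for v in nvocab]

    def select_number_tokens(self, logits):
        """Keep only the vocabulary entries that belong to number tokens.

        logits is indexed as [batch][position][vocab]; labels are no longer
        part of the signature or the return value.
        """
        selected = [
            [
                [x for x, keep in zip(position, self.number_tokens) if keep]
                for position in sequence
            ]
            for sequence in logits
        ]
        return selected, self.number_tokens
```

Call sites would then unpack two values, for example `logits, number_tokens = selector.select_number_tokens(logits)`, and keep using their own `labels` unchanged.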

jannisborn · 2025-01-07T23:15:50Z

src/ntl/utils/label_smoother.py

+                raise AttributeError("The selector must have an attribute 'nvocab' representing the number of valid vocab tokens.")
+
+            # Select number tokens
+            logits, labels, number_tokens = self.selector.select_number_tokens(logits, labels)


remove labels here

jannisborn · 2025-01-07T23:15:59Z

src/ntl/loss_functions/number_token_loss.py

-        # Create a mask to filter out non-digit tokens
-        number_tokens = ~torch.isnan(self.nvocab)
-        logits = logits[:, :, number_tokens]
+        logits, labels, number_tokens = self.selector.select_number_tokens(logits, labels)


remove labels here also

ad045 added 16 commits December 29, 2024 07:40

Refactored number token loss to get a number token selector

b2668d3

Added first implementation of gaussian label smoother

00f7dec

Added test cases for the label smoother

e3c3d3f

Small change to args

31eabb4

added changes to number_token_loss and label_smoother

c9911f6

added number_token_selector.py

ce243f4

added init

c2fc4fc

changed trainer

0743276

Added comments, and valid mask before one-hot encoding

90714a9

added a case distinction for sigma=0

4646bb3

For label_smoother: Fixed gradient flow for no valid number tokens case. Added more extensive testing of the gaussian smoother.

17a67b6

Fixed bug in test file

628dff5

nvocab fix

1abc067

Bigger commit, will probably fail the gce tests

0e88bc2

Cleaned up

a98fa3a

Cleaned up

b4ef612

ad045 marked this pull request as ready for review January 6, 2025 21:07

jannisborn mentioned this pull request Jan 7, 2025

Gce #35

Merged

jannisborn approved these changes Jan 7, 2025


Jonas Zausinger and others added 6 commits January 8, 2025 16:27

made gce compatible with ntl

e7f8108

Cosmetic fix: Removed unused labels

f5a9140

Cosmetic fix: Removed unused labels

5ca9d3b

Fix: Removed comment and unused selector

b6fb1a3

Fix: Removed unused labels

92ef6fc

Merge branch 'gce' into gce

71316e7

zausin33 closed this Jan 9, 2025
