Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for limit_train_batches to megatron sampler classes #10648

Closed
wants to merge 2 commits into from

Conversation

trias702
Copy link
Collaborator

What does this PR do ?

Adds support for limit_train_batches (as used in PTL) to work correctly. Without this PR, use of limit_train_batches less than 1.0 or num_samples will not work correctly when doing multi-epoch training in downstream libraries such as Nemo-Aligner.

Collection: NLP

Changelog

  • nemo/collections/nlp/data/language_modeling/megatron/data_samplers.py: alters constructor parameters for limit_train_batches and adds additional constructor logic to limit total_samples as necessary
  • nemo/collections/nlp/data/language_modeling/megatron/megatron_batch_samplers.py: alters constructor parameters for limit_train_batches and adds additional constructor logic to limit total_samples as necessary

Usage

You can now pass an additional parameter, limit_train_batches to all 4 Sampler classes

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: trias702 <[email protected]>
Copy link
Contributor

[🤖]: Hi @trias702 👋,

I just wanted to let you know that, you know, a CICD pipeline for this PR just finished successfully ✨

So it might be time to merge this PR or like to get some approvals 🚀

But I'm just a 🤖 so I'll leave it you what to do next.

Have a great day!

//cc @ko3n1g

@@ -133,8 +145,7 @@
return (num_available_samples + self.global_batch_size - 1) // self.global_batch_size

@abc.abstractmethod
def __iter__(self):
...
def __iter__(self): ...

Check notice

Code scanning / CodeQL

Statement has no effect Note

This statement has no effect.
Copy link
Contributor

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Oct 11, 2024
@trias702
Copy link
Collaborator Author

jenkins

@github-actions github-actions bot removed the stale label Oct 15, 2024
Copy link
Contributor

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Oct 29, 2024
@trias702
Copy link
Collaborator Author

trias702 commented Nov 2, 2024 via email

@github-actions github-actions bot removed the stale label Nov 2, 2024
Copy link
Contributor

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Nov 16, 2024
@trias702
Copy link
Collaborator Author

trias702 commented Nov 16, 2024 via email

Copy link
Collaborator

@odelalleau odelalleau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first sight it seems to me that this parameter doesn't belong to the sampler, and it should be up to the caller to provide the correct value for total_samples. Why do you think we need it in the sampler?

@github-actions github-actions bot removed the stale label Nov 17, 2024
Copy link
Contributor

github-actions bot commented Dec 1, 2024

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@trias702
Copy link
Collaborator Author

trias702 commented Dec 4, 2024

At first sight it seems to me that this parameter doesn't belong to the sampler, and it should be up to the caller to provide the correct value for total_samples. Why do you think we need it in the sampler?

We spoke about this before in a meeting, and I told you we can either hack it in via the algorithm class, or put it inside the Sampler, and you said inside the Sampler was the correct place for it.

@odelalleau
Copy link
Collaborator

At first sight it seems to me that this parameter doesn't belong to the sampler, and it should be up to the caller to provide the correct value for total_samples. Why do you think we need it in the sampler?

We spoke about this before in a meeting, and I told you we can either hack it in via the algorithm class, or put it inside the Sampler, and you said inside the Sampler was the correct place for it.

Ah, my memory is failing me, though as discussed in NVIDIA/NeMo-Aligner#321 (comment) there's a way to do it without it being a hack -- I assume we hadn't thought of this option when we spoke about it (really can't remember, sorry).

@github-actions github-actions bot removed the stale label Dec 5, 2024
Copy link
Contributor

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Dec 20, 2024
Copy link
Contributor

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this Dec 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants