-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to predownload data from s3 at the start of each checkpoint. #280
base: main
Are you sure you want to change the base?
Conversation
959d8f1
to
101a787
Compare
101a787
to
d20b585
Compare
open_lm/file_utils.py
Outdated
stderr=subprocess.PIPE, | ||
) | ||
if result.returncode != 0: | ||
raise RuntimeError(f"Error: Failed to download data to local storage: {result.stderr.decode('utf-8')}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we try this a few times before erroring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need a barrier after the copy so the other workers don't start reading before worker 0 finishes copying?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add retries - there is already a dist barrier to make sure the training string is the same across all nodes, so no need to add more.
This adds an option that allows for predownloading of data at the start of each checkpoint to local storage. This helps with potential s3 errors.