Fix training crash on multi-node setups when dataloader_num_workers > 0 #1721

Open · wants to merge 1 commit into base: main
Conversation

Wei-Lin-Intel (Contributor)
What does this PR do?

Relevant Issue: SW-207456
Background: In a Gaudi2 host-NIC environment, multi-node training gets stuck in the "pt_data_worker" stage (Synapse 1.19), or fails with errors such as `RuntimeError: DataLoader worker (pid(s) 12844) exited unexpectedly` (Synapse 1.17), when dataloader_num_workers is set to a value larger than 0.

According to the Habana documentation on torch-multiprocessing-for-dataloaders, the dataloader's default start method is fork, which may result in undefined behavior. It is therefore better to set multiprocessing_context to forkserver or spawn when the Gaudi Trainer initializes its dataloaders with dataloader_num_workers > 0.

On Unix systems, forkserver starts a new process faster than spawn and inherits only the necessary resources, so forkserver is preferred. In this PR, the change is applied to get_train_dataloader, get_eval_dataloader, and get_test_dataloader; a minimal sketch of the idea follows below.
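For illustration, here is a minimal sketch of the approach using a plain PyTorch `DataLoader`. The helper name `build_dataloader` and its signature are hypothetical; the actual PR applies the equivalent change inside the Gaudi Trainer's dataloader methods:

```python
import multiprocessing

from torch.utils.data import DataLoader


def build_dataloader(dataset, batch_size, num_workers):
    """Hypothetical helper: build a DataLoader that avoids the default
    "fork" start method whenever worker processes are enabled."""
    mp_context = None
    if num_workers > 0:
        # Prefer "forkserver" (faster startup on Unix and only the
        # necessary resources are inherited); fall back to "spawn"
        # on platforms where "forkserver" is unavailable.
        available = multiprocessing.get_all_start_methods()
        method = "forkserver" if "forkserver" in available else "spawn"
        mp_context = multiprocessing.get_context(method)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        multiprocessing_context=mp_context,
    )
```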

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@Wei-Lin-Intel (Contributor, Author)

@ssarkar2 @libinta Please help to review this PR, thanks.
