Fix training crash issue on multi-nodes when dataloader_num_workers>0 #1721
What does this PR do?
Fixes # (issue)
Relevant Issue: SW-207456
Background: In a Gaudi2 Host NIC environment, multi-node training gets stuck in the "pt_data_worker" stage (Synapse 1.19) or fails with errors like
`RuntimeError: DataLoader worker (pid(s) 12844) exited unexpectedly`
(Synapse 1.17) when `dataloader_num_workers` is set to a value larger than 0.
According to the Habana documentation torch-multiprocessing-for-dataloaders, the default start method of the dataloader is `fork`, which may result in undefined behavior. It is therefore better to set `multiprocessing_context` to `forkserver` or `spawn` in the dataloader initialization of the Gaudi Trainer when `dataloader_num_workers` > 0. On Unix systems, `forkserver` starts a new process faster than `spawn` and inherits only the necessary resources, so `forkserver` is preferred. In this PR, this change has been applied to `get_train_dataloader`, `get_eval_dataloader`, and `get_test_dataloader`, respectively, as sketched below.
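The snippet below is a minimal sketch of the idea, not the exact diff in this PR: a standalone helper (the name `build_dataloader` and its parameters are illustrative assumptions) that passes `multiprocessing_context="forkserver"` to `torch.utils.data.DataLoader` only when workers are enabled, mirroring what the Gaudi Trainer dataloader methods do here.

```python
import multiprocessing
from torch.utils.data import DataLoader, Dataset


def build_dataloader(dataset: Dataset, batch_size: int, num_workers: int) -> DataLoader:
    """Illustrative sketch: build a DataLoader that avoids the 'fork' start method.

    In the actual PR, the equivalent logic lives inside GaudiTrainer's
    get_train_dataloader / get_eval_dataloader / get_test_dataloader.
    """
    dataloader_params = {
        "batch_size": batch_size,
        "num_workers": num_workers,
    }
    if num_workers > 0:
        # 'fork' (the Unix default) may result in undefined behavior on Gaudi;
        # 'forkserver' starts workers faster than 'spawn' and inherits only the
        # resources the worker processes actually need.
        dataloader_params["multiprocessing_context"] = multiprocessing.get_context("forkserver")
    return DataLoader(dataset, **dataloader_params)
```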
Before submitting