Fix training crash issue on multi-nodes when dataloader_num_workers>0 #1721
What does this PR do?
Fixes # (issue)
Relevant Issue: SW-207456
Background: In a Gaudi2 Host NIC environment, multi-node training gets stuck in the "pt_data_worker" stage (Synapse 1.19) or fails with errors like
`RuntimeError: DataLoader worker (pid(s) 12844) exited unexpectedly`
(Synapse 1.17) when `dataloader_num_workers` is set to a value larger than 0.
According to the Habana documentation torch-multiprocessing-for-dataloaders, the default start method of the dataloader is `fork`, which may result in undefined behavior. It is therefore better to set `multiprocessing_context` to `forkserver` or `spawn` in the dataloader initialization of the Gaudi Trainer when `dataloader_num_workers` > 0. On Unix systems, `forkserver` starts a new process faster than `spawn` and inherits only the necessary resources, so `forkserver` is preferred. In this PR, this change has been applied to `get_train_dataloader`, `get_eval_dataloader`, and `get_test_dataloader`, respectively, as sketched below.
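The snippet below is a minimal sketch of the idea, not the exact diff in this PR: a standalone helper (the name `build_dataloader` and its parameters are illustrative assumptions) that passes `multiprocessing_context="forkserver"` to `torch.utils.data.DataLoader` only when workers are enabled, mirroring what the Gaudi Trainer dataloader methods do here.

```python
import multiprocessing
from torch.utils.data import DataLoader, Dataset


def build_dataloader(dataset: Dataset, batch_size: int, num_workers: int) -> DataLoader:
    """Illustrative sketch: build a DataLoader that avoids the 'fork' start method.

    In the actual PR, the equivalent logic lives inside GaudiTrainer's
    get_train_dataloader / get_eval_dataloader / get_test_dataloader.
    """
    dataloader_params = {
        "batch_size": batch_size,
        "num_workers": num_workers,
    }
    if num_workers > 0:
        # 'fork' (the Unix default) may result in undefined behavior on Gaudi;
        # 'forkserver' starts workers faster than 'spawn' and inherits only the
        # resources the worker processes actually need.
        dataloader_params["multiprocessing_context"] = multiprocessing.get_context("forkserver")
    return DataLoader(dataset, **dataloader_params)
```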
Before submitting