
Improve Guidance for Using DDP in examples/pytorch #35667

Open
caojiaolong opened this issue Jan 13, 2025 · 2 comments
Labels
Feature request Request for a new feature

Comments

caojiaolong commented Jan 13, 2025

Feature request

The examples in examples/pytorch/ (e.g., semantic-segmentation) would benefit from clearer guidance on how to use Distributed Data Parallel (DDP) with the Trainer-based versions of the scripts.

Motivation

I modified the training script from run_semantic_segmentation.py for my task, and it worked well on one or two GPUs. However, when scaling to four GPUs, training became significantly slower. After several days of debugging, I realized that the default example in README.md does not use accelerate or any other distributed launcher, which meant the script had been running with Data Parallel (DP) rather than DDP the entire time.
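The DP-vs-DDP fallback can be hard to spot because nothing errors out. A minimal sketch of how one can tell the two launch modes apart: distributed launchers such as torchrun and `accelerate launch` export environment variables like WORLD_SIZE and LOCAL_RANK for each worker process, while a bare `python` invocation does not (the helper name `launch_mode` below is illustrative, not part of any library):

```python
import os

def launch_mode(env=None):
    """Guess how a training script was launched.

    torchrun and `accelerate launch` set WORLD_SIZE/LOCAL_RANK in each
    worker's environment; plain `python` leaves them unset, in which case
    the Trainer falls back to DataParallel on a multi-GPU machine.
    """
    env = os.environ if env is None else env
    if int(env.get("WORLD_SIZE", "1")) > 1:
        return "DDP"
    return "DP or single-process"

# Under `python script.py`, the launcher variables are absent:
print(launch_mode({}))  # DP or single-process
# Under `torchrun --nproc_per_node 4 script.py`, each worker sees:
print(launch_mode({"WORLD_SIZE": "4", "LOCAL_RANK": "0"}))  # DDP
```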

The default command for the Trainer version provided in the README.md is:

python run_semantic_segmentation.py \
    --model_name_or_path nvidia/mit-b0 \
    --dataset_name segments/sidewalk-semantic \
    --output_dir ./segformer_outputs/ \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --push_to_hub \
    --push_to_hub_model_id segformer-finetuned-sidewalk-10k-steps \
    --max_steps 10000 \
    --learning_rate 0.00006 \
    --lr_scheduler_type polynomial \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --logging_strategy steps \
    --logging_steps 100 \
    --eval_strategy epoch \
    --save_strategy epoch \
    --seed 1337

To enable DDP, the command needs to be modified as follows:

accelerate launch run_semantic_segmentation.py \
    --model_name_or_path nvidia/mit-b0 \
    --dataset_name segments/sidewalk-semantic \
    --output_dir ./segformer_outputs/ \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --push_to_hub \
    --push_to_hub_model_id segformer-finetuned-sidewalk-10k-steps \
    --max_steps 10000 \
    --learning_rate 0.00006 \
    --lr_scheduler_type polynomial \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --logging_strategy steps \
    --logging_steps 100 \
    --eval_strategy epoch \
    --save_strategy epoch \
    --seed 1337

While this might be obvious to experienced users, it can be misleading for new users like me, as the default command seems to imply it works efficiently across any number of GPUs.

Your contribution

To address this, we could include a note or alert in the README.md, highlighting that to use DDP with the Trainer, it is necessary to replace python with accelerate launch, torchrun, or another distributed launcher. This would greatly improve clarity for beginners and help avoid confusion.
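Beyond a README note, the example scripts themselves could surface the problem at startup. A hedged sketch of such a guard (the function name `warn_if_dataparallel` and the exact wording are hypothetical, not an existing transformers API): warn when more than one GPU is visible but no distributed launcher set WORLD_SIZE, since that is exactly the silent-DP situation described above.

```python
import os

def warn_if_dataparallel(n_visible_gpus, world_size=None):
    """Return a warning string if the script is about to fall back to
    DataParallel: multiple GPUs are visible, but no distributed launcher
    exported WORLD_SIZE. Returns None when the launch looks fine."""
    if world_size is None:
        world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if n_visible_gpus > 1 and world_size == 1:
        return (
            f"Detected {n_visible_gpus} GPUs but no distributed launcher; "
            "the Trainer will use DataParallel (DP), which is usually much "
            "slower. Re-launch with `accelerate launch` or `torchrun` to "
            "use DDP."
        )
    return None

# In a script one would call it with torch.cuda.device_count(), e.g.:
# msg = warn_if_dataparallel(torch.cuda.device_count())
# if msg: logger.warning(msg)
```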

@caojiaolong caojiaolong added the Feature request Request for a new feature label Jan 13, 2025
@Rocketknight1
Member

cc @muellerzr @SunMarc

@muellerzr
Contributor

I can agree with this :) (Or just state to use accelerate launch, since it will work on single GPU as well). Would you like to take a stab at it?
