
Improve Guidance for Using DDP in examples/pytorch #35667

Open
caojiaolong opened this issue Jan 13, 2025 · 2 comments
Labels
Feature request Request for a new feature

Comments

caojiaolong commented Jan 13, 2025

Feature request

The examples in examples/pytorch/ (e.g., semantic-segmentation) would benefit from clearer guidance on how to use Distributed Data Parallel (DDP) with the Trainer-based versions of the scripts.

Motivation

I modified the training script from run_semantic_segmentation.py for my task, and it worked well on one or two GPUs. However, when scaling to four GPUs, training became significantly slower. After several days of debugging, I realized that the default example in README.md does not use accelerate or any other distributed launcher, which meant the script had been running with Data Parallel (DP) rather than DDP the entire time.
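The DP-vs-DDP fallback can be hard to spot because nothing errors out. A minimal sketch of how one can tell the two launch modes apart: distributed launchers such as torchrun and `accelerate launch` export environment variables like WORLD_SIZE and LOCAL_RANK for each worker process, while a bare `python` invocation does not (the helper name `launch_mode` below is illustrative, not part of any library):

```python
import os

def launch_mode(env=None):
    """Guess how a training script was launched.

    torchrun and `accelerate launch` set WORLD_SIZE/LOCAL_RANK in each
    worker's environment; plain `python` leaves them unset, in which case
    the Trainer falls back to DataParallel on a multi-GPU machine.
    """
    env = os.environ if env is None else env
    if int(env.get("WORLD_SIZE", "1")) > 1:
        return "DDP"
    return "DP or single-process"

# Under `python script.py`, the launcher variables are absent:
print(launch_mode({}))  # DP or single-process
# Under `torchrun --nproc_per_node 4 script.py`, each worker sees:
print(launch_mode({"WORLD_SIZE": "4", "LOCAL_RANK": "0"}))  # DDP
```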

The default command for the Trainer version provided in the README.md is:

python run_semantic_segmentation.py \
    --model_name_or_path nvidia/mit-b0 \
    --dataset_name segments/sidewalk-semantic \
    --output_dir ./segformer_outputs/ \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --push_to_hub \
    --push_to_hub_model_id segformer-finetuned-sidewalk-10k-steps \
    --max_steps 10000 \
    --learning_rate 0.00006 \
    --lr_scheduler_type polynomial \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --logging_strategy steps \
    --logging_steps 100 \
    --eval_strategy epoch \
    --save_strategy epoch \
    --seed 1337

To enable DDP, the command needs to be modified as follows:

accelerate launch run_semantic_segmentation.py \
    --model_name_or_path nvidia/mit-b0 \
    --dataset_name segments/sidewalk-semantic \
    --output_dir ./segformer_outputs/ \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --push_to_hub \
    --push_to_hub_model_id segformer-finetuned-sidewalk-10k-steps \
    --max_steps 10000 \
    --learning_rate 0.00006 \
    --lr_scheduler_type polynomial \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --logging_strategy steps \
    --logging_steps 100 \
    --eval_strategy epoch \
    --save_strategy epoch \
    --seed 1337

While this might be obvious to experienced users, it can be misleading for new users like me, as the default command seems to imply it works efficiently across any number of GPUs.

Your contribution

To address this, we could include a note or alert in the README.md, highlighting that to use DDP with the Trainer, it is necessary to replace python with accelerate launch, torchrun, or another distributed launcher. This would greatly improve clarity for beginners and help avoid confusion.
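Beyond a README note, the example scripts themselves could surface the problem at startup. A hedged sketch of such a guard (the function name `warn_if_dataparallel` and the exact wording are hypothetical, not an existing transformers API): warn when more than one GPU is visible but no distributed launcher set WORLD_SIZE, since that is exactly the silent-DP situation described above.

```python
import os

def warn_if_dataparallel(n_visible_gpus, world_size=None):
    """Return a warning string if the script is about to fall back to
    DataParallel: multiple GPUs are visible, but no distributed launcher
    exported WORLD_SIZE. Returns None when the launch looks fine."""
    if world_size is None:
        world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if n_visible_gpus > 1 and world_size == 1:
        return (
            f"Detected {n_visible_gpus} GPUs but no distributed launcher; "
            "the Trainer will use DataParallel (DP), which is usually much "
            "slower. Re-launch with `accelerate launch` or `torchrun` to "
            "use DDP."
        )
    return None

# In a script one would call it with torch.cuda.device_count(), e.g.:
# msg = warn_if_dataparallel(torch.cuda.device_count())
# if msg: logger.warning(msg)
```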

@caojiaolong caojiaolong added the Feature request Request for a new feature label Jan 13, 2025
@Rocketknight1
Member

cc @muellerzr @SunMarc

@muellerzr
Contributor

I can agree with this :) (Or just state to use accelerate launch, since it will work on single GPU as well). Would you like to take a stab at it?
