Distributed Data Parallel Training #76
base: main
Conversation
I don't like this. In this version of the callback, I didn't have to do this. I think the trick is to make sure that each worker runs through the same data.
But also, consider that in my latest training version (which I have only in a branch as yet), I don't even have the callback anymore. Can we just get rid of that whole problem area by making sure the trainable model computes its metrics during forward()?
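For illustration, here is a minimal sketch of what "metrics during forward()" could look like. The class name, the wrapped encoder, and the metric choices are hypothetical and not the actual catwalk model; this is just the shape of the idea.

import torch
import torch.nn as nn


class MetricsInForwardModel(nn.Module):
    """Hypothetical sketch: the trainable model computes its own loss and
    accuracy inside forward(), so no separate validation callback (and no
    cross-process metric synchronization) is needed under DDP."""

    def __init__(self, encoder: nn.Module, hidden_size: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, inputs, labels=None):
        logits = self.classifier(self.encoder(inputs))
        output = {"logits": logits}
        if labels is not None:
            # Each DDP worker computes metrics over its own shard only;
            # nothing outside forward() needs to touch metric state.
            output["loss"] = nn.functional.cross_entropy(logits, labels)
            output["accuracy"] = (logits.argmax(dim=-1) == labels).float().mean()
        return output

Because every worker follows the same code path, per-shard metrics could simply be averaged (or all-reduced) afterwards if a global number is wanted.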
catwalk/steps.py
Outdated
# if distributed, this model hasn't been sent to device yet
trainable_model.to(resolve_device())
Why do we have to return a model that's tied to a device?
Hmm, I had a crash where, without moving the model back to the GPU, inputs on the GPU would be fed to the model still sitting on CPU during the final evaluation. I can't replicate this now, though. I suspect it was required for the failed attempt to integrate the fairscale code in #79.
As it stands, there is no crash without this code, but it does mean that the final evaluation will run on CPU rather than GPU, as it would when running a non-trainable model. I can't just make TrainableRankClassificationModel.predict() use resolve_device() like the non-trainable predict() does, because invoking predict at all inside the training process would mess up the distributed device map.
Perhaps the best place to move it to the GPU would be in catwalk.train.py, like this?
model_step = FinetuneStep(
    model=args.model,
    tasks=tasks,
    batch_size=args.batch_size,
    grad_accum=args.grad_acc,
    device_count=args.device_count,
)
# The step result is the trained model; move it to the resolved device here,
# outside the distributed training process.
model_step = model_step.result().to(resolve_device())
I agree, getting rid of the validation callback altogether would be the best solution. I'm concerned that's a bit beyond the scope of what I can accomplish this week; all this distributed processing stuff has me mostly feeling around in the dark because of my lack of systems background. This PR is not a necessary dependency of the IA3 PR #81, so if it's going to be superseded by your rework of the training code in that branch, perhaps we should just skip this PR?
I've reverted these changes in the IA3 PR #81, as they are not actually necessary for that PR, and I don't want this one to block it.
I will revisit this after #84 is merged.
What is here
Fixes support for distributed training with data parallelism. Previously, torchmetrics would attempt to synchronize across processes during the validation callback, causing a crash. Also, the final model output by the FinetuneStep was on CPU rather than GPU, unlike in non-distributed usage; the returned model is now on GPU.
Limitations
Validation is still done in a single process, given how the data parallelism is set up here.
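For context, a minimal sketch of what single-process validation can look like under DDP: only rank 0 runs the loop, and the metric is told not to sync across the process group, so the cross-process synchronization crash described above cannot occur. The helper name, the rank-0 gating, and the sync_on_compute flag (available in recent torchmetrics releases) are my assumptions, not the actual catwalk implementation.

import torch
import torch.distributed as dist
from torchmetrics.classification import MulticlassAccuracy


def validate_on_rank_zero(model, val_loader, device, num_classes):
    """Hypothetical sketch: run validation only on rank 0 so metric state
    never tries to synchronize with ranks that skipped validation."""
    if dist.is_available() and dist.is_initialized() and dist.get_rank() != 0:
        return None
    # sync_on_compute=False keeps compute() local to this process.
    metric = MulticlassAccuracy(num_classes=num_classes, sync_on_compute=False).to(device)
    model.eval()
    with torch.no_grad():
        for inputs, labels in val_loader:
            logits = model(inputs.to(device))
            metric.update(logits, labels.to(device))
    return metric.compute().item()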
Reproduction
Running with and without multiple devices produces exactly the same validation metrics, though the printout differs slightly because tasks are copied:
python -m catwalk.train --model rc::gpt2 --task piqa --device_count 1 --batch_size 16
python -m catwalk.train --model rc::gpt2 --task piqa --device_count 2 --batch_size 16