
Pytorch 2.5 & torchtune 0.3+ #315

Open · wants to merge 38 commits into master
Conversation

@Delaunay (Collaborator)

No description provided.

@Delaunay changed the title from Staging to Pytorch 2.5 on Nov 22, 2024
@Delaunay changed the title from Pytorch 2.5 to Pytorch 2.5 & torchtune 0.3+ on Nov 22, 2024
@Delaunay marked this pull request as ready for review on January 16, 2025 at 20:02
@Delaunay (Collaborator, Author)

=================
Benchmark results
=================

System
------
cpu:      Intel(R) Xeon(R) Gold 5418Y
n_cpu:    48
product:  NVIDIA L40S
n_gpu:    4
memory:   46068.0 MiB (per GPU)
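
The fields above can be reproduced with standard APIs. A minimal sketch of such a probe, assuming torch is installed; this is not milabench's actual system report code:

```python
# Hypothetical probe reproducing the System fields above (not milabench code).
import os
import torch

print(f"n_cpu:    {os.cpu_count()}")                  # logical CPUs
print(f"n_gpu:    {torch.cuda.device_count()}")
print(f"product:  {torch.cuda.get_device_name(0)}")
props = torch.cuda.get_device_properties(0)
print(f"memory:   {props.total_memory / 2**20:.1f}")  # MiB, per GPU
```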

Breakdown
---------
bench                    | fail |   n | ngpu |           perf |   sem% |   std% | peak_memory |          score | weight
brax                     |    0 |   1 |    4 |     1018714.82 |   0.1% |   0.5% |        1312 |     1018714.82 |   1.00
diffusion-gpus           |    1 |   1 |    4 |            nan |   nan% |   nan% |       28760 |            nan |   1.00
diffusion-single         |    4 |   4 |    1 |            nan |   nan% |   nan% |         nan |            nan |   0.00
dimenet                  |    0 |   4 |    1 |         556.91 |   0.8% |  12.1% |        3850 |        2252.87 |   1.00
dinov2-giant-gpus        |    1 |   1 |    4 |            nan |   nan% |   nan% |       22954 |            nan |   1.00
dinov2-giant-single      |    4 |   4 |    1 |            nan |   nan% |   nan% |        7878 |            nan |   0.00
dqn                      |    0 |   4 |    1 | 24104883766.33 |   1.6% |  90.6% |        1322 | 96296597893.65 |   0.00
bf16                     |    0 |   4 |    1 |         280.06 |   0.2% |   4.6% |        1278 |        1124.35 |   0.00
fp16                     |    0 |   4 |    1 |         275.07 |   0.2% |   2.8% |        1278 |        1102.02 |   0.00
fp32                     |    0 |   4 |    1 |          48.63 |   0.1% |   2.5% |        1656 |         194.48 |   0.00
tf32                     |    0 |   4 |    1 |         139.59 |   0.1% |   2.7% |        1656 |         558.79 |   0.00
bert-fp16                |    0 |   4 |    1 |         206.74 |   0.9% |  10.4% |         nan |         840.32 |   0.00
bert-fp32                |    0 |   4 |    1 |          71.81 |   0.4% |   4.9% |       20660 |         289.38 |   0.00
bert-tf32                |    0 |   4 |    1 |         119.53 |   0.6% |   7.1% |       20660 |         483.44 |   0.00
bert-tf32-fp16           |    0 |   4 |    1 |         207.24 |   0.9% |  10.4% |         nan |         842.42 |   1.00
reformer                 |    0 |   4 |    1 |          29.71 |   0.2% |   2.9% |       12940 |         119.24 |   1.00
t5                       |    0 |   4 |    1 |          30.78 |   0.3% |   4.5% |       33876 |         123.74 |   0.00
whisper                  |    0 |   4 |    1 |         425.06 |   0.5% |   8.2% |        8724 |        1715.05 |   0.00
lightning                |    0 |   4 |    1 |         510.27 |   0.4% |   7.5% |       25808 |        2054.43 |   0.00
lightning-gpus           |    0 |   2 |    4 |        2003.68 |   0.4% |   6.2% |       26198 |        2003.68 |   1.00
llava-single             |    4 |   4 |    1 |            nan |   nan% |   nan% |       11064 |            nan |   1.00
llama                    |    0 |   4 |    1 |         295.56 |   6.9% |  87.7% |       27202 |        1119.03 |   1.00
llm-full-mp-gpus         |    0 |   1 |    4 |          33.13 |   3.5% |  18.4% |       30918 |          33.13 |   1.00
llm-lora-ddp-gpus        |    1 |   1 |    4 |            nan |   nan% |   nan% |         nan |            nan |   1.00
llm-lora-mp-gpus         |    1 |   1 |    4 |            nan |   nan% |   nan% |         nan |            nan |   1.00
llm-lora-single          |    4 |   4 |    1 |            nan |   nan% |   nan% |         nan |            nan |   1.00
pna                      |    0 |   4 |    1 |        4350.11 |   0.5% |   7.7% |       39200 |       17412.14 |   1.00
ppo                      |    0 |   4 |    1 |    60873776.74 |   0.7% |  58.0% |         978 |   243494498.61 |   1.00
recursiongfn             |    0 |   4 |    1 |       10054.17 |   2.3% |  35.8% |        6702 |       40477.55 |   1.00
rlhf-gpus                |    1 |   1 |    4 |            nan |   nan% |   nan% |         nan |            nan |   0.00
rlhf-single              |    4 |   4 |    1 |            nan |   nan% |   nan% |         nan |            nan |   1.00
focalnet                 |    0 |   4 |    1 |         366.37 |   0.7% |  10.6% |       23038 |        1482.12 |   0.00
torchatari               |    0 |   4 |    1 |        7479.27 |   0.7% |  10.3% |        3264 |       29862.23 |   1.00
convnext_large-fp16      |    0 |   4 |    1 |         276.93 |   1.1% |  12.1% |         nan |        1128.72 |   0.00
convnext_large-fp32      |    0 |   4 |    1 |          70.28 |   0.6% |   6.5% |       44910 |         283.89 |   0.00
convnext_large-tf32      |    0 |   4 |    1 |         119.20 |   1.0% |  11.2% |       45502 |         483.95 |   0.00
convnext_large-tf32-fp16 |    0 |   4 |    1 |         276.49 |   1.1% |  12.1% |         nan |        1126.82 |   1.00
regnet_y_128gf           |    0 |   4 |    1 |          91.04 |   0.4% |   6.6% |       28810 |         366.72 |   1.00
resnet152-ddp-gpus       |    0 |   1 |    4 |        2020.81 |   0.0% |   0.3% |       25994 |        2020.81 |   0.00
resnet50                 |    0 |   4 |    1 |         906.37 |   0.7% |  10.2% |       13868 |        3666.19 |   1.00
resnet50-noio            |    0 |   4 |    1 |         861.21 |   0.0% |   1.6% |       26884 |        3446.88 |   0.00
vjepa-gpus               |    1 |   1 |    4 |            nan |   nan% |   nan% |         nan |            nan |   1.00
vjepa-single             |    4 |   4 |    1 |            nan |   nan% |   nan% |         nan |            nan |   1.00

Scores
------
Failure rate:      20.98% (FAIL)
Score:             204.45

Errors
------
30 errors, details in HTML report.
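
For reference, the reported failure rate follows from the fail and n columns above: they sum to 30 failed runs out of 143 total, and 30 / 143 ≈ 20.98%, matching the score report (and the 30 errors). A minimal sketch, assuming the failure rate is simply total fails over total runs:

```python
# Hypothetical recomputation of the failure rate from (bench, fail, n) rows;
# assumes failure_rate = sum(fail) / sum(n). milabench's scoring may differ.
rows = [
    ("brax", 0, 1),
    ("diffusion-gpus", 1, 1),
    ("diffusion-single", 4, 4),
    # ... remaining (bench, fail, n) rows from the Breakdown table ...
]

fails = sum(fail for _, fail, _ in rows)
runs = sum(n for _, _, n in rows)
print(f"Failure rate: {100 * fails / runs:.2f}%")  # full table: 30/143 -> 20.98%
```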

@Delaunay (Collaborator, Author)

vjepa-gpus               |    4 |   5 |    4 |           8.71 |   0.2% |   1.8% |       23328 |           1.74 |   1.00
vjepa-single             |    4 |   8 |    1 |           3.91 |   0.9% |  13.8% |         nan |           7.91 |   1.00

@Delaunay (Collaborator, Author)

rlhf-gpus                |    5 |   6 |    4 |         140.15 |   1.0% |   8.4% |        8296 |          23.36 |   0.00
rlhf-single              |    4 |   8 |    1 |          46.97 |   0.9% |  14.6% |        8228 |          94.05 |   1.00

@Delaunay (Collaborator, Author)

diffusion-gpus           |    2 |   3 |    4 |           8.41 |   0.1% |   0.9% |       28748 |           2.80 |   1.00
diffusion-single         |    4 |   8 |    1 |           5.19 |   0.6% |   9.9% |       18818 |          10.49 |   0.00

@Delaunay (Collaborator, Author)

dinov2-giant-gpus        |    1 |   2 |    4 |          23.10 |   1.3% |  10.0% |       30258 |          11.55 |   1.00
dinov2-giant-single      |    4 |   8 |    1 |           6.46 |   0.8% |  12.9% |       28604 |          13.04 |   0.00

@Delaunay (Collaborator, Author)

llava out of memory

llava-single.D3
===============
  * no training rate retrieved
  * Error codes = 1
  * 1 exceptions found
    * 1 x torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacity of 44.64 GiB of which 54.25 MiB is free. Including non-PyTorch memory, this process has 44.58 GiB memory in use. Of the allocated memory 43.68 GiB is allocated by PyTorch, and 393.78 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
        | Traceback (most recent call last):
        |   File "/network/scratch/d/delaunap/shared/milabench/benchmarks/llava/main.py", line 151, in <module>
        |     main()
        |   File "/network/scratch/d/delaunap/shared/milabench/benchmarks/llava/main.py", line 134, in main
        |     optimizer.step()
        |   File "/tmp/workspace/venv/torch/lib/python3.10/site-packages/accelerate/optimizer.py", line 178, in step
        |     self.optimizer.step(closure)
        |   File "/tmp/workspace/venv/torch/lib/python3.10/site-packages/torch/optim/optimizer.py", line 487, in wrapper
        |     out = func(*args, **kwargs)
        |   File "/tmp/workspace/venv/torch/lib/python3.10/site-packages/torch/optim/optimizer.py", line 91, in _use_grad
        |     ret = func(self, *args, **kwargs)
        |   File "/tmp/workspace/venv/torch/lib/python3.10/site-packages/torch/optim/adamw.py", line 209, in step
        |     has_complex = self._init_group(
        |   File "/tmp/workspace/venv/torch/lib/python3.10/site-packages/torch/optim/adamw.py", line 148, in _init_group
        |     state["exp_avg"] = torch.zeros_like(
        | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacity of 44.64 GiB of which 54.25 MiB is free. Including non-PyTorch memory, this process has 44.58 GiB memory in use. Of the allocated memory 43.68 GiB is allocated by PyTorch, and 393.78 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
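
As the error message itself suggests, one mitigation to try is the CUDA caching allocator's expandable segments mode, which reduces fragmentation. A minimal sketch; whether this alone is enough for llava-single on the L40S is untested here:

```python
# Enable expandable segments in the CUDA caching allocator, as recommended by
# the OOM message. The setting is read when CUDA is first used, so set it
# before importing torch (or export it in the shell before launching).
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after setting the env var on purpose

x = torch.zeros(1, device="cuda")  # first allocation picks up the new config
```

The same effect can be obtained from the shell: `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python main.py`.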

@Delaunay (Collaborator, Author)

llm-lora-ddp-gpus |    0 |   1 |    4 |     552.09 |   1.5% |   7.8% |       10288 |     552.09 |   1.00

@Delaunay (Collaborator, Author)

llm-lora-mp-gpus out of memory

@Delaunay (Collaborator, Author)

Breakdown
---------
bench           | fail |   n | ngpu |       perf |   sem% |   std% | peak_memory |      score | weight
llm-lora-single |    0 |   1 |    1 |    1083.89 |   2.9% |  15.4% |       17364 |    1083.89 |   1.00

@Delaunay (Collaborator, Author)

llm-lora-mp-gpus |    0 |   1 |    4 |      26.22 |   3.8% |  20.4% |       13842 |      26.22 |   1.00

@Delaunay (Collaborator, Author)

With BS=1 (batch size 1):

Breakdown
---------
bench                    | fail |   n | ngpu |           perf |   sem% |   std% | peak_memory |           score | weight
brax                     |    1 |   1 |    4 |            nan |   nan% |   nan% |         nan |             nan |   1.00
diffusion-gpus           |    0 |   1 |    4 |           8.42 |   0.1% |   1.1% |       22076 |            8.42 |   1.00
diffusion-single         |    0 |   4 |    1 |           5.18 |   0.6% |   9.8% |       18818 |           20.95 |   0.00
dimenet                  |    4 |   4 |    1 |            nan |   nan% |   nan% |         nan |             nan |   1.00
dinov2-giant-gpus        |    0 |   1 |    4 |          23.42 |   1.2% |   9.3% |       30394 |           23.42 |   1.00
dinov2-giant-single      |    0 |   4 |    1 |           6.54 |   0.6% |  10.1% |       28604 |           26.38 |   0.00
dqn                      |    0 |   4 |    1 | 25110752643.15 |   1.6% |  90.0% |         874 | 100339640366.34 |   0.00
bf16                     |    0 |   4 |    1 |         279.25 |   0.2% |   4.4% |        1278 |         1121.01 |   0.00
fp16                     |    0 |   4 |    1 |         274.61 |   0.1% |   2.5% |        1278 |         1100.14 |   0.00
fp32                     |    0 |   4 |    1 |          48.41 |   0.1% |   1.6% |        1656 |          193.63 |   0.00
tf32                     |    0 |   4 |    1 |         138.95 |   0.1% |   2.5% |        1656 |          556.22 |   0.00
bert-fp16                |    0 |   4 |    1 |          31.34 |   1.3% |  14.6% |         nan |          128.18 |   0.00
bert-fp32                |    0 |   4 |    1 |          30.09 |   1.3% |  13.8% |         nan |          122.98 |   0.00
bert-tf32                |    0 |   4 |    1 |          36.53 |   1.3% |  14.9% |         nan |          149.39 |   0.00
bert-tf32-fp16           |    0 |   4 |    1 |          31.35 |   1.4% |  15.0% |         nan |          128.22 |   1.00
reformer                 |    0 |   4 |    1 |          33.80 |   0.7% |  11.2% |         nan |          136.72 |   1.00
t5                       |    0 |   4 |    1 |          24.01 |   0.7% |  11.2% |         nan |           97.11 |   0.00
whisper                  |    0 |   4 |    1 |         121.26 |   0.9% |  13.4% |         nan |          490.46 |   0.00
lightning                |    0 |   4 |    1 |          27.17 |   0.5% |   9.7% |         nan |          109.42 |   0.00
lightning-gpus           |    0 |   1 |    4 |          78.32 |   0.5% |   4.8% |        2784 |           78.32 |   1.00
llava-single             |    4 |   4 |    1 |            nan |   nan% |   nan% |       10104 |             nan |   1.00
llama                    |    0 |   4 |    1 |         302.48 |   7.0% |  89.6% |       27202 |         1147.42 |   1.00
llm-full-mp-gpus         |    0 |   1 |    4 |          13.11 |   4.0% |  21.2% |       29132 |           13.11 |   1.00
llm-lora-ddp-gpus        |    0 |   1 |    4 |         551.36 |   1.5% |   7.8% |       10288 |          551.36 |   1.00
llm-lora-mp-gpus         |    0 |   1 |    4 |          26.10 |   3.8% |  20.3% |       13842 |           26.10 |   1.00
llm-lora-single          |    0 |   4 |    1 |        1057.08 |   1.5% |  16.8% |       17364 |         4236.05 |   1.00
pna                      |    4 |   4 |    1 |            nan |   nan% |   nan% |         nan |             nan |   1.00
ppo                      |    4 |   4 |    1 |    48745723.11 |   0.3% |  58.4% |         978 |            0.00 |   1.00
recursiongfn             |    4 |   4 |    1 |            nan |   nan% |   nan% |         nan |             nan |   1.00
rlhf-gpus                |    0 |   1 |    4 |         132.65 |   1.0% |   8.1% |        9444 |          132.65 |   0.00
rlhf-single              |    0 |   4 |    1 |          43.12 |   0.8% |  14.1% |        7704 |          172.71 |   1.00
focalnet                 |    0 |   4 |    1 |          17.30 |   0.8% |  13.0% |         nan |           69.94 |   0.00
torchatari               |    0 |   4 |    1 |         624.83 |   0.6% |   9.2% |        1012 |         2518.79 |   1.00
convnext_large-fp16      |    0 |   4 |    1 |          29.92 |   1.5% |  16.2% |         nan |          122.02 |   0.00
convnext_large-fp32      |    0 |   4 |    1 |          41.17 |   1.4% |  15.5% |         nan |          168.37 |   0.00
convnext_large-tf32      |    0 |   4 |    1 |          43.73 |   1.5% |  16.2% |         nan |          178.93 |   0.00
convnext_large-tf32-fp16 |    0 |   4 |    1 |          29.44 |   1.5% |  16.6% |         nan |          120.14 |   1.00
regnet_y_128gf           |    0 |   4 |    1 |          17.15 |   0.8% |  12.9% |        1176 |           69.15 |   1.00
resnet152-ddp-gpus       |    0 |   1 |    4 |          94.74 |   0.2% |   1.7% |        1946 |           94.74 |   0.00
resnet50                 |    0 |   4 |    1 |          80.56 |   0.8% |  12.4% |         nan |          326.02 |   1.00
resnet50-noio            |    0 |   4 |    1 |          86.47 |   0.1% |   5.5% |        1226 |          346.35 |   0.00
vjepa-gpus               |    0 |   1 |    4 |           8.59 |   0.4% |   3.2% |       23428 |            8.59 |   1.00
vjepa-single             |    0 |   4 |    1 |           3.94 |   0.7% |  11.4% |        5970 |           15.95 |   1.00

Scores
------
Failure rate:      14.79% (FAIL)
Score:              32.65

Errors
------
21 errors, details in HTML report.

@Delaunay (Collaborator, Author)

Breakdown
---------
bench                    | fail |   n | ngpu |           perf |   sem% |   std% | peak_memory |          score | weight
brax                     |    0 |   1 |    4 |     1027148.08 |   0.0% |   0.1% |        1312 |     1027148.08 |   1.00
diffusion-gpus           |    1 |   1 |    4 |            nan |   nan% |   nan% |         nan |            nan |   1.00
diffusion-single         |    4 |   4 |    1 |            nan |   nan% |   nan% |         nan |            nan |   0.00
dimenet                  |    0 |   4 |    1 |         540.40 |   0.8% |  11.8% |        2674 |        2186.39 |   1.00
dinov2-giant-gpus        |    1 |   1 |    4 |            nan |   nan% |   nan% |       24856 |            nan |   1.00
dinov2-giant-single      |    4 |   4 |    1 |            nan |   nan% |   nan% |        7066 |            nan |   0.00
dqn                      |    0 |   4 |    1 | 23531418779.61 |   1.6% |  90.1% |        1354 | 94008285414.56 |   0.00
bf16                     |    0 |   4 |    1 |         279.79 |   0.2% |   4.5% |        1278 |        1123.16 |   0.00
fp16                     |    0 |   4 |    1 |         276.15 |   0.1% |   2.7% |        1278 |        1106.40 |   0.00
fp32                     |    0 |   4 |    1 |          48.62 |   0.1% |   1.8% |        1656 |         194.45 |   0.00
tf32                     |    0 |   4 |    1 |         139.48 |   0.1% |   2.7% |        1656 |         558.38 |   0.00
bert-fp16                |    0 |   4 |    1 |         206.96 |   0.9% |  10.5% |         nan |         841.39 |   0.00
bert-fp32                |    0 |   4 |    1 |          71.61 |   0.5% |   5.1% |       20660 |         288.57 |   0.00
bert-tf32                |    0 |   4 |    1 |         119.40 |   0.7% |   7.2% |       20660 |         482.84 |   0.00
bert-tf32-fp16           |    0 |   4 |    1 |         206.83 |   0.9% |  10.4% |         nan |         840.69 |   1.00
reformer                 |    0 |   4 |    1 |          29.71 |   0.2% |   3.0% |       12940 |         119.23 |   1.00
t5                       |    0 |   4 |    1 |          30.79 |   0.3% |   4.6% |       33876 |         123.79 |   0.00
whisper                  |    0 |   4 |    1 |         425.00 |   0.5% |   8.3% |         nan |        1715.32 |   0.00
lightning                |    0 |   4 |    1 |         510.22 |   0.4% |   7.5% |       25808 |        2054.12 |   0.00
lightning-gpus           |    0 |   1 |    4 |        2014.40 |   0.0% |   0.4% |       26198 |        2014.40 |   1.00
llava-single             |    4 |   4 |    1 |            nan |   nan% |   nan% |       14280 |            nan |   1.00
llama                    |    0 |   4 |    1 |         295.91 |   6.9% |  87.7% |       27202 |        1119.81 |   1.00
llm-full-mp-gpus         |    0 |   1 |    4 |          31.34 |   3.5% |  18.3% |       25208 |          31.34 |   1.00
llm-lora-ddp-gpus        |    0 |   1 |    4 |        5247.18 |   0.4% |   2.1% |       32870 |        5247.18 |   1.00
llm-lora-mp-gpus         |    0 |   1 |    4 |         353.28 |   2.0% |  10.7% |       19166 |         353.28 |   1.00
llm-lora-single          |    0 |   4 |    1 |        2342.46 |   0.1% |   0.9% |       31112 |        9368.48 |   1.00
pna                      |    0 |   4 |    1 |        4386.70 |   0.5% |   7.5% |       39214 |       17557.84 |   1.00
ppo                      |    0 |   4 |    1 |    60533209.59 |   0.7% |  57.9% |         978 |   242133098.76 |   1.00
recursiongfn             |    0 |   4 |    1 |       12721.01 |   1.4% |  21.5% |        8154 |       51218.08 |   1.00
rlhf-gpus                |    0 |   1 |    4 |        6411.79 |   0.3% |   2.3% |       20398 |        6411.79 |   0.00
rlhf-single              |    0 |   4 |    1 |        1828.87 |   0.2% |   3.1% |       19128 |        7323.80 |   1.00
focalnet                 |    0 |   4 |    1 |         366.11 |   0.7% |  10.6% |       23038 |        1481.03 |   0.00
torchatari               |    0 |   4 |    1 |        7487.28 |   0.6% |   9.6% |        3264 |       29872.98 |   1.00
convnext_large-fp16      |    0 |   4 |    1 |         276.36 |   1.1% |  12.2% |         nan |        1126.43 |   0.00
convnext_large-fp32      |    0 |   4 |    1 |          70.17 |   0.6% |   7.0% |       44910 |         283.70 |   0.00
convnext_large-tf32      |    0 |   4 |    1 |         119.13 |   1.0% |  11.2% |       45502 |         483.66 |   0.00
convnext_large-tf32-fp16 |    0 |   4 |    1 |         276.51 |   1.1% |  12.1% |         nan |        1126.96 |   1.00
regnet_y_128gf           |    0 |   4 |    1 |          90.97 |   0.4% |   6.8% |       28810 |         366.50 |   1.00
resnet152-ddp-gpus       |    0 |   1 |    4 |        2018.87 |   0.2% |   1.8% |       25994 |        2018.87 |   0.00
resnet50                 |    0 |   4 |    1 |         904.41 |   0.7% |  10.4% |       13868 |        3658.59 |   1.00
resnet50-noio            |    0 |   4 |    1 |         861.34 |   0.0% |   1.6% |       26884 |        3447.39 |   0.00
vjepa-gpus               |    1 |   1 |    4 |            nan |   nan% |   nan% |       45900 |            nan |   1.00
vjepa-single             |    4 |   4 |    1 |            nan |   nan% |   nan% |        2910 |            nan |   1.00
