New instance ignores aws-neuron/optimum-neuron-cache and still compiles model #271

Closed
kct22aws opened this issue Oct 24, 2023 · 6 comments

@kct22aws
By using CUSTOM_CACHE_REPO="aws-neuron/optimum-neuron-cache" torchrun ... as in the image classification example, I can push the model to the repo. However, on another instance, when I run the same command with the same training script, it still proceeds to compile the model. It seems the Trainer is not looking at the cache repo.
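For reference, the command looks roughly like the following (the script name and arguments are taken from the optimum-neuron image classification example and may differ from the exact run):

CUSTOM_CACHE_REPO="aws-neuron/optimum-neuron-cache" torchrun --nproc_per_node=2 run_image_classification.py \
    --model_name_or_path google/vit-base-patch16-224-in21k \
    --dataset_name beans \
    --do_train \
    --per_device_train_batch_size 16 \
    --output_dir ./vit-output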

@michaelbenayoun
Member

Alright, as a quick fix: can you try setting the cache repo with the CLI:

optimum-cli neuron cache set aws-neuron/optimum-neuron-cache

Also, are you logged in? If you're not logged in or do not have write access to this repo, it will not push anything.
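For example, you can check and fix the login state with the standard Hugging Face Hub CLI:

huggingface-cli whoami   # shows which account you are logged in as (errors if you are not logged in)
huggingface-cli login    # paste a token that has write access to the cache repo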

@5cp
Contributor

5cp commented Oct 31, 2023

I was able to reproduce the issue with optimum-neuron 0.0.12 and Neuron 2.14.

The optimum-neuron ViT training job finds and attempts to download the files from the hub cache, but by that point two graphs have already triggered JIT compilation:

[INFO|trainer.py:1760] 2023-10-31 00:25:19,639 >> ***** Running training *****
[INFO|trainer.py:1761] 2023-10-31 00:25:19,639 >>   Num examples = 75,760
[INFO|trainer.py:1762] 2023-10-31 00:25:19,639 >>   Num Epochs = 1
[INFO|trainer.py:1763] 2023-10-31 00:25:19,639 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:1766] 2023-10-31 00:25:19,639 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1767] 2023-10-31 00:25:19,639 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1768] 2023-10-31 00:25:19,639 >>   Total optimization steps = 4,735
[INFO|trainer.py:1769] 2023-10-31 00:25:19,639 >>   Number of trainable parameters = 85,876,325
  0%|          | 0/4735 [00:00<?, ?it/s]
2023-10-31 00:25:19.000786:  45883  INFO ||NEURON_CACHE||: Compile cache path: /tmp/tmp7428jxhm/neuron-compile-cache
2023-10-31 00:25:19.000788:  45883  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/no-user/neuroncc_compile_workdir/ec3ef836-a1f0-422b-9086-390f2231f9da/model.MODULE_8373890936584923875+d41d8cd9.hlo.pb', '--output', '/tmp/no-user/neuroncc_compile_workdir/ec3ef836-a1f0-422b-9086-390f2231f9da/model.MODULE_8373890936584923875+d41d8cd9.neff', '--verbose=35']
.
Compiler status PASS
2023-10-31 00:25:22.000323:  46236  INFO ||NEURON_CACHE||: Compile cache path: /tmp/tmp7428jxhm/neuron-compile-cache
2023-10-31 00:25:22.000325:  46236  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/no-user/neuroncc_compile_workdir/f6725b48-f33a-4542-9972-83ac6cb62245/model.MODULE_6998451530702978255+d41d8cd9.hlo.pb', '--output', '/tmp/no-user/neuroncc_compile_workdir/f6725b48-f33a-4542-9972-83ac6cb62245/model.MODULE_6998451530702978255+d41d8cd9.neff', '--verbose=35']
.
No Neuron cache name is saved locally. This means that only the official Neuron cache, and potentially a cache defined in $CUSTOM_CACHE_REPO will be used. You can create a Neuron cache repo by running the following command: `optimum-cli neuron cache create`. If the Neuron cache already exists you can set it by running the following command: `optimum-cli neuron cache set -n [name]`.
You do not have write access to aws-neuron/optimum-neuron-cache so you will not be able to push any cached compilation files. Please log in and/or use a custom Neuron cache.

Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

MODULE_8373890936584923875+d41d8cd9 and MODULE_6998451530702978255+d41d8cd9 are part of the cached artifacts in the hub cache.

It seems like the fetch isn't happening early enough in the training job, and it's leading to unwanted compilations.
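As a sanity check, one way to confirm those graph hashes are indeed present in the hub cache is to list the repo files directly. This is a minimal sketch using huggingface_hub (the exact directory layout inside the cache repo may vary):

# Sketch: verify that the JIT-compiled module hashes already exist in the hub cache repo.
from huggingface_hub import list_repo_files

repo_id = "aws-neuron/optimum-neuron-cache"
hashes = ["MODULE_8373890936584923875+d41d8cd9", "MODULE_6998451530702978255+d41d8cd9"]

files = list_repo_files(repo_id)  # public repo, read access needs no token
for h in hashes:
    matches = [f for f in files if h in f]
    print(h, "->", matches or "not found in hub cache")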

@5cp
Contributor

5cp commented Oct 31, 2023

The issue appears to be related to the xm.rendezvous call that gates the workers. The rendezvous isn't taking place because wait_for_everyone_on_fetch=False, which allows the non-rank-0 workers to proceed before the cached files are fetched (resulting in JIT compilation).

Setting wait_for_everyone_on_fetch=True seems to resolve the issue.
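For illustration, here is a minimal sketch of the barrier pattern involved; the function and flag names mirror the discussion but are simplified, not the actual optimum-neuron code:

# Sketch of gating workers on the cache fetch with an XLA rendezvous.
import torch_xla.core.xla_model as xm

def fetch_hub_cached_artifacts(fetch_fn, wait_for_everyone_on_fetch=True):
    # Only the master ordinal downloads the precompiled graphs from the hub cache.
    if xm.is_master_ordinal():
        fetch_fn()
    # Without this barrier (wait_for_everyone_on_fetch=False), the other workers
    # start tracing and compiling before the cached NEFFs are on disk, triggering
    # JIT compilation even though matching artifacts exist in the hub cache.
    if wait_for_everyone_on_fetch:
        xm.rendezvous("wait_for_cache_fetch")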

@kct22aws
Author

This issue is resolved by the fix in #280.

@dacorvo
Collaborator

dacorvo commented Nov 13, 2023

Feel free to close the issue if the problem is solved.

@michaelbenayoun
Member

Closing the issue, feel free to re-open if the issue persists.
