New instance ignores aws-neuron/optimum-neuron-cache and still compiles model #271

Closed
kct22aws opened this issue Oct 24, 2023 · 6 comments

@kct22aws
By using CUSTOM_CACHE_REPO="aws-neuron/optimum-neuron-cache" torchrun ... as in the image classification example, I can push the model to the repo. However, on another instance, when I run the same command with the same training script, it still proceeds to compile the model. It seems the Trainer is not looking at the cache repo.
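For reference, the command looks roughly like the following (the script name and arguments are taken from the optimum-neuron image classification example and may differ from the exact run):

CUSTOM_CACHE_REPO="aws-neuron/optimum-neuron-cache" torchrun --nproc_per_node=2 run_image_classification.py \
    --model_name_or_path google/vit-base-patch16-224-in21k \
    --dataset_name beans \
    --do_train \
    --per_device_train_batch_size 16 \
    --output_dir ./vit-output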

@michaelbenayoun
Member

Alright, as a quick fix: can you try setting the cache repo with the CLI:

optimum-cli neuron cache set aws-neuron/optimum-neuron-cache

Also, are you logged in? If you're not logged in or do not have write access to this repo, it will not push anything.
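For example, you can check and fix the login state with the standard Hugging Face Hub CLI:

huggingface-cli whoami   # shows which account you are logged in as (errors if you are not logged in)
huggingface-cli login    # paste a token that has write access to the cache repo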

@5cp
Contributor

5cp commented Oct 31, 2023

I was able to reproduce the issue with optimum-neuron 0.0.12 and Neuron 2.14.

The optimum-neuron ViT training job finds and attempts to download the files from the hub cache, but by that point two graphs have already triggered JIT compilation:

[INFO|trainer.py:1760] 2023-10-31 00:25:19,639 >> ***** Running training *****
[INFO|trainer.py:1761] 2023-10-31 00:25:19,639 >>   Num examples = 75,760
[INFO|trainer.py:1762] 2023-10-31 00:25:19,639 >>   Num Epochs = 1
[INFO|trainer.py:1763] 2023-10-31 00:25:19,639 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:1766] 2023-10-31 00:25:19,639 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1767] 2023-10-31 00:25:19,639 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1768] 2023-10-31 00:25:19,639 >>   Total optimization steps = 4,735
[INFO|trainer.py:1769] 2023-10-31 00:25:19,639 >>   Number of trainable parameters = 85,876,325
  0%|          | 0/4735 [00:00<?, ?it/s]
2023-10-31 00:25:19.000786:  45883  INFO ||NEURON_CACHE||: Compile cache path: /tmp/tmp7428jxhm/neuron-compile-cache
2023-10-31 00:25:19.000788:  45883  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/no-user/neuroncc_compile_workdir/ec3ef836-a1f0-422b-9086-390f2231f9da/model.MODULE_8373890936584923875+d41d8cd9.hlo.pb', '--output', '/tmp/no-user/neuroncc_compile_workdir/ec3ef836-a1f0-422b-9086-390f2231f9da/model.MODULE_8373890936584923875+d41d8cd9.neff', '--verbose=35']
.
Compiler status PASS
2023-10-31 00:25:22.000323:  46236  INFO ||NEURON_CACHE||: Compile cache path: /tmp/tmp7428jxhm/neuron-compile-cache
2023-10-31 00:25:22.000325:  46236  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/no-user/neuroncc_compile_workdir/f6725b48-f33a-4542-9972-83ac6cb62245/model.MODULE_6998451530702978255+d41d8cd9.hlo.pb', '--output', '/tmp/no-user/neuroncc_compile_workdir/f6725b48-f33a-4542-9972-83ac6cb62245/model.MODULE_6998451530702978255+d41d8cd9.neff', '--verbose=35']
.
No Neuron cache name is saved locally. This means that only the official Neuron cache, and potentially a cache defined in $CUSTOM_CACHE_REPO will be used. You can create a Neuron cache repo by running the following command: `optimum-cli neuron cache create`. If the Neuron cache already exists you can set it by running the following command: `optimum-cli neuron cache set -n [name]`.
You do not have write access to aws-neuron/optimum-neuron-cache so you will not be able to push any cached compilation files. Please log in and/or use a custom Neuron cache.

Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

MODULE_8373890936584923875+d41d8cd9 and MODULE_6998451530702978255+d41d8cd9 are part of the cached artifacts in the hub cache.

It seems like the fetch isn't happening early enough in the training job, and it's leading to unwanted compilations.
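As a sanity check, one way to confirm those graph hashes are indeed present in the hub cache is to list the repo files directly. This is a minimal sketch using huggingface_hub (the exact directory layout inside the cache repo may vary):

# Sketch: verify that the JIT-compiled module hashes already exist in the hub cache repo.
from huggingface_hub import list_repo_files

repo_id = "aws-neuron/optimum-neuron-cache"
hashes = ["MODULE_8373890936584923875+d41d8cd9", "MODULE_6998451530702978255+d41d8cd9"]

files = list_repo_files(repo_id)  # public repo, read access needs no token
for h in hashes:
    matches = [f for f in files if h in f]
    print(h, "->", matches or "not found in hub cache")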

@5cp
Contributor

5cp commented Oct 31, 2023

The issue appears to be related to the xm.rendezvous call that gates the workers. The rendezvous isn't taking place because wait_for_everyone_on_fetch=False, which allows the non-rank-0 workers to proceed before the cached files are fetched (resulting in JIT compilation).

Setting wait_for_everyone_on_fetch=True seems to resolve the issue.
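For illustration, here is a minimal sketch of the barrier pattern involved; the function and flag names mirror the discussion but are simplified, not the actual optimum-neuron code:

# Sketch of gating workers on the cache fetch with an XLA rendezvous.
import torch_xla.core.xla_model as xm

def fetch_hub_cached_artifacts(fetch_fn, wait_for_everyone_on_fetch=True):
    # Only the master ordinal downloads the precompiled graphs from the hub cache.
    if xm.is_master_ordinal():
        fetch_fn()
    # Without this barrier (wait_for_everyone_on_fetch=False), the other workers
    # start tracing and compiling before the cached NEFFs are on disk, triggering
    # JIT compilation even though matching artifacts exist in the hub cache.
    if wait_for_everyone_on_fetch:
        xm.rendezvous("wait_for_cache_fetch")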

@kct22aws
Author

This issue is resolved by the fix in #280.

@dacorvo
Collaborator

dacorvo commented Nov 13, 2023

Feel free to close the issue if the problem is solved.

@michaelbenayoun
Member

Closing the issue, feel free to re-open if the issue persists.
