New instance ignores aws-neuron/optimum-neuron-cache and still compiles model #271
Comments
Alright, as a quick fix: can you try setting the cache repo with the CLI:
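The command that followed this sentence was not captured in the page text. A likely form, assuming an optimum-neuron release that ships the `optimum-cli neuron cache` subcommands (otherwise the `CUSTOM_CACHE_REPO` environment variable shown later in the thread serves the same purpose):

```bash
# Point optimum-neuron at the public cache repo so compiled graphs are
# looked up there instead of being recompiled locally.
optimum-cli neuron cache set aws-neuron/optimum-neuron-cache
```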
Also, are you logged in? If you're not logged in or don't have write rights on this repo, nothing will be pushed.
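A quick way to check the login/permission point above (standard Hugging Face Hub CLI commands, not specific to this issue):

```bash
# Shows which account the stored token belongs to; errors out if not logged in.
huggingface-cli whoami
# (Re-)authenticate with a token that has write access to the cache repo.
huggingface-cli login
```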
I was able to reproduce the issue with optimum-neuron 0.0.12 and Neuron 2.14. The optimum-neuron ViT training job finds and attempts to download the files from the hub cache, but by that point two graphs have already triggered JIT compilation:
MODULE_8373890936584923875+d41d8cd9 and MODULE_6998451530702978255+d41d8cd9 are part of the cached artifacts in the hub cache. It seems the fetch isn't happening early enough in the training job, which leads to unwanted compilations.
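One way to see which graphs were compiled locally rather than fetched, assuming the default Neuron persistent cache location (an assumption; the path can be configured differently):

```bash
# MODULE_* entries that appear here during the run were produced by
# just-in-time compilation on this instance.
ls /var/tmp/neuron-compile-cache
```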
The issue appears to be related to this xm.rendezvous, which gates the workers. The rendezvous isn't taking place because wait_for_everyone_on_fetch=False, so the non-rank-0 workers proceed before the cached files are fetched, resulting in JIT compilation. Setting wait_for_everyone_on_fetch=True seems to resolve the issue.
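A minimal sketch of the synchronization pattern being described, not the actual optimum-neuron implementation; `fetch_precompiled_graphs` is a hypothetical placeholder for the hub download step:

```python
import torch_xla.core.xla_model as xm


def fetch_cached_artifacts(wait_for_everyone_on_fetch: bool = True):
    # Only the main process downloads the cached graphs from the hub.
    if xm.is_master_ordinal(local=False):
        fetch_precompiled_graphs()  # hypothetical helper, not a real API

    if wait_for_everyone_on_fetch:
        # Barrier: every worker blocks here until all workers arrive, so the
        # training loop only starts once the cached graphs are on disk.
        xm.rendezvous("wait_for_cached_artifacts")
    # With wait_for_everyone_on_fetch=False, non-rank-0 workers skip the
    # barrier, trace the model before the fetch completes, and trigger the
    # JIT compilation seen in this issue.
```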
This issue is resolved by the fix in #280.
Feel free to close the issue if the problem is solved.
Closing the issue, feel free to re-open if the issue persists.
By using
CUSTOM_CACHE_REPO="aws-neuron/optimum-neuron-cache" torchrun.....
as in the image classification example, I can push the model to the repo. However, on another instance, running the same command with the same training script still compiles the model. It seems the Trainer is ignoring the cache repo.
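For reference, a hypothetical full invocation in the spirit of the image-classification example (the script name, process count, and arguments are illustrative, not taken from the original report):

```bash
# CUSTOM_CACHE_REPO tells optimum-neuron which hub repo to use for fetching
# and pushing precompiled graphs.
CUSTOM_CACHE_REPO="aws-neuron/optimum-neuron-cache" torchrun --nproc_per_node=2 \
    run_image_classification.py \
    --model_name_or_path google/vit-base-patch16-224-in21k \
    --dataset_name beans \
    --do_train \
    --output_dir ./vit-output
```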