BUG: memory_stats() is not supported on TPU pods, causing inference on a TPU pod to throw an error #181
Comments
Thanks for reporting the bug. I'll make a commit and check the inference ASAP.
Hey @salrowili, can you confirm that the code now works just fine?
I have tested the recent update on TPUv4-8. Following this commit, 1caafc3, there is a new bug:
If I roll back to the previous commit, 1045c25, the inference code works fine. I have also noticed something in the inference example EasyDeL/tests/vinference_test.py, line 32 in ba6fddf:
With FSDP sharding (1, 1, 1, -1), accuracy drops to 31 and the run takes more than a minute and a half. Feel free to post this example in the docs folder, because I have noticed many requests for more tutorials on inference and evaluation scripts (:
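For readers unfamiliar with these sharding tuples, here is a minimal sketch in plain JAX (not EasyDeL's own API) of how a 4-axis shape such as (1, 1, 1, -1) could be turned into a device mesh; the axis names below are illustrative assumptions, and -1 stands for "all remaining devices":

```python
# Minimal sketch (plain JAX, not the EasyDeL API): map a 4-axis sharding tuple
# like (1, 1, 1, -1) onto a device mesh. Axis names are illustrative assumptions.
import numpy as np
import jax
from jax.experimental import mesh_utils
from jax.sharding import Mesh

def make_mesh(shape, axis_names=("dp", "fsdp", "tp", "sp")):
    shape = list(shape)
    if -1 in shape:
        # Resolve a single -1 entry to whatever device count is left over.
        known = int(np.prod([s for s in shape if s != -1]))
        shape[shape.index(-1)] = jax.device_count() // known
    devices = mesh_utils.create_device_mesh(tuple(shape))
    return Mesh(devices, axis_names)

# Example: make_mesh((1, 1, 1, -1)) places every device on the last mesh axis,
# while (1, 1, -1, 1) places them all on the third axis instead.
```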
Thank you! I appreciate the detailed insights. At the moment, the HEAD commit (b9c4bd2) is working fine, so the bug seems to be resolved. Regarding sharding methods, I typically prefer Sequence Sharding over Tensor Parallelism, as it tends to perform better on GPUs. Your example script and the explanation of the impact of sharding configurations are incredibly helpful, especially for optimizing inference. I might include this in the documentation as a reference for others.
@salrowili Did you ever run the inference server on a multi-node TPU pod like v4-64?
Hi @creatorrr, I have managed to run inference on TPUv4-32 using the (1, 1, 4, 4) sharding method with version 0.0.80, but the speed was not significantly higher than on TPUv4-8. For SFT training, the best setting for me on TPUv4-32 was (4, 1, 1, 4). However, it was still convenient to run inference on TPUv4-32 even though the speed was not better, because at least I can fine-tune and evaluate the model on the same TPU pod machine without creating a separate TPUv4-8 machine for inference. I also suggest comparing the results of TPUv4-32 inference against TPUv4-8 with the (1, 1, -1, 1) sharding method, because a different sharding setting (e.g., (1, 1, 4, 4)) can yield poor performance. TPUv4-8 with (1, 1, -1, 1) gives me almost identical performance to GPU inference with the TRL repo. If (1, 1, 4, 8), (1, 1, 8, 4), or (1, 4, 4, 4) does not work for you on TPUv4-64, your best bet is to change the topology of the TPU pod itself when you create it, via the --topology=... flag. See https://cloud.google.com/tpu/docs/v4. If your intention in running inference on TPUv4-64 is to serve large models, just to let you know, I have managed to run a 70B model on TPUv4-8 with the A8BIT flag on version 0.0.80 using this script:
Do not forget to remove the memory_stats() call on every machine in the pod:
You can execute this command on all workers in the pod:
I also suggest to @erfanzar that we create a new page in the docs folder that discusses and lists the best practices for each TPU setting and compares those settings in terms of speed and performance for fine-tuning and inference tasks.
@salrowili Absolutely, that's a great suggestion! I'd be happy to collaborate on documenting TPU best practices. Would you be open to discussing this further over Discord? We could combine our experiences with different sharding approaches, including custom configurations, to create a comprehensive guide. |
Great idea @erfanzar, I am in. However, I suggest that we open a topic for discussion here, addressing only best practices for TPU configurations. My idea is that others will learn more from our discussions, from the point when we identify an issue until we find the solution. If we just presented the best practices to developers and researchers, they might not be convinced of the motivation behind them. So a discussion topic on best practices here, plus reference documentation in the docs folder, would serve best.
Hi,
There is a bug in the inference code: memory_stats() is only supported on single-host TPU hardware. memory_stats() is part of the inference metrics; to work around the issue, add # before this line (i.e., comment it out) and apply the change on every host in the pod.
EasyDeL/easydel/inference/vinference/metrics.py
Line 118 in 9e95b25
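A minimal sketch of the idea behind the workaround, assuming one wants to skip memory metrics on multi-host pods rather than delete the line outright (the helper name is hypothetical, not EasyDeL's code):

```python
# Hypothetical guard around the metrics call: memory_stats() is only supported
# on single-host TPUs, so skip it when running on a multi-host pod.
import jax

def safe_memory_stats():
    if jax.process_count() > 1:
        # Multi-host TPU pod: per-device memory_stats() is not supported here.
        return None
    try:
        return jax.local_devices()[0].memory_stats()
    except Exception:
        # Some backends raise instead of returning None.
        return None
```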