[Feature] Multinode docker container #2817
Comments
New Dockerfile with Full Multi-Node Support
This Dockerfile is based on the image from the NVIDIA NGC catalog, https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver, which includes multi-node support. The only changes made to the original Dockerfile are replacing the base version and installing the latest version of SGLang.
I’ll help take a look. Thanks.
Checklist
Motivation
InfiniBand is not being fully utilized during multi-node deployment of DeepSeek v3. On investigation, I found that the current base Docker image is https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda-dl-base, whose description explicitly states that it does not support multi-node configurations.
I attempted to switch to an alternative base image, https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch, but so far, I have not been successful in resolving the issue. Once I achieve a working solution, I will share the corresponding Dockerfile.
In the meantime, I would like to inquire if you are aware of a suitable base image that could replace the current one to ensure proper support for multi-node inference.
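For anyone comparing base images, a quick check inside the container can help separate "the image lacks the InfiniBand userspace stack" from "NCCL is configured not to use it". The following is only a minimal sketch, not part of SGLang: the helper names `list_ib_devices` and `nccl_ib_settings` are made up for illustration, and it assumes a Linux container where RDMA devices, if exposed, appear under `/sys/class/infiniband`. The environment variables it reads (`NCCL_IB_DISABLE`, `NCCL_IB_HCA`, `NCCL_SOCKET_IFNAME`, `NCCL_DEBUG`) are standard NCCL settings.

```python
import os
from pathlib import Path


def list_ib_devices(sysfs_root="/sys/class/infiniband"):
    """Return the names of RDMA devices visible under sysfs_root.

    Hypothetical helper: an empty list means the container cannot see any
    InfiniBand device at this path (or the path does not exist).
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir())


def nccl_ib_settings(env=None):
    """Collect the NCCL variables that commonly affect InfiniBand usage.

    Hypothetical helper: unset variables are reported as None.
    """
    env = os.environ if env is None else env
    keys = ("NCCL_IB_DISABLE", "NCCL_IB_HCA", "NCCL_SOCKET_IFNAME", "NCCL_DEBUG")
    return {key: env.get(key) for key in keys}


if __name__ == "__main__":
    devices = list_ib_devices()
    print("Visible RDMA devices:", devices if devices else "none")
    for key, value in nccl_ib_settings().items():
        print(f"{key}={value}")
```

Running this in each candidate image on the same host gives a like-for-like comparison. If no devices are listed, the cause is more likely how the container is started (device and network passthrough) than the base image itself.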
Related resources
No response