
[Feature] Multinode docker container #2817

Open
EgorovMike219 opened this issue Jan 9, 2025 · 2 comments

@EgorovMike219
Motivation

I am encountering an issue where InfiniBand is not fully utilized during multi-node deployment of DeepSeek v3. Upon investigation, I found that the current base Docker image, https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda-dl-base, explicitly states in its description that it does not support multi-node configurations.

I attempted to switch to an alternative base image, https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch, but so far, I have not been successful in resolving the issue. Once I achieve a working solution, I will share the corresponding Dockerfile.

In the meantime, I would like to inquire if you are aware of a suitable base image that could replace the current one to ensure proper support for multi-node inference.


@EgorovMike219 (Author)

New Dockerfile with Full Multi-Node Support

```dockerfile
ARG CUDA_VERSION=12.4.1

# Triton Inference Server base image, whose "min" variant includes the
# networking stack needed for multi-node communication.
FROM nvcr.io/nvidia/tritonserver:24.04-py3-min

ARG BUILD_TYPE=all
ENV DEBIAN_FRONTEND=noninteractive

# Install Python 3.10 from the deadsnakes PPA, pip, and basic tooling.
RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
    && echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \
    && apt update -y \
    && apt install software-properties-common -y \
    && add-apt-repository ppa:deadsnakes/ppa -y && apt update \
    && apt install python3.10 python3.10-dev -y \
    && update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1 \
    && update-alternatives --set python3 /usr/bin/python3.10 && apt install python3.10-distutils -y \
    && apt install curl git sudo libibverbs-dev -y \
    && curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py \
    && python3 --version \
    && python3 -m pip --version \
    && rm -rf /var/lib/apt/lists/* \
    && apt clean

# For openbmb/MiniCPM models
RUN pip3 install datamodel_code_generator

WORKDIR /sgl-workspace

# Re-declare CUDA_VERSION after FROM so it is visible in this build stage.
ARG CUDA_VERSION

# Install a torch wheel matching the CUDA version, then install SGLang from
# source with the matching flashinfer wheel index.
RUN python3 -m pip install --upgrade pip setuptools wheel html5lib six \
    && git clone -b v0.4.1.post4 https://github.com/sgl-project/sglang.git \
    && if [ "$CUDA_VERSION" = "12.1.1" ]; then \
         python3 -m pip install torch --index-url https://download.pytorch.org/whl/cu121; \
       elif [ "$CUDA_VERSION" = "12.4.1" ]; then \
         python3 -m pip install torch --index-url https://download.pytorch.org/whl/cu124; \
       elif [ "$CUDA_VERSION" = "11.8.0" ]; then \
         python3 -m pip install torch --index-url https://download.pytorch.org/whl/cu118; \
       else \
         echo "Unsupported CUDA version: $CUDA_VERSION" && exit 1; \
       fi \
    && cd sglang \
    && if [ "$BUILD_TYPE" = "srt" ]; then \
         if [ "$CUDA_VERSION" = "12.1.1" ]; then \
           python3 -m pip --no-cache-dir install -e "python[srt]" --find-links https://flashinfer.ai/whl/cu121/torch2.4/flashinfer/; \
         elif [ "$CUDA_VERSION" = "12.4.1" ]; then \
           python3 -m pip --no-cache-dir install -e "python[srt]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/; \
         elif [ "$CUDA_VERSION" = "11.8.0" ]; then \
           python3 -m pip --no-cache-dir install -e "python[srt]" --find-links https://flashinfer.ai/whl/cu118/torch2.4/flashinfer/; \
         else \
           echo "Unsupported CUDA version: $CUDA_VERSION" && exit 1; \
         fi; \
       else \
         if [ "$CUDA_VERSION" = "12.1.1" ]; then \
           python3 -m pip --no-cache-dir install -e "python[all]" --find-links https://flashinfer.ai/whl/cu121/torch2.4/flashinfer/; \
         elif [ "$CUDA_VERSION" = "12.4.1" ]; then \
           python3 -m pip --no-cache-dir install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/; \
         elif [ "$CUDA_VERSION" = "11.8.0" ]; then \
           python3 -m pip --no-cache-dir install -e "python[all]" --find-links https://flashinfer.ai/whl/cu118/torch2.4/flashinfer/; \
         else \
           echo "Unsupported CUDA version: $CUDA_VERSION" && exit 1; \
         fi; \
       fi

ENV DEBIAN_FRONTEND=interactive
```

This Dockerfile is based on the image from the NVIDIA NGC catalog, https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver, which includes multi-node support. The only changes made to the original Dockerfile are replacing the base image and installing the latest version of SGLang.
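For reference, a build-and-run invocation might look like the following. This is a sketch, not something the issue confirms: the image tag `sglang-multinode` is an arbitrary example, and the InfiniBand device mapping, host networking, and raised memlock limit are the flags commonly needed for RDMA in containers; exact requirements depend on your cluster setup.

```shell
# Build the image; CUDA_VERSION must be one of the versions the
# Dockerfile handles (12.1.1 / 12.4.1 / 11.8.0).
docker build --build-arg CUDA_VERSION=12.4.1 --build-arg BUILD_TYPE=all \
    -t sglang-multinode .

# Run with all GPUs, host networking, and the InfiniBand character
# devices exposed; memlock is commonly unlimited for RDMA-pinned memory.
docker run --gpus all --network host \
    --device=/dev/infiniband \
    --ulimit memlock=-1 \
    -it sglang-multinode bash
```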

@zhyncs (Member)

zhyncs commented Jan 10, 2025

I’ll help take a look. Thanks.
