Add containers/tgi/gpu/2.3.1/Dockerfile (#96)
* Add `containers/tgi/gpu/2.3.0/Dockerfile` starting image

* Update `containers/tgi/gpu/2.3.0/Dockerfile`

- Clones TGI from the latest commit on `main` and uses the cloned
repository as the reference for the files required during the build
(until TGI 2.3.0 is released)
- Removes the `GIT_SHA` and `DOCKER_LABEL` build args, as they are
neither required nor used
- Decreases the `MAX_JOBS` arg from 8 to 4 (can be tweaked depending
on the instance used to build the image)
- Sets the `HUGGINGFACE_HUB_CACHE` environment variable to `/tmp`
instead of the default `/data`, and sets `PORT` to 8080 instead of
the default 80, since 8080 is the default open port on GCP instances
- Removes the AWS SageMaker steps in favour of a custom final step
that installs `google-cloud-sdk` and runs a custom `entrypoint.sh`
handling the download of the artifacts provided via
`AIP_STORAGE_URI` (when running on Vertex AI); see the Vertex AI
sketch after this change list

* Add `containers/tgi/gpu/2.3.0/entrypoint.sh`

* Downgrade to previous commit

As the previous commit has been tested more thoroughly and is confirmed
to support Gemma 2 and DataGemma models

* Rename `2.3.0` to `sha-2788d41`

* Rename and update to TGI 2.3.0

* Update `Dockerfile` to be aligned with TGI 2.3.0

The main update w.r.t. the previous `Dockerfile` within this repository
is that Python 3.11 is now used instead of Python 3.10; the SHA this
`Dockerfile` was pinned to before the release still used Python 3.10

* Update `README.md` references to TGI (2-2 -> 2-3)

* Add missing `package` in `doc-pr-upload.yml`

* Use URLs instead of relative paths in `available.mdx`

* Update title to `Available DLCs on Google Cloud`

* Update references to TGI 2.3.0

- Use `py311` tag instead of `py310` as now uses Python 3.11
- Use `cu124` tag instead of `cu121` as now uses CUDA 12.4

* Delete 2.3.0 container in favour of 2.3.1 (WIP)

* Add `containers/tgi/gpu/2.3.1/Dockerfile`

* Add `containers/tgi/gpu/2.3.1/entrypoint.sh`

* Update `containers/tgi/gpu/2.3.1/Dockerfile`

* Update references to latest TGI container

- Updated `containers/tgi/README.md`
- Updated `docs/source/containers/available.mdx`
- Updated `README.md`
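
The Vertex AI flow that the custom `entrypoint.sh` enables can be sketched as follows; this is not part of the commit, and the project, region, bucket path, and display name are placeholders. Uploading the model with `--artifact-uri` is what makes Vertex AI provide the artifacts to the container via the `AIP_STORAGE_URI` environment variable.

```bash
# Sketch: register the DLC as a custom prediction container on Vertex AI.
# PROJECT_ID, LOCATION and the gs:// path are placeholders.
gcloud ai models upload \
  --project=PROJECT_ID \
  --region=LOCATION \
  --display-name=tgi-model \
  --container-image-uri=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311 \
  --artifact-uri=gs://my-bucket/path/to/model \
  --container-ports=8080
```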
alvarobartt authored Oct 7, 2024
1 parent 7c0716e commit 9ae5e7d
Showing 6 changed files with 342 additions and 10 deletions.
1 change: 1 addition & 0 deletions .github/workflows/doc-pr-upload.yml
@@ -10,6 +10,7 @@ jobs:
build:
uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
with:
package: Google-Cloud-Containers
package_name: google-cloud
secrets:
hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
2 changes: 1 addition & 1 deletion README.md
@@ -26,7 +26,7 @@ The [Google-Cloud-Containers](https://github.com/huggingface/Google-Cloud-Contai

| Container URI | Path | Framework | Type | Accelerator |
| --------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | --------- | --------- | ----------- |
| us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310 | [text-generation-inference-gpu.2.2.0](./containers/tgi/gpu/2.2.0/Dockerfile) | TGI | Inference | GPU |
| us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311 | [text-generation-inference-gpu.2.3.1](./containers/tgi/gpu/2.3.1/Dockerfile) | TGI | Inference | GPU |
| us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cu122.1-4.ubuntu2204 | [text-embeddings-inference-gpu.1.4.0](./containers/tei/gpu/1.4.0/Dockerfile) | TEI | Inference | GPU |
| us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-embeddings-inference-cpu.1-4 | [text-embeddings-inference-cpu.1.4.0](./containers/tei/cpu/1.4.0/Dockerfile) | TEI | Inference | CPU |
| us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-cu121.2-3.transformers.4-42.ubuntu2204.py310 | [huggingface-pytorch-training-gpu.2.3.0.transformers.4.42.3.py310](./containers/pytorch/training/gpu/2.3.0/transformers/4.42.3/py310/Dockerfile) | PyTorch | Training | GPU |
4 changes: 2 additions & 2 deletions containers/tgi/README.md
@@ -57,7 +57,7 @@ docker run --gpus all -ti --shm-size 1g -p 8080:8080 \
-e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
-e MAX_INPUT_LENGTH=4000 \
-e MAX_TOTAL_TOKENS=4096 \
us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310
us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
```

### Test
@@ -111,5 +111,5 @@ curl 0.0.0.0:8080/generate \
In order to build the TGI Docker container, you will need an instance with at least 4 NVIDIA GPUs with at least 24 GiB of VRAM each, since TGI needs to build and compile the kernels required for the optimized inference. Also note that the build process may take ~30 minutes to complete, depending on the instance's specifications.

```bash
docker build -t us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310 -f containers/tgi/gpu/2.2.0/Dockerfile .
docker build -t us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311 -f containers/tgi/gpu/2.3.1/Dockerfile .
```
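
If the image is meant to be consumed from Vertex AI or GKE rather than only locally, it can be re-tagged and pushed to a registry you control; a rough sketch, not part of this commit, where `PROJECT_ID` and the `tgi-dlc` repository are placeholders:

```bash
gcloud auth configure-docker us-docker.pkg.dev --quiet
docker tag \
  us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311 \
  us-docker.pkg.dev/PROJECT_ID/tgi-dlc/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
docker push us-docker.pkg.dev/PROJECT_ID/tgi-dlc/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
```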
301 changes: 301 additions & 0 deletions containers/tgi/gpu/2.3.1/Dockerfile
@@ -0,0 +1,301 @@
# Fetch and extract the TGI sources
FROM alpine AS tgi
RUN mkdir -p /tgi
ADD https://github.com/huggingface/text-generation-inference/archive/refs/tags/v2.3.1.tar.gz /tgi/sources.tar.gz
RUN tar -C /tgi -xf /tgi/sources.tar.gz --strip-components=1

# Rust builder
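# cargo-chef splits the Rust build so dependency compilation can be cached: the
# `planner` stage below computes a recipe from the manifests, and the `builder`
# stage cooks (pre-builds) those dependencies before the TGI sources are copied in.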
FROM lukemathwalker/cargo-chef:latest-rust-1.80 AS chef
WORKDIR /usr/src

ARG CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse

FROM chef AS planner
COPY --from=tgi /tgi/Cargo.lock Cargo.lock
COPY --from=tgi /tgi/Cargo.toml Cargo.toml
COPY --from=tgi /tgi/rust-toolchain.toml rust-toolchain.toml
COPY --from=tgi /tgi/proto proto
COPY --from=tgi /tgi/benchmark benchmark
COPY --from=tgi /tgi/router router
COPY --from=tgi /tgi/backends backends
COPY --from=tgi /tgi/launcher launcher

RUN cargo chef prepare --recipe-path recipe.json

FROM chef AS builder

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
python3.11-dev
RUN PROTOC_ZIP=protoc-21.12-linux-x86_64.zip && \
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP && \
unzip -o $PROTOC_ZIP -d /usr/local bin/protoc && \
unzip -o $PROTOC_ZIP -d /usr/local 'include/*' && \
rm -f $PROTOC_ZIP

COPY --from=planner /usr/src/recipe.json recipe.json
RUN cargo chef cook --profile release-opt --recipe-path recipe.json

COPY --from=tgi /tgi/Cargo.toml Cargo.toml
COPY --from=tgi /tgi/rust-toolchain.toml rust-toolchain.toml
COPY --from=tgi /tgi/proto proto
COPY --from=tgi /tgi/benchmark benchmark
COPY --from=tgi /tgi/router router
COPY --from=tgi /tgi/backends backends
COPY --from=tgi /tgi/launcher launcher
RUN cargo build --profile release-opt --features google

# Python builder
# Adapted from: https://github.com/pytorch/pytorch/blob/master/Dockerfile
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS pytorch-install

# NOTE: When updating the PyTorch version, remember to remove the `pip install nvidia-nccl-cu12==2.22.3` below in the Dockerfile. Context: https://github.com/huggingface/text-generation-inference/pull/2099
ARG PYTORCH_VERSION=2.4.0

ARG PYTHON_VERSION=3.11
# Keep in sync with `server/pyproject.toml`
ARG CUDA_VERSION=12.4
ARG MAMBA_VERSION=24.3.0-0
ARG CUDA_CHANNEL=nvidia
ARG INSTALL_CHANNEL=pytorch
# Automatically set by buildx
ARG TARGETPLATFORM

ENV PATH /opt/conda/bin:$PATH

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
build-essential \
ca-certificates \
ccache \
curl \
git && \
rm -rf /var/lib/apt/lists/*

# Install conda
# translating Docker's TARGETPLATFORM into mamba arches
RUN case ${TARGETPLATFORM} in \
"linux/arm64") MAMBA_ARCH=aarch64 ;; \
*) MAMBA_ARCH=x86_64 ;; \
esac && \
curl -fsSL -v -o ~/mambaforge.sh -O "https://github.com/conda-forge/miniforge/releases/download/${MAMBA_VERSION}/Mambaforge-${MAMBA_VERSION}-Linux-${MAMBA_ARCH}.sh"
RUN chmod +x ~/mambaforge.sh && \
bash ~/mambaforge.sh -b -p /opt/conda && \
rm ~/mambaforge.sh

# Install pytorch
# On arm64 we exit with an error code
RUN case ${TARGETPLATFORM} in \
"linux/arm64") exit 1 ;; \
*) /opt/conda/bin/conda update -y conda && \
/opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -c "${CUDA_CHANNEL}" -y "python=${PYTHON_VERSION}" "pytorch=$PYTORCH_VERSION" "pytorch-cuda=$(echo $CUDA_VERSION | cut -d'.' -f 1-2)" ;; \
esac && \
/opt/conda/bin/conda clean -ya

# CUDA kernels builder image
FROM pytorch-install AS kernel-builder

ARG MAX_JOBS=4
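# Build kernels for Ampere (8.0, 8.6) and Hopper (9.0) GPUs; `+PTX` keeps forward
# compatibility with newer architectures through JIT compilation.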
ENV TORCH_CUDA_ARCH_LIST="8.0;8.6;9.0+PTX"

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
ninja-build cmake \
&& rm -rf /var/lib/apt/lists/*

# Build Flash Attention CUDA kernels
FROM kernel-builder AS flash-att-builder

WORKDIR /usr/src

COPY --from=tgi /tgi/server/Makefile-flash-att Makefile

# Build specific version of flash attention
RUN make build-flash-attention

# Build Flash Attention v2 CUDA kernels
FROM kernel-builder AS flash-att-v2-builder

WORKDIR /usr/src

COPY --from=tgi /tgi/server/Makefile-flash-att-v2 Makefile

# Build specific version of flash attention v2
RUN make build-flash-attention-v2-cuda

# Build Transformers exllama kernels
FROM kernel-builder AS exllama-kernels-builder
WORKDIR /usr/src
COPY --from=tgi /tgi/server/exllama_kernels/ .

RUN python setup.py build

# Build Transformers exllamav2 kernels
FROM kernel-builder AS exllamav2-kernels-builder
WORKDIR /usr/src
COPY --from=tgi /tgi/server/Makefile-exllamav2/ Makefile

# Build specific version of exllamav2
RUN make build-exllamav2

# Build Transformers awq kernels
FROM kernel-builder AS awq-kernels-builder
WORKDIR /usr/src
COPY --from=tgi /tgi/server/Makefile-awq Makefile
# Build specific version of the AWQ kernels
RUN make build-awq

# Build eetq kernels
FROM kernel-builder AS eetq-kernels-builder
WORKDIR /usr/src
COPY --from=tgi /tgi/server/Makefile-eetq Makefile
# Build specific version of the EETQ kernels
RUN make build-eetq

# Build Lorax Punica kernels
FROM kernel-builder AS lorax-punica-builder
WORKDIR /usr/src
COPY --from=tgi /tgi/server/Makefile-lorax-punica Makefile
# Build specific version of the Punica kernels
RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" make build-lorax-punica

# Build Transformers CUDA kernels
FROM kernel-builder AS custom-kernels-builder
WORKDIR /usr/src
COPY --from=tgi /tgi/server/custom_kernels/ .
# Build the custom kernels
RUN python setup.py build

# Build FBGEMM CUDA kernels
FROM kernel-builder AS fbgemm-builder

WORKDIR /usr/src

COPY --from=tgi /tgi/server/Makefile-fbgemm Makefile

RUN make build-fbgemm

# Build vllm CUDA kernels
FROM kernel-builder AS vllm-builder

WORKDIR /usr/src

ENV TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.9 9.0+PTX"

COPY --from=tgi /tgi/server/Makefile-vllm Makefile

# Build specific version of vllm
RUN make build-vllm-cuda

# Build mamba kernels
FROM kernel-builder AS mamba-builder
WORKDIR /usr/src
COPY --from=tgi /tgi/server/Makefile-selective-scan Makefile
RUN make build-all

# Build flashinfer
FROM kernel-builder AS flashinfer-builder
WORKDIR /usr/src
COPY --from=tgi /tgi/server/Makefile-flashinfer Makefile
RUN make install-flashinfer

# Text Generation Inference base image
FROM nvidia/cuda:12.1.0-base-ubuntu22.04 AS base

# Conda env
ENV PATH=/opt/conda/bin:$PATH \
CONDA_PREFIX=/opt/conda

# Text Generation Inference base env
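# The Hugging Face cache lives in /tmp instead of the upstream default /data, and the
# server listens on 8080 instead of 80, the default open port on GCP instances.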
ENV HF_HOME=/tmp \
HF_HUB_ENABLE_HF_TRANSFER=1 \
PORT=8080

WORKDIR /usr/src

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
libssl-dev \
ca-certificates \
make \
curl \
git \
&& rm -rf /var/lib/apt/lists/*

# Copy conda with PyTorch installed
COPY --from=pytorch-install /opt/conda /opt/conda

# Copy build artifacts from flash attention builder
COPY --from=flash-att-builder /usr/src/flash-attention/build/lib.linux-x86_64-cpython-311 /opt/conda/lib/python3.11/site-packages
COPY --from=flash-att-builder /usr/src/flash-attention/csrc/layer_norm/build/lib.linux-x86_64-cpython-311 /opt/conda/lib/python3.11/site-packages
COPY --from=flash-att-builder /usr/src/flash-attention/csrc/rotary/build/lib.linux-x86_64-cpython-311 /opt/conda/lib/python3.11/site-packages

# Copy build artifacts from flash attention v2 builder
COPY --from=flash-att-v2-builder /opt/conda/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so /opt/conda/lib/python3.11/site-packages

# Copy build artifacts from custom kernels builder
COPY --from=custom-kernels-builder /usr/src/build/lib.linux-x86_64-cpython-311 /opt/conda/lib/python3.11/site-packages
# Copy build artifacts from exllama kernels builder
COPY --from=exllama-kernels-builder /usr/src/build/lib.linux-x86_64-cpython-311 /opt/conda/lib/python3.11/site-packages
# Copy build artifacts from exllamav2 kernels builder
COPY --from=exllamav2-kernels-builder /usr/src/exllamav2/build/lib.linux-x86_64-cpython-311 /opt/conda/lib/python3.11/site-packages
# Copy build artifacts from awq kernels builder
COPY --from=awq-kernels-builder /usr/src/llm-awq/awq/kernels/build/lib.linux-x86_64-cpython-311 /opt/conda/lib/python3.11/site-packages
# Copy build artifacts from eetq kernels builder
COPY --from=eetq-kernels-builder /usr/src/eetq/build/lib.linux-x86_64-cpython-311 /opt/conda/lib/python3.11/site-packages
# Copy build artifacts from lorax punica kernels builder
COPY --from=lorax-punica-builder /usr/src/lorax-punica/server/punica_kernels/build/lib.linux-x86_64-cpython-311 /opt/conda/lib/python3.11/site-packages
# Copy build artifacts from fbgemm builder
COPY --from=fbgemm-builder /usr/src/fbgemm/fbgemm_gpu/_skbuild/linux-x86_64-3.11/cmake-install /opt/conda/lib/python3.11/site-packages
# Copy build artifacts from vllm builder
COPY --from=vllm-builder /usr/src/vllm/build/lib.linux-x86_64-cpython-311 /opt/conda/lib/python3.11/site-packages
# Copy build artifacts from mamba builder
COPY --from=mamba-builder /usr/src/mamba/build/lib.linux-x86_64-cpython-311/ /opt/conda/lib/python3.11/site-packages
COPY --from=mamba-builder /usr/src/causal-conv1d/build/lib.linux-x86_64-cpython-311/ /opt/conda/lib/python3.11/site-packages
COPY --from=flashinfer-builder /opt/conda/lib/python3.11/site-packages/flashinfer/ /opt/conda/lib/python3.11/site-packages/flashinfer/

# Install flash-attention dependencies
RUN pip install einops --no-cache-dir

# Install server
COPY --from=tgi /tgi/proto proto
COPY --from=tgi /tgi/server server
COPY --from=tgi /tgi/server/Makefile server/Makefile
RUN cd server && \
make gen-server && \
pip install -r requirements_cuda.txt && \
pip install ".[bnb, accelerate, marlin, moe, quantize, peft, outlines]" --no-cache-dir && \
pip install nvidia-nccl-cu12==2.22.3

ENV LD_PRELOAD=/opt/conda/lib/python3.11/site-packages/nvidia/nccl/lib/libnccl.so.2
# Required to find libpython within the rust binaries
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/opt/conda/lib/"
# This is needed because exl2 tries to load flash-attn
# And fails with our builds.
ENV EXLLAMA_NO_FLASH_ATTN=1

# Deps before the binaries
# The binaries change on every build given we burn the SHA into them
# The deps change less often.
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
build-essential \
g++ \
&& rm -rf /var/lib/apt/lists/*

# Install benchmarker
COPY --from=builder /usr/src/target/release-opt/text-generation-benchmark /usr/local/bin/text-generation-benchmark
# Install router
COPY --from=builder /usr/src/target/release-opt/text-generation-router /usr/local/bin/text-generation-router
# Install launcher
COPY --from=builder /usr/src/target/release-opt/text-generation-launcher /usr/local/bin/text-generation-launcher

# Final image
FROM base

# Install the Google Cloud SDK (gcloud CLI) in a single command
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" \
| tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg \
| apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && \
apt-get update -y && \
apt-get install google-cloud-sdk -y

# Copy the custom entrypoint for Google Cloud
COPY --chmod=775 containers/tgi/gpu/2.3.1/entrypoint.sh entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]
30 changes: 30 additions & 0 deletions containers/tgi/gpu/2.3.1/entrypoint.sh
@@ -0,0 +1,30 @@
#!/bin/bash

# Check if AIP_STORAGE_URI is set and starts with "gs://"
if [[ $AIP_STORAGE_URI == gs://* ]]; then
echo "AIP_STORAGE_URI set and starts with 'gs://', proceeding to download from GCS."
echo "AIP_STORAGE_URI: $AIP_STORAGE_URI"

# Define the target directory
TARGET_DIR="/tmp/model"
mkdir -p "$TARGET_DIR"

# Use gcloud storage to copy the content from GCS to the target directory
echo "Running: gcloud storage cp $AIP_STORAGE_URI/* $TARGET_DIR --recursive"
gcloud storage cp "$AIP_STORAGE_URI/*" "$TARGET_DIR" --recursive

# Check if the gcloud storage command was successful
if [ $? -eq 0 ]; then
echo "Model downloaded successfully to ${TARGET_DIR}."
# Update MODEL_ID to point to the local directory
echo "Updating MODEL_ID to point to the local directory."
export MODEL_ID="$TARGET_DIR"
else
echo "Failed to download model from GCS."
exit 1
fi
fi

ldconfig 2>/dev/null || echo 'unable to refresh ld cache, not a big deal in most cases'

text-generation-launcher "$@"
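
The `gs://` branch of the entrypoint can also be smoke-tested locally by mounting gcloud credentials into the container; a rough sketch, not part of this commit, where the bucket path is a placeholder and the mounted gcloud configuration is assumed to have access to it:

```bash
docker run --rm --gpus all -p 8080:8080 \
  -v ~/.config/gcloud:/root/.config/gcloud \
  -e AIP_STORAGE_URI=gs://my-bucket/path/to/model \
  us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
```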