Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pytorch/training/gpu/2.3.1/transformers/4.48.0/py311/Dockerfile #134

Merged
merged 15 commits into from
Jan 14, 2025
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions containers/pytorch/training/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ The PyTorch Training containers will start a training job that will start on `do
docker run --gpus all -ti \
-v $(pwd)/artifact:/artifact \
-e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-cu121.2-3.transformers.4-42.ubuntu2204.py310 \
us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-cu121.2-3.transformers.4-47.ubuntu2204.py311 \
trl sft \
--model_name_or_path google/gemma-2b \
--attn_implementation "flash_attention_2" \
Expand Down Expand Up @@ -76,7 +76,7 @@ The PyTorch Training containers come with two different containers depending on
- **GPU**: To build the PyTorch Training container for GPU, an instance with at least one NVIDIA GPU available is required to install `flash-attn` (used to speed up the attention layers during training and inference).

```bash
docker build -t us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-cu121.2-3.transformers.4-42.ubuntu2204.py310 -f containers/pytorch/training/gpu/2.3.0/transformers/4.42.3/py310/Dockerfile .
docker build -t us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-cu121.2-3.transformers.4-47.ubuntu2204.py311 -f containers/pytorch/training/gpu/2.3.0/transformers/4.47.1/py311/Dockerfile .
```

- **TPU**: To build the PyTorch Training container for Google Cloud TPUs, an instance with at least one TPU available is required to install `optimum-tpu` which is a Python library with Google TPU optimizations for `transformers` models, making its integration seamless.
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04
SHELL ["/bin/bash", "-c"]

LABEL maintainer="Hugging Face"
ARG DEBIAN_FRONTEND=noninteractive

# Versions
ARG CUDA="cu121"
ARG PYTORCH="2.3.1"
ARG FLASH_ATTN="2.6.3"
ARG TRANSFORMERS="4.47.1"
ARG HUGGINGFACE_HUB="0.27.0"
ARG DIFFUSERS="0.32.1"
ARG PEFT="0.14.0"
ARG TRL="0.13.0"
ARG BITSANDBYTES="0.45.0"
ARG DATASETS="3.2.0"
ARG ACCELERATE="1.2.1"
ARG EVALUATE="0.4.3"
ARG SENTENCE_TRANSFORMERS="3.3.1"
ARG DEEPSPEED="0.16.1"
ARG MAX_JOBS=4

RUN apt-get update -y && \
apt-get install software-properties-common -y && \
add-apt-repository ppa:deadsnakes/ppa && \
apt-get -y upgrade --only-upgrade systemd openssl cryptsetup && \
apt-get install -y \
build-essential \
bzip2 \
curl \
git \
git-lfs \
tar \
gcc \
g++ \
cmake \
gnupg \
libprotobuf-dev \
libaio-dev \
protobuf-compiler \
python3.11 \
python3.11-dev \
libsndfile1-dev \
ffmpeg && \
apt-get clean autoremove --yes && \
rm -rf /var/lib/apt/lists/*

# Set Python 3.11 as the default python version
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 && \
ln -sf /usr/bin/python3.11 /usr/bin/python

# Install pip from source and upgrade it
RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py && \
pip install --upgrade pip

# Download the latest installer
ADD https://astral.sh/uv/install.sh /uv-installer.sh

# Run the installer then remove it
RUN sh /uv-installer.sh && rm /uv-installer.sh

# Ensure the installed binary is on the `PATH`, and use system's Python as default
ENV PATH="/root/.local/bin/:$PATH" \
UV_SYSTEM_PYTHON=1

# Set alias
RUN printf '#!/bin/bash\nuv pip "$@"' > /usr/local/bin/pip && chmod +x /usr/local/bin/pip

# Install latest release PyTorch (PyTorch must be installed before any DeepSpeed C++/CUDA ops.)
RUN pip install --no-cache-dir --index-url https://download.pytorch.org/whl/${CUDA} "torch==${PYTORCH}" torchvision torchaudio

# Install and upgrade Flash Attention 2
RUN pip install --no-cache-dir packaging ninja
RUN MAX_JOBS=${MAX_JOBS} pip install --no-build-isolation flash-attn==${FLASH_ATTN}

# Install Hugging Face Libraries
RUN pip install --no-cache-dir \
"transformers[sklearn,sentencepiece,vision]==${TRANSFORMERS}" \
"huggingface_hub[hf_transfer]==${HUGGINGFACE_HUB}" \
"diffusers==${DIFFUSERS}" \
"datasets==${DATASETS}" \
"accelerate==${ACCELERATE}" \
"evaluate==${EVALUATE}" \
"peft==${PEFT}" \
"trl==${TRL}" \
"sentence-transformers==${SENTENCE_TRANSFORMERS}" \
"deepspeed==${DEEPSPEED}" \
"bitsandbytes==${BITSANDBYTES}" \
tensorboard \
jupyter notebook

ENV HF_HUB_ENABLE_HF_TRANSFER="1"

# Install Google Cloud Dependencies
RUN pip install --upgrade --no-cache-dir \
google-cloud-storage \
google-cloud-bigquery \
google-cloud-aiplatform \
google-cloud-pubsub \
google-cloud-logging \
"protobuf<4.0.0"

# Install Google CLI single command
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" \
| tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg \
| apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && \
touch /var/lib/dpkg/status && \
apt-get update -y && \
apt-get install google-cloud-sdk -y && \
apt-get clean autoremove --yes && \
rm -rf /var/lib/{apt,dpkg,cache,log}
Loading