Add a dockerfile to text-generation example and update README for 1.19
louie-tsai committed Jan 25, 2025
1 parent 66bc191 commit 7c78502
Showing 2 changed files with 85 additions and 17 deletions.
21 changes: 21 additions & 0 deletions examples/text-generation/Dockerfile
@@ -0,0 +1,21 @@
FROM vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest

# Need node to build doc HTML. Taken from https://stackoverflow.com/a/67491580
RUN apt-get update && apt-get install -y \
software-properties-common \
npm
RUN npm install n -g && \
n latest

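# Install the latest pip, Optimum Habana, and Habana's DeepSpeed fork pinned to release 1.19.0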
RUN python3 -m pip install --no-cache-dir --upgrade pip
RUN python3 -m pip install --upgrade-strategy eager optimum[habana]
RUN python3 -m pip install git+https://github.com/HabanaAI/[email protected]

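# Clone the optimum-habana repository and check out the matching release tag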
RUN mkdir -p /workspace
WORKDIR /workspace
RUN git clone https://github.com/huggingface/optimum-habana && cd optimum-habana && git checkout v1.15.0

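# Install the text-generation example requirements, including those for lm-eval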
WORKDIR /workspace/optimum-habana/examples/text-generation
RUN python3 -m pip install -r requirements.txt
RUN python3 -m pip install -r requirements_lm_eval.txt

81 changes: 64 additions & 17 deletions examples/text-generation/README.md
@@ -21,6 +21,8 @@ Conditional text generation on Intel® Gaudi® AI Accelerators. You can find more

## Requirements

Please make sure to follow the [Driver Installation](https://docs.habana.ai/en/latest/Installation_Guide/Driver_Installation.html) guide to install the Gaudi driver on the system.
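As a quick sanity check (a suggestion, not part of the official instructions), the `hl-smi` tool that ships with the driver can confirm the devices are visible:
```bash
# Lists the Gaudi devices and the installed driver version;
# a populated device table means the driver is loaded.
hl-smi
```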
### Bare metal
First, you should install the requirements:
```bash
pip install -r requirements.txt
@@ -35,7 +37,23 @@ Then, if you plan to use [DeepSpeed-inference](https://docs.habana.ai/en/latest/
```bash
pip install git+https://github.com/HabanaAI/[email protected]
```
### Docker Image
To use the Dockerfile provided with this sample, please follow the [Docker Installation](https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/Docker_Installation.html) guide to set up the Habana runtime for Docker images.

#### Docker Build
To build the image from the Dockerfile, run the command below to create the optimum-habana-text-gen image.
```bash
docker build --no-cache -t optimum-habana-text-gen:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f Dockerfile .
```
#### Docker Run
After the build completes, run the command below to start a container; you will land inside it in the text-generation folder.
```bash
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=ALL --privileged=true --net=host --ipc=host optimum-habana-text-gen:latest
```
> [!NOTE]
> Hugging Face model files can be large, so we recommend using an external disk for the Hugging Face hub folder. \
> Set the `HF_HOME` environment variable to a path on your external disk and mount that path into the container. \
> ex: "-e HF_HOME=/mnt/huggingface -v /mnt:/mnt"
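
For example, combining the run command above with the cache relocation might look like this sketch (the `/mnt` paths are placeholders for your own mount point):
```bash
# Same run command as above, with the Hugging Face cache redirected
# to an external disk mounted at /mnt.
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e HF_HOME=/mnt/huggingface -v /mnt:/mnt \
  --cap-add=ALL --privileged=true --net=host --ipc=host \
  optimum-habana-text-gen:latest
```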
## Usage

@@ -132,7 +150,18 @@ Here are a few settings you may be interested in:
- `--prompt` to benchmark the model on one or several prompts of your choice
- `--attn_softmax_bf16` to run attention softmax layer in bfloat16 precision provided that the model (such as Llama) supports it
- `--trim_logits` to calculate logits only for the last token in the first time step provided that the model (such as Llama) supports it
- `--bucket_size` to grow the cache/input in multiples of `bucket_size` instead of padding up the kv-cache up to full size before starting
- `--bucket_internal` a more optimized version of bucketing for certain models like Llama
- `--flash_attention_causal_mask` to further improve performance by taking advantage of the specific lower-diagonal shape of the inputs to the softmax operation
- `--use_flash_attention` to enable Habana Flash Attention
- `--flash_attention_recompute` to reduce memory consumption on the prompt stage
- `--book_source` to use Project Gutenberg books data as input. Useful for testing large sequence lengths.

There are also some environment variables useful for benchmarking (a combined example follows this list):
- `export HF_DATASETS_TRUST_REMOTE_CODE=true`: most datasets in lm-evaluation-harness are defined on the Hugging Face Hub using dataset scripts and may require `HF_DATASETS_TRUST_REMOTE_CODE=true` to be set
- `export TQDM_DISABLE=1`: to disable all tqdm progress bars
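
As a sketch, a run combining several of these settings could look like the following (the model name, bucket size, and token count are illustrative placeholders, not a recommended configuration):
```bash
# Hypothetical benchmarking run combining the flags and variables above.
export HF_DATASETS_TRUST_REMOTE_CODE=true
export TQDM_DISABLE=1
python run_generation.py \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--use_hpu_graphs \
--use_kv_cache \
--max_new_tokens 128 \
--bucket_size=128 \
--bucket_internal \
--use_flash_attention \
--flash_attention_causal_mask \
--bf16
```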

#### bigscience/bloom
For example, you can reproduce the results presented in [this blog post](https://huggingface.co/blog/habana-gaudi-2-bloom) with the following command:
```bash
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
@@ -143,7 +172,7 @@ python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
--max_new_tokens 100 \
--sdp_on_bf16
```

#### Llama2-70B
You can also run Llama2-70B on Gaudi2 with all optimizations enabled using the following command:
```bash
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
@@ -159,7 +188,7 @@ python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
--trim_logits \
--sdp_on_bf16
```

#### Falcon-7B
To run Falcon-7B inference, use the following command:
```bash
python run_generation.py \
@@ -172,7 +201,7 @@ python run_generation.py \
--do_sample \
--sdp_on_bf16
```

#### Falcon-40B
To run Falcon-40B inference on 8 Gaudi2 cards, use the following command:
```bash
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
@@ -186,7 +215,7 @@ python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
--use_flash_attention \
--flash_attention_causal_mask
```

#### Llama3-405B
To run Llama3-405B inference on 8 Gaudi3 cards, use the following command:
```bash
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
@@ -357,6 +386,8 @@ Llama2-70b, Llama2-7b, Llama3-70b, Llama3-8b, Mixtral-8x7B, Falcon-7B, Falcon-40
More information on enabling fp8 in SynapseAI is available here:
https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html

#### Llama2-70b
##### 1. tensor quantization statistics
Here is an example to measure the tensor quantization statistics on Llama2-70b:
```bash
QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py \
@@ -374,7 +405,7 @@ QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py
--bf16 \
--batch_size 1
```

##### 2. quantize and run the model
Here is an example to quantize the model based on previous measurements for Llama2-70b:
```bash
QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \
@@ -405,13 +436,19 @@ QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \
--reuse_cache \
--use_flash_attention \
--flash_attention_recompute \
--flash_attention_causal_mask \
--bf16 \
--batch_size 350 \
--max_new_tokens 2048 \
--max_input_tokens 2048 \
--warmup 2 \
--bucket_size=128 \
--bucket_internal \
--limit_hpu_graphs
```

#### Mixtral-8x7B
##### 1. tensor quantization statistics
Here is an example to measure the tensor quantization statistics on Mixtral-8x7B with 1 card:
```bash
QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py \
@@ -424,7 +461,7 @@ QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py
--batch_size 1 \
--bf16
```

##### 2. quantize and run the model
Here is an example to quantize the model based on previous measurements for Mixtral-8x7B with 1 card:
```bash
QUANT_CONFIG=./quantization_config/maxabs_quant_mixtral.json python run_generation.py \
@@ -437,7 +474,8 @@ QUANT_CONFIG=./quantization_config/maxabs_quant_mixtral.json python run_generation.py \
--batch_size 16 \
--bf16
```

#### Falcon-180B
##### 1. tensor quantization statistics
Here is an example to measure the tensor quantization statistics on Falcon-180B with 8 cards:
> Please note that Falcon-180B is a gated model, and users are required to request access to it. Please refer to the instructions provided in the StarCoder example above.
```bash
@@ -455,7 +493,7 @@ QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python ..
--flash_attention_recompute \
--flash_attention_causal_mask
```

##### 2. quantize and run the model
Here is an example to quantize the model based on previous measurements for Falcon-180B with 8 cards:
```bash
QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \
Expand All @@ -474,7 +512,8 @@ QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \
--flash_attention_recompute \
--flash_attention_causal_mask
```

#### Llama3-405B
##### 1. tensor quantization statistics
Here is an example to measure the tensor quantization statistics on Llama3-405B with 8 cards:
> Please note that Llama3-405B requires a minimum of 16 Gaudi2 cards or 8 Gaudi3 cards.
```bash
@@ -492,27 +531,33 @@ QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python ..
--flash_attention_recompute \
--flash_attention_causal_mask
```

##### 2. quantize and run the model
Here is an example to quantize the model based on previous measurements for Llama3-405B with 8 cards:
> Please note that Llama3-405B requires a minimum of 16 Gaudi2 cards or 8 Gaudi3 cards.
```bash
QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \
--use_deepspeed --world_size 8 run_generation.py \
--model_name_or_path meta-llama/Llama-3.1-405B-Instruct \
--attn_softmax_bf16 \
--warmup 2 \
--use_hpu_graphs \
--use_kv_cache \
--limit_hpu_graphs \
--bucket_size=128 \
--bucket_internal \
--max_input_tokens 2048 \
--max_new_tokens 2048 \
--batch_size 180 \
--bf16 \
--reuse_cache \
--trim_logits \
--use_flash_attention \
--flash_attention_recompute \
--flash_attention_causal_mask \
--book_source
```

#### phi-2
##### 1. tensor quantization statistics
Here is an example to measure the tensor quantization statistics on phi-2 with 1 card:

```bash
@@ -527,7 +572,7 @@ QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_lm_eval.py \
--reuse_cache \
--bf16
```

##### 2. quantize and run the model
Here is an example to quantize the model based on previous measurements for phi-2 with 1 card:
```bash
QUANT_CONFIG=./quantization_config/maxabs_quant_phi.json python run_generation.py \
@@ -541,6 +586,8 @@ QUANT_CONFIG=./quantization_config/maxabs_quant_phi.json python run_generation.py \
--reuse_cache
```

#### gemma
##### 1. tensor quantization statistics
Here is an example to measure the tensor quantization statistics on gemma with 1 card:

```bash
@@ -554,7 +601,7 @@ QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py
--bf16 \
--sdp_on_bf16
```

##### 2. quantize and run the model
Here is an example to quantize the model based on previous measurements for gemma with 1 card:
```bash
QUANT_CONFIG=./quantization_config/maxabs_quant_gemma.json python run_generation.py \
