From 7bb340defa0b8ca1542c6b89238c2be06a08e7cb Mon Sep 17 00:00:00 2001
From: Libin Tang
Date: Fri, 24 Jan 2025 14:14:15 -0800
Subject: [PATCH] Readme modification (#1700)

Co-authored-by: Vidya Galli
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
---
 examples/image-to-text/README.md           | 322 ++-------------------
 examples/text-feature-extraction/README.md |   7 -
 2 files changed, 26 insertions(+), 303 deletions(-)

diff --git a/examples/image-to-text/README.md b/examples/image-to-text/README.md
index e4dbb05472..7a8ad04664 100644
--- a/examples/image-to-text/README.md
+++ b/examples/image-to-text/README.md
@@ -17,111 +17,12 @@ limitations under the License.
 # Image to Text Examples
 This directory contains a script that showcases how to perform image-to-text generation on Intel® Gaudi® AI Accelerators.
-## Single-HPU inference
+Habana FusedSDPA is a fused and optimized implementation of torch.nn.functional.scaled_dot_product_attention() for Gaudi. For more details, refer to the [Gaudi online documentation](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_PyTorch_Models.html?highlight=fusedsdpa#using-fused-scaled-dot-product-attention-fusedsdpa). Many models have been optimized with FusedSDPA, as listed in [optimum/habana/transformers/models](https://github.com/huggingface/optimum-habana/tree/main/optimum/habana/transformers/models). If a model is not optimized with FusedSDPA, it falls back to the default [SDPA implementation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html).

-Models that have been validated:
- - [nlpconnect/vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning)
- - [Salesforce/blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large)
- - [Salesforce/blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base)
- - [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf)
- - [llava-hf/llava-1.5-13b-hf](https://huggingface.co/llava-hf/llava-1.5-13b-hf)
- - [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf)
- - [llava-hf/llava-v1.6-vicuna-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf)
- - [llava-hf/llava-v1.6-vicuna-13b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf)
- - [llava-hf/llava-v1.6-34b-hf](https://huggingface.co/llava-hf/llava-v1.6-34b-hf)
- - [llava-hf/llama3-llava-next-8b-hf](https://huggingface.co/llava-hf/llama3-llava-next-8b-hf)
- - [HuggingFaceM4/idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b)
- - [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
- - [meta-llama/Llama-3.2-90B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct)
- - [tiiuae/falcon-11B-vlm](https://huggingface.co/tiiuae/falcon-11B-vlm)
- - [google/paligemma-3b-mix-224](https://huggingface.co/google/paligemma-3b-mix-224)
+## Inference with mixed precision (BF16)

-### Inference with BF16
-
-To run Salesforce/blip-image-captioning-large inference, use the following command:
-```bash
-python3 run_pipeline.py \
- --model_name_or_path Salesforce/blip-image-captioning-large \
- --image_path "https://ankur3107.github.io/assets/images/image-captioning-example.png" \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-To run Llava-1.5-7b inference, use the following command:
-```bash
-python3 run_pipeline.py \
- --model_name_or_path llava-hf/llava-1.5-7b-hf \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-To run Llava-1.5-13b inference, use the following command:
-```bash
-python3 run_pipeline.py \
- --model_name_or_path llava-hf/llava-1.5-13b-hf \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-To run Llava-v1.6-mistral-7b inference, use the following command:
-```bash
-python3 run_pipeline.py \
- --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-To run Llava-v1.6-vicuna-13b inference, use the following command:
-```bash
-python3 run_pipeline.py \
- --model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-To run Llava-hf/llava-v1.6-34b-hf inference, use the following command:
-```bash
-python3 run_pipeline.py \
- --model_name_or_path llava-hf/llava-v1.6-34b-hf \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-To run google/paligemma-3b-mix-224 inference, use the following command:
-```bash
-python3 run_pipeline.py \
- --model_name_or_path google/paligemma-3b-mix-224 \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-To run Llava-hf/llama3-llava-next-8b-hf inference, use the following command:
-```bash
-python3 run_pipeline.py \
- --model_name_or_path llava-hf/llama3-llava-next-8b-hf \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-To run idefics2 inference, use the following command:
-
-```bash
-python3 run_pipeline.py \
- --model_name_or_path HuggingFaceM4/idefics2-8b \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-To run mllama inference using reduced precision in the SDPA, use the following command:
+### Single-card inference with BF16
+To run Llama inference with SDPA, use the following command:

 ```bash
 python3 run_pipeline.py \
@@ -130,55 +31,30 @@ python3 run_pipeline.py \
 --bf16 \
 --sdp_on_bf16
 ```
+> SDPA may introduce [reduced precision](https://pytorch.org/docs/stable/notes/numerical_accuracy.html#reduced-precision-reduction-for-fp16-and-bf16-in-scaled-dot-product-attention-sdpa).

-### Inference with FP8
-Inference for Llava-1.5-7b, Llava-1.5-13b, Llava-v1.6-mistral-7b and Llava-v1.6-vicuna-13b in FP8 precision are enabled using [Intel Neural Compressor (INC)](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html), which provides model measurement and quantization capabilities in PyTorch.
-More information on enabling FP8 in SynapseAI is available here:
-https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html
+### Multi-card inference with BF16

-Here is an example to measure the tensor quantization statistics on Llava-1.5-7b:
-```bash
-QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
- --model_name_or_path llava-hf/llava-1.5-7b-hf \
- --image_path "https://llava-vl.github.io/static/images/view.jpg" \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-Here is an example to quantize the model based on previous measurements for Llava-1.5-7b:
+Use the following command to run Llama-3.2-90B-Vision-Instruct BF16 inference with FusedSDPA on 8 HPUs:
 ```bash
-QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python run_pipeline.py \
- --model_name_or_path llava-hf/llava-1.5-7b-hf \
+PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
+ --model_name_or_path meta-llama/Llama-3.2-90B-Vision-Instruct \
 --image_path "https://llava-vl.github.io/static/images/view.jpg" \
 --use_hpu_graphs \
 --bf16 \
- --sdp_on_bf16
+ --use_flash_attention \
+ --flash_attention_recompute
 ```
+## Inference with FP8

-Here is an example to measure the tensor quantization statistics on Llava-v1.6-mistral-7b:
-```bash
-QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
- --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
- --image_path "https://llava-vl.github.io/static/images/view.jpg" \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-Here is an example to quantize the model based on previous measurements for Llava-v1.6-mistral-7b:
-```bash
-QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python run_pipeline.py \
- --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
- --image_path "https://llava-vl.github.io/static/images/view.jpg" \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
+Inference with FP8 precision is enabled using [Intel Neural Compressor (INC)](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/index.html?highlight=inc), which provides model measurement and quantization capabilities in PyTorch.
+More information on enabling FP8 in SynapseAI is available here:
+[Run Inference Using FP8](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html?highlight=fp8)

-Here is an example to measure the tensor quantization statistics on Llava-v1.6-vicuna-13b:
+### Single-card inference with FP8
+Here is an example to measure the tensor quantization statistics on Llava-v1.6-vicuna-13b with SDPA:
 ```bash
 QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
 --model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
@@ -188,7 +64,7 @@ QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
 --sdp_on_bf16
 ```

-Here is an example to quantize the model based on previous measurements for Llava-v1.6-vicuna-13b:
+Here is an example to quantize the model based on previous measurements for Llava-v1.6-vicuna-13b with SDPA:
 ```bash
 QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python run_pipeline.py \
 --model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
@@ -198,25 +74,10 @@ QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python r
 --sdp_on_bf16
 ```

-### Inference with FusedSDPA
-
-Habana FusedSDPA is a fused and optimized implementation of torch.nn.functional.scaled_dot_product_attention() for Gaudi. For more details, refer to [Gaudi online documentation](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_PyTorch_Models.html?highlight=fusedsdpa#using-fused-scaled-dot-product-attention-fusedsdpa).
-
-Use the following command to run Llava-1.5-7b BF16 inference with FusedSDPA
-```bash
-python3 run_pipeline.py \
- --model_name_or_path llava-hf/llava-1.5-7b-hf \
- --image_path "https://llava-vl.github.io/static/images/view.jpg" \
- --use_hpu_graphs \
- --bf16 \
- --use_flash_attention \
- --flash_attention_recompute
-```
-
-
-Use the following command to run Llava-v1.6-mistral-7b BF16 inference with FusedSDPA
+### Multi-card inference with FP8
+Here is an example of measuring the tensor quantization statistics on Llava-v1.6-mistral-7b with FusedSDPA on 8 HPUs:
 ```bash
-python3 run_pipeline.py \
+QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
 --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
 --image_path "https://llava-vl.github.io/static/images/view.jpg" \
 --use_hpu_graphs \
 --bf16 \
 --use_flash_attention \
 --flash_attention_recompute
 ```

-
-Use the following commands to run Llava-v1.6-mistral-7b FP8 inference with FusedSDPA
-
-Here is an example of measuring the tensor quantization statistics on Llava-v1.6-mistral-7b:
+Here is an example of quantizing the model based on previous measurements for Llava-v1.6-mistral-7b with FusedSDPA on 8 HPUs:
 ```bash
-QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
+QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
 --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
 --image_path "https://llava-vl.github.io/static/images/view.jpg" \
 --use_hpu_graphs \
 --bf16 \
 --use_flash_attention \
 --flash_attention_recompute
 ```

-Here is an example of quantizing the model based on previous measurements for Llava-v1.6-mistral-7b:
-```bash
-QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python run_pipeline.py \
- --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
- --image_path "https://llava-vl.github.io/static/images/view.jpg" \
- --use_hpu_graphs \
- --bf16 \
- --use_flash_attention \
- --flash_attention_recompute
-```
 ## LORA Finetune
-To run LoRA finetuning, you can use `run_image2text_lora_finetune.py`.
-Here are single-/multi-device command examples for HuggingFaceM4/idefics2-8b.
-
-```bash
-python3 run_image2text_lora_finetune.py \
- --model_name_or_path HuggingFaceM4/idefics2-8b \
- --dataset_name nielsr/docvqa_1200_examples \
- --bf16 True \
- --output_dir ./model_lora_llama \
- --num_train_epochs 1 \
- --per_device_train_batch_size 2 \
- --per_device_eval_batch_size 2 \
- --gradient_accumulation_steps 8 \
- --weight_decay 0.01 \
- --logging_steps 25 \
- --eval_strategy "no" \
- --save_strategy "no" \
- --learning_rate 5e-5 \
- --warmup_steps 50 \
- --lr_scheduler_type "constant" \
- --input_column_names 'image' 'query' \
- --output_column_names 'answers' \
- --remove_unused_columns False \
- --do_train \
- --do_eval \
- --use_habana \
- --use_lazy_mode \
- --lora_rank=8 \
- --lora_alpha=8 \
- --lora_dropout=0.1 \
- --max_seq_length=512 \
- --use_hpu_graphs_for_inference \
- --low_cpu_mem_usage True \
- --lora_target_modules '.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$'
-```
-
-```bash
-python3 ../gaudi_spawn.py \
- --world_size 8 --use_mpi run_image2text_lora_finetune.py \
- --model_name_or_path HuggingFaceM4/idefics2-8b \
- --dataset_name nielsr/docvqa_1200_examples \
- --bf16 True \
- --output_dir ./model_lora_llama \
- --num_train_epochs 1 \
- --per_device_train_batch_size 2 \
- --per_device_eval_batch_size 2 \
- --gradient_accumulation_steps 8 \
- --weight_decay 0.01 \
- --logging_steps 25 \
- --eval_strategy "no" \
- --save_strategy "no" \
- --learning_rate 5e-5 \
- --warmup_steps 50 \
- --lr_scheduler_type "constant" \
- --input_column_names 'image' 'query' \
- --output_column_names 'answers' \
- --remove_unused_columns False \
- --do_train \
- --do_eval \
- --use_habana \
- --use_lazy_mode \
- --lora_rank=8 \
- --lora_alpha=8 \
- --lora_dropout=0.1 \
- --max_seq_length=512 \
- --use_hpu_graphs_for_inference \
- --low_cpu_mem_usage True \
- --lora_target_modules '".*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$"'
-```
-
 Here are single-/multi-device command examples for meta-llama/Llama-3.2-11B-Vision-Instruct.
 ```bash
@@ -390,54 +168,6 @@ python3 ../gaudi_spawn.py \
 --world_size 8 --use_mpi run_image2text_lora_finetune.py \
 --model_name_or_path meta-llama/Llama-3.2-11B-Vision-Instruct \
 --dataset_name nielsr/docvqa_1200_examples \
 --bf16 True \
 --output_dir ./model_lora_llama \
 --num_train_epochs 2 \
 --per_device_train_batch_size 2 \
 --per_device_eval_batch_size 2 \
 --gradient_accumulation_steps 8 \
 --weight_decay 0.01 \
 --logging_steps 25 \
 --eval_strategy "no" \
 --save_strategy "no" \
 --learning_rate 5e-5 \
 --warmup_steps 50 \
 --lr_scheduler_type "constant" \
 --input_column_names 'image' 'query' \
 --output_column_names 'answers' \
 --remove_unused_columns False \
 --do_train \
 --do_eval \
 --use_habana \
 --use_lazy_mode \
 --lora_rank=8 \
 --lora_alpha=8 \
 --lora_dropout=0.1 \
 --max_seq_length=512 \
 --use_hpu_graphs_for_inference \
 --low_cpu_mem_usage True \
 --lora_target_modules '".*(language_model).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$"'
 ```

-## Multi-HPU inference
-
-### BF16 Inference with FusedSDPA on 8 HPUs
-
-Use the following commands to run Llava-v1.6-mistral-7b BF16 inference with FusedSDPA on 8 HPUs:
-```bash
-python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
- --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
- --image_path "https://llava-vl.github.io/static/images/view.jpg" \
- --use_hpu_graphs \
- --bf16 \
- --use_flash_attention \
- --flash_attention_recompute
-```
-
-Use the following commands to run Llama-3.2-90B-Vision-Instruct BF16 inference with FusedSDPA on 8 HPUs:
-```bash
-PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
- --model_name_or_path meta-llama/Llama-3.2-90B-Vision-Instruct \
- --image_path "https://llava-vl.github.io/static/images/view.jpg" \
- --use_hpu_graphs \
- --bf16 \
- --use_flash_attention \
- --flash_attention_recompute
-```
-
-
-### FP8 Inference with FusedSDPA on 8 HPUs
-
-Use the following commands to run Llava-v1.6-mistral-7b FP8 inference with FusedSDPA on 8 HPUs.
-Here is an example of measuring the tensor quantization statistics on Llava-v1.6-mistral-7b on 8 HPUs:
-```bash
-QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
- --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
- --image_path "https://llava-vl.github.io/static/images/view.jpg" \
- --use_hpu_graphs \
- --bf16 \
- --use_flash_attention \
- --flash_attention_recompute
-```
-
-Here is an example of quantizing the model based on previous measurements for Llava-v1.6-mistral-7b on 8 HPUs:
-```bash
-QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
- --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
- --image_path "https://llava-vl.github.io/static/images/view.jpg" \
- --use_hpu_graphs \
- --bf16 \
- --use_flash_attention \
- --flash_attention_recompute
-```
+> For other models, adjust the training parameters and `lora_target_modules` accordingly. For example, for HuggingFaceM4/idefics2-8b,
+> replace `lora_target_modules` with:
+> '".*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$"'
diff --git a/examples/text-feature-extraction/README.md b/examples/text-feature-extraction/README.md
index 2b0d5354ef..e46168840b 100644
--- a/examples/text-feature-extraction/README.md
+++ b/examples/text-feature-extraction/README.md
@@ -31,10 +31,3 @@ python run_feature_extraction.py \
 --sdp_on_bf16 \
 --bf16
 ```
-
-Models that have been validated:
-
-- [Supabase/gte-small](https://huggingface.co/Supabase/gte-small)
-- [thenlper/gte-small](https://huggingface.co/thenlper/gte-small)
-- [thenlper/gte-base](https://huggingface.co/thenlper/gte-base)
-- [thenlper/gte-large](https://huggingface.co/thenlper/gte-large)