From 7bb340defa0b8ca1542c6b89238c2be06a08e7cb Mon Sep 17 00:00:00 2001
From: Libin Tang
Date: Fri, 24 Jan 2025 14:14:15 -0800
Subject: [PATCH] Readme modification (#1700)

Co-authored-by: Vidya Galli
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
---
 examples/image-to-text/README.md           | 322 ++-------------------
 examples/text-feature-extraction/README.md |   7 -
 2 files changed, 26 insertions(+), 303 deletions(-)

diff --git a/examples/image-to-text/README.md b/examples/image-to-text/README.md
index e4dbb05472..7a8ad04664 100644
--- a/examples/image-to-text/README.md
+++ b/examples/image-to-text/README.md
@@ -17,111 +17,12 @@ limitations under the License.
 # Image to Text Examples
 This directory contains a script that showcases how to perform image-to-text generation on Intel® Gaudi® AI Accelerators.
-## Single-HPU inference
+Habana FusedSDPA is a fused and optimized implementation of torch.nn.functional.scaled_dot_product_attention() for Gaudi. For more details, refer to the [Gaudi online documentation](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_PyTorch_Models.html?highlight=fusedsdpa#using-fused-scaled-dot-product-attention-fusedsdpa). Many models have been optimized with FusedSDPA, as listed in [optimum/habana/transformers/models](https://github.com/huggingface/optimum-habana/tree/main/optimum/habana/transformers/models). If a model is not optimized with FusedSDPA, it falls back to the default [SDPA implementation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html).

-Models that have been validated:
- - [nlpconnect/vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning)
- - [Salesforce/blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large)
- - [Salesforce/blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base)
- - [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf)
- - [llava-hf/llava-1.5-13b-hf](https://huggingface.co/llava-hf/llava-1.5-13b-hf)
- - [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf)
- - [llava-hf/llava-v1.6-vicuna-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf)
- - [llava-hf/llava-v1.6-vicuna-13b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf)
- - [llava-hf/llava-v1.6-34b-hf](https://huggingface.co/llava-hf/llava-v1.6-34b-hf)
- - [llava-hf/llama3-llava-next-8b-hf](https://huggingface.co/llava-hf/llama3-llava-next-8b-hf)
- - [HuggingFaceM4/idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b)
- - [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
- - [meta-llama/Llama-3.2-90B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct)
- - [tiiuae/falcon-11B-vlm](https://huggingface.co/tiiuae/falcon-11B-vlm)
- - [google/paligemma-3b-mix-224](https://huggingface.co/google/paligemma-3b-mix-224)
+## Inference with mixed precision (BF16)

-### Inference with BF16
-
-To run Salesforce/blip-image-captioning-large inference, use the following command:
-```bash
-python3 run_pipeline.py \
- --model_name_or_path Salesforce/blip-image-captioning-large \
- --image_path "https://ankur3107.github.io/assets/images/image-captioning-example.png" \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-To run Llava-1.5-7b inference, use the following command:
-```bash
-python3 run_pipeline.py \
- --model_name_or_path llava-hf/llava-1.5-7b-hf \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-To run Llava-1.5-13b inference, use the following command:
-```bash
-python3 run_pipeline.py \
- --model_name_or_path llava-hf/llava-1.5-13b-hf \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-To run Llava-v1.6-mistral-7b inference, use the following command:
-```bash
-python3 run_pipeline.py \
- --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-To run Llava-v1.6-vicuna-13b inference, use the following command:
-```bash
-python3 run_pipeline.py \
- --model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-To run Llava-hf/llava-v1.6-34b-hf inference, use the following command:
-```bash
-python3 run_pipeline.py \
- --model_name_or_path llava-hf/llava-v1.6-34b-hf \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-To run google/paligemma-3b-mix-224 inference, use the following command:
-```bash
-python3 run_pipeline.py \
- --model_name_or_path google/paligemma-3b-mix-224 \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-To run Llava-hf/llama3-llava-next-8b-hf inference, use the following command:
-```bash
-python3 run_pipeline.py \
- --model_name_or_path llava-hf/llama3-llava-next-8b-hf \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-To run idefics2 inference, use the following command:
-
-```bash
-python3 run_pipeline.py \
- --model_name_or_path HuggingFaceM4/idefics2-8b \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-To run mllama inference using reduced precision in the SDPA, use the following command:
+### Single-card inference with BF16
+To run Llama inference with SDPA, use the following command:

 ```bash
 python3 run_pipeline.py \
@@ -130,55 +31,30 @@ python3 run_pipeline.py \
 --bf16 \
 --sdp_on_bf16
 ```
+> SDPA may introduce [reduced precision](https://pytorch.org/docs/stable/notes/numerical_accuracy.html#reduced-precision-reduction-for-fp16-and-bf16-in-scaled-dot-product-attention-sdpa).

-### Inference with FP8
-Inference for Llava-1.5-7b, Llava-1.5-13b, Llava-v1.6-mistral-7b and Llava-v1.6-vicuna-13b in FP8 precision are enabled using [Intel Neural Compressor (INC)](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html), which provides model measurement and quantization capabilities in PyTorch.
-More information on enabling FP8 in SynapseAI is available here:
-https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html
+### Multi-card inference with BF16

-Here is an example to measure the tensor quantization statistics on Llava-1.5-7b:
-```bash
-QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
- --model_name_or_path llava-hf/llava-1.5-7b-hf \
- --image_path "https://llava-vl.github.io/static/images/view.jpg" \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-Here is an example to quantize the model based on previous measurements for Llava-1.5-7b:
+Use the following command to run Llama-3.2-90B-Vision-Instruct BF16 inference with FusedSDPA on 8 HPUs:
 ```bash
-QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python run_pipeline.py \
- --model_name_or_path llava-hf/llava-1.5-7b-hf \
+PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
+ --model_name_or_path meta-llama/Llama-3.2-90B-Vision-Instruct \
 --image_path "https://llava-vl.github.io/static/images/view.jpg" \
 --use_hpu_graphs \
 --bf16 \
- --sdp_on_bf16
+ --use_flash_attention \
+ --flash_attention_recompute
 ```
+## Inference with FP8

-Here is an example to measure the tensor quantization statistics on Llava-v1.6-mistral-7b:
-```bash
-QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
- --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
- --image_path "https://llava-vl.github.io/static/images/view.jpg" \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
-
-Here is an example to quantize the model based on previous measurements for Llava-v1.6-mistral-7b:
-```bash
-QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python run_pipeline.py \
- --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
- --image_path "https://llava-vl.github.io/static/images/view.jpg" \
- --use_hpu_graphs \
- --bf16 \
- --sdp_on_bf16
-```
+Inference with FP8 precision is enabled using [Intel Neural Compressor (INC)](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/index.html?highlight=inc), which provides model measurement and quantization capabilities in PyTorch.
+More information on enabling FP8 in SynapseAI is available here:
+[Run Inference Using FP8](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html?highlight=fp8)

-Here is an example to measure the tensor quantization statistics on Llava-v1.6-vicuna-13b:
+### Single-card inference with FP8
+Here is an example to measure the tensor quantization statistics on Llava-v1.6-vicuna-13b with SDPA:
 ```bash
 QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
 --model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
@@ -188,7 +64,7 @@ QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
 --sdp_on_bf16
 ```

-Here is an example to quantize the model based on previous measurements for Llava-v1.6-vicuna-13b:
+Here is an example to quantize the model based on previous measurements for Llava-v1.6-vicuna-13b with SDPA:
 ```bash
 QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python run_pipeline.py \
 --model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
@@ -198,25 +74,10 @@ QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python r
 --sdp_on_bf16
 ```

-### Inference with FusedSDPA
-
-Habana FusedSDPA is a fused and optimized implementation of torch.nn.functional.scaled_dot_product_attention() for Gaudi. For more details, refer to [Gaudi online documentation](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_PyTorch_Models.html?highlight=fusedsdpa#using-fused-scaled-dot-product-attention-fusedsdpa).
-
-Use the following command to run Llava-1.5-7b BF16 inference with FusedSDPA
-```bash
-python3 run_pipeline.py \
- --model_name_or_path llava-hf/llava-1.5-7b-hf \
- --image_path "https://llava-vl.github.io/static/images/view.jpg" \
- --use_hpu_graphs \
- --bf16 \
- --use_flash_attention \
- --flash_attention_recompute
-```
-
-
-Use the following command to run Llava-v1.6-mistral-7b BF16 inference with FusedSDPA
+### Multi-card inference with FP8
+Here is an example of measuring the tensor quantization statistics on Llava-v1.6-mistral-7b with FusedSDPA on 8 HPUs:
 ```bash
-python3 run_pipeline.py \
+QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
 --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
 --image_path "https://llava-vl.github.io/static/images/view.jpg" \
 --use_hpu_graphs \
 --bf16 \
 --use_flash_attention \
 --flash_attention_recompute
 ```

-
-Use the following commands to run Llava-v1.6-mistral-7b FP8 inference with FusedSDPA
-
-Here is an example of measuring the tensor quantization statistics on Llava-v1.6-mistral-7b:
+Here is an example of quantizing the model based on previous measurements for Llava-v1.6-mistral-7b with FusedSDPA on 8 HPUs:
 ```bash
-QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
+QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
 --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
 --image_path "https://llava-vl.github.io/static/images/view.jpg" \
 --use_hpu_graphs \
 --bf16 \
 --use_flash_attention \
 --flash_attention_recompute
 ```

-Here is an example of quantizing the model based on previous measurements for Llava-v1.6-mistral-7b:
-```bash
-QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python run_pipeline.py \
- --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
- --image_path "https://llava-vl.github.io/static/images/view.jpg" \
- --use_hpu_graphs \
- --bf16 \
- --use_flash_attention \
- --flash_attention_recompute
-```
 ## LORA Finetune
-To run LoRA finetuning, you can use `run_image2text_lora_finetune.py`.
-Here are single-/multi-device command examples for HuggingFaceM4/idefics2-8b.
-
-```bash
-python3 run_image2text_lora_finetune.py \
- --model_name_or_path HuggingFaceM4/idefics2-8b \
- --dataset_name nielsr/docvqa_1200_examples \
- --bf16 True \
- --output_dir ./model_lora_llama \
- --num_train_epochs 1 \
- --per_device_train_batch_size 2 \
- --per_device_eval_batch_size 2 \
- --gradient_accumulation_steps 8 \
- --weight_decay 0.01 \
- --logging_steps 25 \
- --eval_strategy "no" \
- --save_strategy "no" \
- --learning_rate 5e-5 \
- --warmup_steps 50 \
- --lr_scheduler_type "constant" \
- --input_column_names 'image' 'query' \
- --output_column_names 'answers' \
- --remove_unused_columns False \
- --do_train \
- --do_eval \
- --use_habana \
- --use_lazy_mode \
- --lora_rank=8 \
- --lora_alpha=8 \
- --lora_dropout=0.1 \
- --max_seq_length=512 \
- --use_hpu_graphs_for_inference \
- --low_cpu_mem_usage True \
- --lora_target_modules '.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$'
-```
-
-```bash
-python3 ../gaudi_spawn.py \
- --world_size 8 --use_mpi run_image2text_lora_finetune.py \
- --model_name_or_path HuggingFaceM4/idefics2-8b \
- --dataset_name nielsr/docvqa_1200_examples \
- --bf16 True \
- --output_dir ./model_lora_llama \
- --num_train_epochs 1 \
- --per_device_train_batch_size 2 \
- --per_device_eval_batch_size 2 \
- --gradient_accumulation_steps 8 \
- --weight_decay 0.01 \
- --logging_steps 25 \
- --eval_strategy "no" \
- --save_strategy "no" \
- --learning_rate 5e-5 \
- --warmup_steps 50 \
- --lr_scheduler_type "constant" \
- --input_column_names 'image' 'query' \
- --output_column_names 'answers' \
- --remove_unused_columns False \
- --do_train \
- --do_eval \
- --use_habana \
- --use_lazy_mode \
- --lora_rank=8 \
- --lora_alpha=8 \
- --lora_dropout=0.1 \
- --max_seq_length=512 \
- --use_hpu_graphs_for_inference \
- --low_cpu_mem_usage True \
- --lora_target_modules '".*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$"'
-```
-
 Here are single-/multi-device command examples for meta-llama/Llama-3.2-11B-Vision-Instruct.
 ```bash
@@ -390,54 +168,6 @@ python3 ../gaudi_spawn.py \
 --world_size 8 --use_mpi run_image2text_lora_finetune.py \
 --model_name_or_path meta-llama/Llama-3.2-11B-Vision-Instruct \
 --dataset_name nielsr/docvqa_1200_examples \
 --bf16 True \
 --output_dir ./model_lora_llama \
 --num_train_epochs 2 \
 --per_device_train_batch_size 2 \
 --per_device_eval_batch_size 2 \
 --gradient_accumulation_steps 8 \
 --weight_decay 0.01 \
 --logging_steps 25 \
 --eval_strategy "no" \
 --save_strategy "no" \
 --learning_rate 5e-5 \
 --warmup_steps 50 \
 --lr_scheduler_type "constant" \
 --input_column_names 'image' 'query' \
 --output_column_names 'answers' \
 --remove_unused_columns False \
 --do_train \
 --do_eval \
 --use_habana \
 --use_lazy_mode \
 --lora_rank=8 \
 --lora_alpha=8 \
 --lora_dropout=0.1 \
 --max_seq_length=512 \
 --use_hpu_graphs_for_inference \
 --low_cpu_mem_usage True \
 --lora_target_modules '".*(language_model).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$"'
 ```

-## Multi-HPU inference
-
-### BF16 Inference with FusedSDPA on 8 HPUs
-
-Use the following commands to run Llava-v1.6-mistral-7b BF16 inference with FusedSDPA on 8 HPUs:
-```bash
-python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
- --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
- --image_path "https://llava-vl.github.io/static/images/view.jpg" \
- --use_hpu_graphs \
- --bf16 \
- --use_flash_attention \
- --flash_attention_recompute
-```
-
-Use the following commands to run Llama-3.2-90B-Vision-Instruct BF16 inference with FusedSDPA on 8 HPUs:
-```bash
-PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
- --model_name_or_path meta-llama/Llama-3.2-90B-Vision-Instruct \
- --image_path "https://llava-vl.github.io/static/images/view.jpg" \
- --use_hpu_graphs \
- --bf16 \
- --use_flash_attention \
- --flash_attention_recompute
-```
-
-
-### FP8 Inference with FusedSDPA on 8 HPUs
-
-Use the following commands to run Llava-v1.6-mistral-7b FP8 inference with FusedSDPA on 8 HPUs.
-Here is an example of measuring the tensor quantization statistics on Llava-v1.6-mistral-7b on 8 HPUs:
-```bash
-QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
- --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
- --image_path "https://llava-vl.github.io/static/images/view.jpg" \
- --use_hpu_graphs \
- --bf16 \
- --use_flash_attention \
- --flash_attention_recompute
-```
-
-Here is an example of quantizing the model based on previous measurements for Llava-v1.6-mistral-7b on 8 HPUs:
-```bash
-QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
- --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
- --image_path "https://llava-vl.github.io/static/images/view.jpg" \
- --use_hpu_graphs \
- --bf16 \
- --use_flash_attention \
- --flash_attention_recompute
-```
+> For other models, adjust the training parameters and `lora_target_modules` accordingly. For example, for HuggingFaceM4/idefics2-8b,
+> replace `lora_target_modules` with:
+> '".*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$"'
diff --git a/examples/text-feature-extraction/README.md b/examples/text-feature-extraction/README.md
index 2b0d5354ef..e46168840b 100644
--- a/examples/text-feature-extraction/README.md
+++ b/examples/text-feature-extraction/README.md
@@ -31,10 +31,3 @@ python run_feature_extraction.py \
 --sdp_on_bf16 \
 --bf16
 ```
-
-Models that have been validated:
-
-- [Supabase/gte-small](https://huggingface.co/Supabase/gte-small)
-- [thenlper/gte-small](https://huggingface.co/thenlper/gte-small)
-- [thenlper/gte-base](https://huggingface.co/thenlper/gte-base)
-- [thenlper/gte-large](https://huggingface.co/thenlper/gte-large)