do_sample=False for NPU in chat_sample, add NPU to README
helena-intel committed Jan 28, 2025
1 parent 3b016df commit 3299ac7
Showing 3 changed files with 20 additions and 1 deletion.
13 changes: 12 additions & 1 deletion samples/cpp/text_generation/README.md
@@ -19,7 +19,7 @@
```sh
optimum-cli export openvino --model <model> <output_folder>
```
If a converted model in OpenVINO IR format is already available in the collection of [OpenVINO optimized LLMs](https://huggingface.co/collections/OpenVINO/llm-6687aaa2abca3bbcec71a9bd) on Hugging Face, it can be downloaded directly via huggingface-cli.
```sh
-pip install --upgrade-strategy eager -r ../../export-requirements.txt
+pip install huggingface-hub
huggingface-cli download <model> --local-dir <output_folder>
```
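For reference, the same download can be scripted with the `huggingface_hub` Python API instead of the CLI. A minimal sketch (the repo id and folder below are illustrative placeholders for `<model>` and `<output_folder>`):

```python
from huggingface_hub import snapshot_download

# Download a converted model from the OpenVINO collection on Hugging Face.
# Repo id and local_dir are illustrative; substitute your own values.
snapshot_download(
    repo_id="OpenVINO/Phi-3-mini-4k-instruct-int4-ov",
    local_dir="Phi-3-mini-4k-instruct-int4-ov",
)
```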

@@ -54,6 +54,17 @@ The following template can be used as a default, but it may not work properly with every model:
```
"chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}",
```
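If a model ships without a chat template, one way to apply the default above is to write it into the model folder's `tokenizer_config.json`. A minimal sketch under that assumption (the path is a placeholder, and the file's exact location may differ per model):

```python
import json
from pathlib import Path

# Assumed location of the tokenizer config inside the exported model folder.
path = Path("<output_folder>") / "tokenizer_config.json"

config = json.loads(path.read_text())
# The "\n" escapes below become real newlines in Python and are re-escaped
# by json.dumps when the file is written back.
config["chat_template"] = (
    "{% for message in messages %}{% if (message['role'] == 'user') %}"
    "{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}"
    "{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}"
    "{% endif %}{% endfor %}"
)
path.write_text(json.dumps(config, indent=2))
```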

#### NPU support

The NPU device is supported with some limitations. See the [NPU inference of LLMs](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html) documentation. In particular:

- Models must be exported with symmetric INT4 quantization (`optimum-cli export openvino --weight-format int4 --sym --model <model> <output_folder>`).
  For models with more than 4B parameters, channel-wise quantization should be used (`--group-size -1`).
- Only greedy search is supported (`do_sample` must be set to False; see the sketch after this list).
- Use OpenVINO 2024.6 or later, and the latest NPU driver.

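Putting these constraints together, a minimal Python sketch of the NPU flow (assuming a model already exported with the symmetric INT4 command above; the path is a placeholder):

```python
import openvino_genai

# The model must have been exported with symmetric INT4 quantization, e.g.:
#   optimum-cli export openvino --weight-format int4 --sym --model <model> <output_folder>
pipe = openvino_genai.LLMPipeline("<output_folder>", "NPU")

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100
config.do_sample = False  # NPU only supports greedy search

print(pipe.generate("What is OpenVINO?", config))
```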

### 2. Greedy Causal LM (`greedy_causal_lm`)
- **Description:**
Basic text generation using a causal language model.
5 changes: 5 additions & 0 deletions samples/cpp/text_generation/chat_sample.cpp
@@ -15,6 +15,11 @@ int main(int argc, char* argv[]) try {

ov::genai::GenerationConfig config;
config.max_new_tokens = 100;
// do_sample must be set to false for NPU because NPU only supports greedy search
if (device == "NPU") {
config.do_sample = false;
}

std::function<bool(std::string)> streamer = [](std::string word) {
std::cout << word << std::flush;
// The returned flag indicates whether generation should be stopped.
3 changes: 3 additions & 0 deletions samples/python/text_generation/chat_sample.py
@@ -23,6 +23,9 @@ def main():

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100
# do_sample must be set to False for NPU because NPU only supports greedy search
if device == 'NPU':
config.do_sample = False

pipe.start_chat()
while True:
