do_sample=False for NPU in chat_sample, add NPU to README
helena-intel committed Jan 28, 2025
1 parent 3b016df commit 3299ac7
Showing 3 changed files with 20 additions and 1 deletion.
13 changes: 12 additions & 1 deletion samples/cpp/text_generation/README.md
@@ -19,7 +19,7 @@
```sh
optimum-cli export openvino --model <model> <output_folder>
```
If a converted model in OpenVINO IR format is already available in the collection of [OpenVINO optimized LLMs](https://huggingface.co/collections/OpenVINO/llm-6687aaa2abca3bbcec71a9bd) on Hugging Face, it can be downloaded directly via huggingface-cli.
```sh
-pip install --upgrade-strategy eager -r ../../export-requirements.txt
+pip install huggingface-hub
huggingface-cli download <model> --local-dir <output_folder>
```
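For reference, the same download can be scripted with the `huggingface_hub` Python API instead of the CLI. A minimal sketch (the repo id and folder below are illustrative placeholders for `<model>` and `<output_folder>`):

```python
from huggingface_hub import snapshot_download

# Download a converted model from the OpenVINO collection on Hugging Face.
# Repo id and local_dir are illustrative; substitute your own values.
snapshot_download(
    repo_id="OpenVINO/Phi-3-mini-4k-instruct-int4-ov",
    local_dir="Phi-3-mini-4k-instruct-int4-ov",
)
```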

@@ -54,6 +54,17 @@ The following template can be used as a default, but it may not work properly with every model:
```
"chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}",
```
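If a model ships without a chat template, one way to apply the default above is to write it into the model folder's `tokenizer_config.json`. A minimal sketch under that assumption (the path is a placeholder, and the file's exact location may differ per model):

```python
import json
from pathlib import Path

# Assumed location of the tokenizer config inside the exported model folder.
path = Path("<output_folder>") / "tokenizer_config.json"

config = json.loads(path.read_text())
# The "\n" escapes below become real newlines in Python and are re-escaped
# by json.dumps when the file is written back.
config["chat_template"] = (
    "{% for message in messages %}{% if (message['role'] == 'user') %}"
    "{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}"
    "{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}"
    "{% endif %}{% endfor %}"
)
path.write_text(json.dumps(config, indent=2))
```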

#### NPU support

The NPU device is supported with some limitations. See the [NPU inference of LLMs](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html) documentation. In particular:

- Models must be exported with symmetric INT4 quantization (`optimum-cli export openvino --weight-format int4 --sym --model <model> <output_folder>`).
  For models with more than 4B parameters, channel-wise quantization should be used (`--group-size -1`).
- Only greedy search is supported (`do_sample` must be set to False; see the sketch after this list).
- Use OpenVINO 2024.6 or later, and the latest NPU driver.

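Putting these constraints together, a minimal Python sketch of the NPU flow (assuming a model already exported with the symmetric INT4 command above; the path is a placeholder):

```python
import openvino_genai

# The model must have been exported with symmetric INT4 quantization, e.g.:
#   optimum-cli export openvino --weight-format int4 --sym --model <model> <output_folder>
pipe = openvino_genai.LLMPipeline("<output_folder>", "NPU")

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100
config.do_sample = False  # NPU only supports greedy search

print(pipe.generate("What is OpenVINO?", config))
```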

### 2. Greedy Causal LM (`greedy_causal_lm`)
- **Description:**
Basic text generation using a causal language model.
5 changes: 5 additions & 0 deletions samples/cpp/text_generation/chat_sample.cpp
@@ -15,6 +15,11 @@ int main(int argc, char* argv[]) try {

ov::genai::GenerationConfig config;
config.max_new_tokens = 100;
// do_sample must be set to false for NPU because NPU only supports greedy search
if (device == "NPU") {
config.do_sample = false;
}

std::function<bool(std::string)> streamer = [](std::string word) {
std::cout << word << std::flush;
// The returned flag indicates whether generation should be stopped.
3 changes: 3 additions & 0 deletions samples/python/text_generation/chat_sample.py
@@ -23,6 +23,9 @@ def main():

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100
# do_sample must be set to False for NPU because NPU only supports greedy search
if device == 'NPU':
config.do_sample = False

pipe.start_chat()
while True:
