CB: preparation for relying on KV cache precisions from plugins #1634

ilya-lavrenov · 2025-01-27T13:48:57Z

Currently we have logic to detect the KV cache precision, and this logic is becoming more and more complex.
The idea is to rely on the plugin's logic and compile the PA model with ov::element::dynamic precisions for the KV cache inputs.
Later, take the ov::CompiledModel and extract the precisions from its inputs().
Then create tensors based on the computed num_kv_blocks, which depends on the KV cache precisions.

For now, the logic that mimics the plugin's logic for KV cache precisions is still here, but it will be dropped once plugins support ov::element::dynamic.
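
To illustrate why num_kv_blocks depends on the KV cache precisions, here is a minimal sketch in plain C++. It does not use the OpenVINO API: `LayerCacheInfo`, `block_size_in_bytes` and `compute_num_kv_blocks` are hypothetical names introduced only for this example, and the per-element byte sizes stand in for whatever the compiled model's inputs would report.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one layer's KV cache description. In the PR the
// shapes and precisions are read from the compiled model's inputs instead.
struct LayerCacheInfo {
    std::size_t key_elems_per_block;    // number of elements in one key block
    std::size_t value_elems_per_block;  // number of elements in one value block
    std::size_t key_elem_bytes;         // bytes per key element, e.g. 2 for f16
    std::size_t value_elem_bytes;       // bytes per value element
};

// Bytes needed to store one block across all layers (key plus value).
inline std::size_t block_size_in_bytes(const std::vector<LayerCacheInfo>& layers) {
    std::size_t total = 0;
    for (const auto& layer : layers) {
        total += layer.key_elems_per_block * layer.key_elem_bytes;
        total += layer.value_elems_per_block * layer.value_elem_bytes;
    }
    return total;
}

// Number of whole blocks that fit into the given byte budget.
inline std::size_t compute_num_kv_blocks(std::size_t cache_size_bytes,
                                         const std::vector<LayerCacheInfo>& layers) {
    const std::size_t per_block = block_size_in_bytes(layers);
    assert(per_block > 0 && "precisions must be known before sizing the cache");
    return cache_size_bytes / per_block;
}
```

With the same byte budget, halving the element size doubles the number of blocks that fit, which is why the precisions have to be known before the tensors are created.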

ilya-lavrenov · 2025-01-27T14:08:41Z

src/cpp/src/utils/paged_attention_transformations.cpp

+
+        // allow a plugin to automatically set KV cache precisions
+        k->set_element_type(ov::element::undefined);
+        v->set_element_type(ov::element::undefined);


@luo-cheng2021 @sshlyapn
Here I make the KV cache precisions in the original PA model dynamic, because CB wants to rely on the same logic for KV cache precisions that plugins apply in the SDPA case.

The rest of the changes serve to respect the precisions of the ov::InferRequest produced by plugins.

Could you please create branches for CPU / GPU in the OV repo to comply with these changes?

ilya-lavrenov · 2025-01-27T14:10:11Z

src/cpp/src/continuous_batching_impl.cpp

    ov::CompiledModel compiled_model;

+    // TODO: remove once plugin automatically set KV cache precisions
+    apply_kv_cache_precision(model, device_config.get_device(), properties);


@luo-cheng2021 @sshlyapn
This is the current workaround (WA) on this branch: it overrides those dynamic precisions with the values guessed within apply_kv_cache_precision.

The expectation is that once CPU / GPU can compile dynamic precisions and set the proper types (as in SDPA), we can drop this function entirely.

popovaan · 2025-01-28T14:58:31Z

src/cpp/src/cache_manager.hpp

+                auto key_size = ov::shape_size(key_cache_shape) * key_precision.size();
+                auto value_size = ov::shape_size(value_cache_shape) * value_precision.size();

+                ov::Tensor key_cache(key_precision, key_cache_shape, TensorMmapAllocator(key_size));
+                ov::Tensor value_cache(value_precision, value_cache_shape, TensorMmapAllocator(value_size));


Do we still need to allocate zero-filled memory on Linux? If not, this code can be removed, as TensorMmapAllocator was only needed as a replacement for memset().

ilya-lavrenov · 2025-01-28

Yes. It can be dropped after openvinotoolkit/openvino#28681 is merged.
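
For context on the zero-filled memory question, below is a minimal sketch of how an anonymous mmap on Linux returns zeroed pages without an explicit memset(). This is an assumption about the general mechanism only; it is not the actual TensorMmapAllocator code, and `allocate_zeroed` and `release_zeroed` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Maps `size` bytes of anonymous private memory (Linux).
// The kernel provides such pages zero-filled, so no memset() is needed.
// Returns nullptr on failure.
inline void* allocate_zeroed(std::size_t size) {
    void* ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return ptr == MAP_FAILED ? nullptr : ptr;
}

// Unmaps memory previously returned by allocate_zeroed with the same size.
inline void release_zeroed(void* ptr, std::size_t size) {
    if (ptr != nullptr) {
        [[maybe_unused]] const int rc = munmap(ptr, size);
        assert(rc == 0);
    }
}
```

The size passed to `release_zeroed` must match the size used for the mapping, which is why the allocator in the diff is constructed with the tensor's byte size.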

iefode · 2025-01-29T07:23:15Z

src/cpp/src/cache_manager.hpp

+            for (auto & name : input.get_names()) {
+                auto cache_precision = input.get_element_type();
+
+                if (name.find("key_cache.") == 0) {


Minor:
This prefix could be defined as a global constant, since it is used in multiple places in the project.
The same goes for "value_cache."
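
A minimal sketch of what this suggestion could look like, in plain C++17. The constant and function names (`kKeyCachePrefix`, `kValueCachePrefix`, `has_prefix`, `strip_prefix`, `classify_input`) are hypothetical and are not taken from the PR.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical shared constants for the KV cache input name prefixes.
inline constexpr std::string_view kKeyCachePrefix = "key_cache.";
inline constexpr std::string_view kValueCachePrefix = "value_cache.";

// True when `name` begins with `prefix`. Equivalent to name.find(prefix) == 0,
// but it only inspects the first prefix.size() characters.
inline bool has_prefix(std::string_view name, std::string_view prefix) {
    return name.size() >= prefix.size() &&
           name.compare(0, prefix.size(), prefix) == 0;
}

// Returns the part of `name` after `prefix`; callers must check has_prefix first.
inline std::string_view strip_prefix(std::string_view name, std::string_view prefix) {
    assert(has_prefix(name, prefix));
    return name.substr(prefix.size());
}

enum class CacheKind { Key, Value, None };

// Classifies a model input name as key cache, value cache, or neither.
inline CacheKind classify_input(std::string_view name) {
    if (has_prefix(name, kKeyCachePrefix)) return CacheKind::Key;
    if (has_prefix(name, kValueCachePrefix)) return CacheKind::Value;
    return CacheKind::None;
}
```

Keeping both prefixes in one place means every call site that matches cache inputs by name stays consistent if the naming ever changes.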

iefode · 2025-01-29T07:26:55Z

src/cpp/src/cache_manager.hpp

+                    auto pshape = patch_shape(device_config.get_key_cache_shape(kv_input_index), cache_precision);
+                    m_key_shapes.push_back(pshape);
+                    m_key_precisions.push_back(cache_precision);
+                    m_block_size_in_bytes += pshape[1].get_length() * pshape[2].get_length() * pshape[3].get_length() * cache_precision.size();


Minor:

    size_t get_block_size_in_bytes(const ov::PartialShape& pshape, const ov::element::Type& type) {
        return pshape[1].get_length() * pshape[2].get_length() * pshape[3].get_length() * type.size();
    }
    ...
    m_block_size_in_bytes += get_block_size_in_bytes(pshape, cache_precision); // for key and value

ilya-lavrenov marked this pull request as draft January 27, 2025 13:49

ilya-lavrenov force-pushed the reorg-kv-cache-precision branch from 698b2b1 to 9fce60a Compare January 27, 2025 13:52

github-actions bot removed the category: visual language, category: whisper, and category: tokenizers labels Jan 27, 2025

ilya-lavrenov force-pushed the reorg-kv-cache-precision branch 5 times, most recently from b491aa9 to b380ccc Compare January 27, 2025 14:06

ilya-lavrenov commented Jan 27, 2025

ilya-lavrenov force-pushed the reorg-kv-cache-precision branch from b380ccc to cc8ea52 Compare January 27, 2025 14:11

ilya-lavrenov added this to the 2025.1 milestone Jan 27, 2025

ilya-lavrenov self-assigned this Jan 27, 2025

ilya-lavrenov requested review from sshlyapn and luo-cheng2021 January 27, 2025 22:02

github-actions bot added the category: GHA label Jan 27, 2025

ilya-lavrenov force-pushed the reorg-kv-cache-precision branch 4 times, most recently from 6aaf043 to 007c29c Compare January 28, 2025 12:39

ilya-lavrenov changed the title from "CB: rely on KV cache precisions from plugins" to "CB: preparation for relying on KV cache precisions from plugins" Jan 28, 2025

ilya-lavrenov marked this pull request as ready for review January 28, 2025 12:39

ilya-lavrenov assigned iefode and popovaan and unassigned ilya-lavrenov Jan 28, 2025

popovaan reviewed Jan 28, 2025

ilya-lavrenov added 2 commits January 28, 2025 19:12

CB: rely on KV cache precisions from plugins

831f88f

Fix

7802247

ilya-lavrenov force-pushed the reorg-kv-cache-precision branch from d59bd18 to 86d2cbf Compare January 28, 2025 18:13

Fix for Windows

65936c4

ilya-lavrenov force-pushed the reorg-kv-cache-precision branch from 86d2cbf to 65936c4 Compare January 28, 2025 18:14

iefode reviewed Jan 29, 2025

popovaan approved these changes Jan 29, 2025

ilya-lavrenov added this pull request to the merge queue Jan 29, 2025

ilya-lavrenov removed this pull request from the merge queue due to a manual request Jan 29, 2025

ilya-lavrenov merged commit 5cbadd1 into openvinotoolkit:master Jan 29, 2025
62 checks passed

ilya-lavrenov deleted the reorg-kv-cache-precision branch January 29, 2025 08:42
