Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CB: preparation for relying on KV cache precisions from plugins #1634

Merged

Conversation

ilya-lavrenov
Copy link
Contributor

@ilya-lavrenov ilya-lavrenov commented Jan 27, 2025

  • Currently we have logic to detect KV cache precision and this logic become more and more complex
  • The idea is to rely on plugin's logic and compiled PA model with ov::element::dynamic precisions for KV cache inputs.
  • Later, take ov::CompiledModel and extract precisions from its inputs()
  • Then create tensors based on computed num_kv_blocks which depends on KV cache precisions.

Currently, logic to mimic plugin's logic for KV cache precisions is still here, but will be dropped once plugin will support ov::element::dynamic

@github-actions github-actions bot added category: visual language Visual language pipeline category: continuous batching Continuous batching category: whisper Whisper pipeline category: speculative decoding Speculative decoding category: cmake / build Cmake scripts category: tokenizers Tokenizer class or submodule update no-match-files labels Jan 27, 2025
@ilya-lavrenov ilya-lavrenov marked this pull request as draft January 27, 2025 13:49
@ilya-lavrenov ilya-lavrenov force-pushed the reorg-kv-cache-precision branch from 698b2b1 to 9fce60a Compare January 27, 2025 13:52
@github-actions github-actions bot removed category: visual language Visual language pipeline category: whisper Whisper pipeline category: tokenizers Tokenizer class or submodule update labels Jan 27, 2025
@ilya-lavrenov ilya-lavrenov force-pushed the reorg-kv-cache-precision branch 5 times, most recently from b491aa9 to b380ccc Compare January 27, 2025 14:06

// allow a plugin to automatically set KV cache precisions
k->set_element_type(ov::element::undefined);
v->set_element_type(ov::element::undefined);
Copy link
Contributor Author

@ilya-lavrenov ilya-lavrenov Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luo-cheng2021 @sshlyapn
Here I make KV cache precisions in original PA model to be dynamic and CB wants to rely on the same logic for KV cache precisions as plugins make for SPDA case.

The rest of other changes serve to respect precisions for ov::InferRequest of produced by plugins

Could you please make branches for CPU / GPU in OV repo to comply with these changes?

ov::CompiledModel compiled_model;

// TODO: remove once plugin automatically set KV cache precisions
apply_kv_cache_precision(model, device_config.get_device(), properties);
Copy link
Contributor Author

@ilya-lavrenov ilya-lavrenov Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luo-cheng2021 @sshlyapn
This is current WA for this branch to override that dynamic precisions to values guessed within apply_kv_cache_precision

It's supposed that once CPU / GPU can compile dynamic precisions and set proper types (as in SPDA), we can drop this function at all.

@ilya-lavrenov ilya-lavrenov force-pushed the reorg-kv-cache-precision branch from b380ccc to cc8ea52 Compare January 27, 2025 14:11
@ilya-lavrenov ilya-lavrenov added this to the 2025.1 milestone Jan 27, 2025
@ilya-lavrenov ilya-lavrenov self-assigned this Jan 27, 2025
@github-actions github-actions bot added the category: GHA CI based on Github actions label Jan 27, 2025
@ilya-lavrenov ilya-lavrenov force-pushed the reorg-kv-cache-precision branch 4 times, most recently from 6aaf043 to 007c29c Compare January 28, 2025 12:39
@ilya-lavrenov ilya-lavrenov changed the title CB: rely on KV cache precisions from plugins CB: preparation for relying on KV cache precisions from plugins Jan 28, 2025
@ilya-lavrenov ilya-lavrenov marked this pull request as ready for review January 28, 2025 12:39
Comment on lines +202 to +167
auto key_size = ov::shape_size(key_cache_shape) * key_precision.size();
auto value_size = ov::shape_size(value_cache_shape) * value_precision.size();

ov::Tensor key_cache(key_precision, key_cache_shape, TensorMmapAllocator(key_size));
ov::Tensor value_cache(value_precision, value_cache_shape, TensorMmapAllocator(value_size));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need to allocate zero-filled memory for Linux? If not, this code can be removed, as TensorMmapAllocator was only needed as replacement for memset().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

It can be dropped after openvinotoolkit/openvino#28681 is merged

@ilya-lavrenov ilya-lavrenov force-pushed the reorg-kv-cache-precision branch from d59bd18 to 86d2cbf Compare January 28, 2025 18:13
@ilya-lavrenov ilya-lavrenov force-pushed the reorg-kv-cache-precision branch from 86d2cbf to 65936c4 Compare January 28, 2025 18:14
for (auto & name : input.get_names()) {
auto cache_precision = input.get_element_type();

if (name.find("key_cache.") == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor:
Potentially could be defined as global constant according multiple usage in the project.
The same for "value_cache."

auto pshape = patch_shape(device_config.get_key_cache_shape(kv_input_index), cache_precision);
m_key_shapes.push_back(pshape);
m_key_precisions.push_back(cache_precision);
m_block_size_in_bytes += pshape[1].get_length() * pshape[2].get_length() * pshape[3].get_length() * cache_precision.size();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor:

size_t get_block_size_in_bytes(const ov::PartialShare& pshape, const ov::element_type& type) {
    return pshape[1].get_length() * pshape[2].get_length() * pshape[3].get_length() * type.size();
}

...

m_block_size_in_bytes += get_block_size_in_bytes(pshape, cache_precision); // for key and value

@ilya-lavrenov ilya-lavrenov added this pull request to the merge queue Jan 29, 2025
@ilya-lavrenov ilya-lavrenov removed this pull request from the merge queue due to a manual request Jan 29, 2025
@ilya-lavrenov ilya-lavrenov merged commit 5cbadd1 into openvinotoolkit:master Jan 29, 2025
62 checks passed
@ilya-lavrenov ilya-lavrenov deleted the reorg-kv-cache-precision branch January 29, 2025 08:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants