CB: preparation for relying on KV cache precisions from plugins #1634
// allow a plugin to automatically set KV cache precisions
k->set_element_type(ov::element::undefined);
v->set_element_type(ov::element::undefined);
@luo-cheng2021 @sshlyapn
Here I make the KV cache precisions in the original PA model dynamic, so that CB can rely on the same KV cache precision logic that plugins apply in the SDPA case.
The rest of the changes serve to respect the precisions of the ov::InferRequest produced by plugins.
Could you please make branches for CPU / GPU in the OV repo to comply with these changes?
ov::CompiledModel compiled_model;

// TODO: remove once plugin automatically set KV cache precisions
apply_kv_cache_precision(model, device_config.get_device(), properties);
@luo-cheng2021 @sshlyapn
This is the current workaround for this branch: it overrides the dynamic precisions with the values guessed inside apply_kv_cache_precision.
Once CPU / GPU can compile models with dynamic precisions and set the proper types (as in SDPA), we can drop this function entirely.
auto key_size = ov::shape_size(key_cache_shape) * key_precision.size();
auto value_size = ov::shape_size(value_cache_shape) * value_precision.size();

ov::Tensor key_cache(key_precision, key_cache_shape, TensorMmapAllocator(key_size));
ov::Tensor value_cache(value_precision, value_cache_shape, TensorMmapAllocator(value_size));
Do we still need to allocate zero-filled memory on Linux? If not, this code can be removed, since TensorMmapAllocator was only needed as a replacement for memset().
Yes
It can be dropped after openvinotoolkit/openvino#28681 is merged
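For context on why an mmap-based allocator can stand in for memset(): on Linux, anonymous private mappings are zero-filled by the kernel, so no explicit clearing pass is needed. The sketch below illustrates only that property. The helper names allocate_zeroed and release are hypothetical, and this is not the project's TensorMmapAllocator.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Hypothetical helper (not TensorMmapAllocator): maps `size` bytes of
// anonymous private memory. On Linux such pages read as zero, so the
// caller does not need a separate memset().
inline void* allocate_zeroed(std::size_t size) {
    assert(size > 0);
    void* ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return ptr == MAP_FAILED ? nullptr : ptr;
}

// Unmaps a region previously returned by allocate_zeroed().
inline void release(void* ptr, std::size_t size) {
    if (ptr != nullptr) {
        munmap(ptr, size);
    }
}
```

A caller would pair allocate_zeroed(n) with release(ptr, n) using the same size.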
for (auto & name : input.get_names()) {
    auto cache_precision = input.get_element_type();

    if (name.find("key_cache.") == 0) {
Minor:
This could be defined as a global constant, since it is used in multiple places in the project.
The same goes for "value_cache."
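One way to act on this suggestion, as a minimal sketch. The constant and helper names below are hypothetical and do not come from the project. The sketch also swaps name.find(...) == 0 for a prefix-only comparison, which avoids scanning the rest of the string when the prefix does not match.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical names, for illustration only.
inline constexpr std::string_view KEY_CACHE_PREFIX = "key_cache.";
inline constexpr std::string_view VALUE_CACHE_PREFIX = "value_cache.";

// Returns true when `name` begins with `prefix`.
inline bool has_prefix(std::string_view name, std::string_view prefix) {
    return name.size() >= prefix.size() &&
           name.compare(0, prefix.size(), prefix) == 0;
}

// True for input names such as "key_cache.0".
inline bool is_key_cache_input(std::string_view name) {
    return has_prefix(name, KEY_CACHE_PREFIX);
}

// True for input names such as "value_cache.0".
inline bool is_value_cache_input(std::string_view name) {
    return has_prefix(name, VALUE_CACHE_PREFIX);
}
```

With these in place, the loop over input.get_names() could call is_key_cache_input(name) instead of repeating the string literal.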
auto pshape = patch_shape(device_config.get_key_cache_shape(kv_input_index), cache_precision);
m_key_shapes.push_back(pshape);
m_key_precisions.push_back(cache_precision);
m_block_size_in_bytes += pshape[1].get_length() * pshape[2].get_length() * pshape[3].get_length() * cache_precision.size();
Minor:
size_t get_block_size_in_bytes(const ov::PartialShape& pshape, const ov::element::Type& type) {
    return pshape[1].get_length() * pshape[2].get_length() * pshape[3].get_length() * type.size();
}
...
m_block_size_in_bytes += get_block_size_in_bytes(pshape, cache_precision); // for key and value
ov::element::dynamic precisions for KV cache inputs.
ov::CompiledModel and extract precisions from its inputs()
num_kv_blocks which depends on KV cache precisions.
Currently, the logic that mimics the plugin's choice of KV cache precisions is still here, but it will be dropped once the plugin supports ov::element::dynamic
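To illustrate why num_kv_blocks depends on the KV cache precisions, here is a rough sketch that does not use OpenVINO types. The CacheInput struct, compute_num_kv_blocks, and the fixed byte budget are assumptions made for this example. The real code works on ov::PartialShape and may patch shapes per precision, as the patch_shape call in the diff suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one key or value cache input:
// shape is [num_blocks, d1, d2, d3]; element_size is bytes per element.
struct CacheInput {
    std::vector<std::size_t> shape;
    std::size_t element_size;
};

// Bytes one block occupies in this input: dims 1..3 times element size.
inline std::size_t block_size_in_bytes(const CacheInput& input) {
    assert(input.shape.size() == 4);
    return input.shape[1] * input.shape[2] * input.shape[3] * input.element_size;
}

// Sums the per-block bytes over all key and value inputs, then divides
// the cache budget by that total. Returns 0 when there are no inputs.
inline std::size_t compute_num_kv_blocks(const std::vector<CacheInput>& inputs,
                                         std::size_t cache_size_in_bytes) {
    std::size_t total_block_bytes = 0;
    for (const auto& input : inputs) {
        total_block_bytes += block_size_in_bytes(input);
    }
    return total_block_bytes == 0 ? 0 : cache_size_in_bytes / total_block_bytes;
}
```

Under these assumptions, halving the element size doubles the number of blocks that fit in the same budget, which is why the precisions have to be known (here, read back from the compiled model's inputs) before the block count is computed.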