[Snippets][CPU] Disable MHA tokenization in LLM #28601

Open

a-sidorova wants to merge 2 commits into master from feature/snippets/disable_mha_token_in_llm

Conversation

@a-sidorova (Contributor) commented Jan 22, 2025

Details:

  • The second inference request in an LLM is usually a single-token inference, which means the M dimension of the MatMuls in the SDPA pattern will have the value 1 (during model compilation this dimension is dynamic (unknown)). Snippets cannot provide efficient execution for single-token inference, so we decided to disable MHA tokenization by Snippets in the CPU plugin for LLMs. We treat the presence of the ScaledDotProductAttentionWithKVCache op in the model as a sign that the model is an LLM; a minimal sketch of such a check is shown below.
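As an illustration (not the exact PR code), the check can be expressed with the existing ov::op::util::has_op_with_type utility; the header path for the CPU plugin's internal ScaledDotProductAttentionWithKVCache op is an assumption here:

```cpp
#include <memory>

#include "openvino/core/model.hpp"
#include "transformations/cpu_opset/common/op/sdpa.hpp"  // ov::intel_cpu::ScaledDotProductAttentionWithKVCache (assumed path)
#include "transformations/utils/utils.hpp"               // ov::op::util::has_op_with_type

// If the model contains a KV-cache SDPA op, we assume it is an LLM and skip
// MHA tokenization by Snippets for it.
bool should_tokenize_mha(const std::shared_ptr<const ov::Model>& model) {
    return !ov::op::util::has_op_with_type<ov::intel_cpu::ScaledDotProductAttentionWithKVCache>(model);
}
```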

Tickets:

  • 160634
  • 160978

TODO:

  • Performance validation on LLMs (the results are in the ticket CVS-160978)

@a-sidorova requested review from a team as code owners January 22, 2025 07:01
@github-actions bot added the category: CPU OpenVINO CPU plugin label Jan 22, 2025
@a-sidorova added this to the 2025.1 milestone Jan 22, 2025
@dmitry-gorokhov self-assigned this Jan 22, 2025
@a-sidorova force-pushed the feature/snippets/disable_mha_token_in_llm branch 2 times, most recently from 0e62943 to 08aeea7 on January 22, 2025 09:04
Comment on lines 1040 to 1042
// Note: the variable `ops` should not exist during `SnippetsTokenization` execution.
// Otherwise, it will extend the life time of ops (since they're stored as shared ptrs) and
// they will be visible in the model during the tokenization passes even after removing or replacing.
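For context, the pitfall this comment describes can be reduced to a short sketch (the function and surrounding logic are hypothetical):

```cpp
#include <memory>

#include "openvino/core/model.hpp"

// Hypothetical illustration: get_ops() returns shared_ptrs, so a local copy
// keeps every node alive for as long as the container exists.
void check_before_tokenization(const std::shared_ptr<ov::Model>& model) {
    auto ops = model->get_ops();  // std::vector<std::shared_ptr<ov::Node>>
    // ... inspect `ops` to decide whether to tokenize ...
    // If a tokenization pass ran here, nodes it removed or replaced would stay
    // alive (and reachable through `ops`) until `ops` goes out of scope.
}
```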
Contributor
Minor: this comment looks a bit confusing, to be honest. I had to read it a few times, but I still don't fully understand it.
It looks like you are comparing the present check with some other solution that could have been implemented (i.e. storing the ops somewhere and accessing them from the tokenization pass?). I am not sure we need to do this.

Contributor (Author)

This comment is outdated now since I use ov::op::util::has_op_with_type<OP>(model) in cad0554. Thanks!
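For reference, a rough sketch of what ov::op::util::has_op_with_type<OP>(model) does, based on the utility's behavior (not a verbatim copy of it):

```cpp
#include <memory>

#include "openvino/core/model.hpp"
#include "openvino/core/type.hpp"

// Walk the model's ops and report whether any node is of type OP. Nothing is
// stored beyond the loop, so node lifetimes are not extended.
template <typename OP>
bool has_op_with_type(const std::shared_ptr<const ov::Model>& model) {
    for (const auto& op : model->get_ops()) {
        if (ov::as_type_ptr<OP>(op))
            return true;
    }
    return false;
}
```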

@a-sidorova force-pushed the feature/snippets/disable_mha_token_in_llm branch from cad0554 to e432b62 on January 23, 2025 07:13
akladiev pushed a commit that referenced this pull request Jan 23, 2025
### Details:
- *The second inference request in an LLM is usually a single-token inference: the `M` dimension of the MatMuls in the SDPA pattern will have the value `1` (during model compilation this dimension is dynamic (unknown)). Snippets cannot provide efficient execution for single-token inference, so we decided to disable MHA tokenization by Snippets in the CPU plugin for LLMs. We treat the presence of the `ScaledDotProductAttentionWithKVCache` op in the model as a sign that the model is an LLM.*
- *Cherry-picked from #28601*

### Tickets:
 - *160634*
 - *160978 (contains performance validation results)*
@a-sidorova force-pushed the feature/snippets/disable_mha_token_in_llm branch from e432b62 to 8e13f0c on January 23, 2025 12:02