[Snippets][CPU] Disable MHA tokenization in LLM #28601
base: master
Conversation
Force-pushed from 0e62943 to 08aeea7
// Note: the variable `ops` should not exist during `SnippetsTokenization` execution.
// Otherwise, it will extend the lifetime of the ops (since they're stored as shared ptrs) and
// they will remain visible in the model during the tokenization passes even after being removed or replaced.
Minor: this comment looks a bit confusing, to be honest. I had to read it a few times and still don't fully understand it.
It reads as if you're comparing the present check with some other solution that could have been implemented (i.e., storing the ops somewhere and accessing them from the tokenization pass?). I'm not sure we need to do that.
This comment is outdated now since I use `ov::op::util::has_op_with_type<OP>(model)` in cad0554. Thanks!
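For readers of this thread, here is a minimal sketch of the pitfall the quoted comment describes and the on-demand query used instead. Everything in it is illustrative except `ov::op::util::has_op_with_type`, which the reply names; `MatMul` stands in for whatever op the pass actually matches:

```cpp
#include <memory>
#include <vector>

#include "openvino/core/model.hpp"
#include "openvino/core/type.hpp"
#include "openvino/op/matmul.hpp"
#include "transformations/utils/utils.hpp"

// Problematic pattern (what the original comment warns against): caching the
// matched nodes stores extra shared_ptr copies, so the nodes stay alive and
// remain reachable through `ops` even after a tokenization pass removes or
// replaces them in the model.
std::vector<std::shared_ptr<ov::Node>> collect_matmuls(const std::shared_ptr<ov::Model>& model) {
    std::vector<std::shared_ptr<ov::Node>> ops;
    for (const auto& op : model->get_ops()) {
        if (ov::is_type<ov::op::v0::MatMul>(op)) {
            ops.push_back(op);  // extends the node's lifetime beyond the pass
        }
    }
    return ops;
}

// The approach from the reply: query the model on demand, keeping no
// long-lived shared_ptr copies of the nodes.
bool model_has_matmul(const std::shared_ptr<const ov::Model>& model) {
    return ov::op::util::has_op_with_type<ov::op::v0::MatMul>(model);
}
```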
Force-pushed from cad0554 to e432b62
### Details:
- *The second inference in an LLM is usually a single-token inference. This means the `M` dimension of the MatMuls in the SDPA pattern will have the value `1` (during model compilation this dimension is dynamic, i.e. unknown). Snippets cannot provide efficient execution for single-token inference, so we decided to disable MHA tokenization by Snippets in the CPU plugin for LLMs. We consider the presence of a `ScaledDotProductAttentionWithKVCache` op in the model as a sign that the model is an LLM (see the sketch below).*
- *Cherry-picked from #28601*

### Tickets:
- *160634*
- *160978 (contains performance validation results)*
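A hedged sketch of the resulting check: the function name and the header path for the CPU-plugin op are assumptions; the op type, the `has_op_with_type` utility, and the detection idea come from the description above.

```cpp
#include <memory>

#include "openvino/core/model.hpp"
#include "transformations/utils/utils.hpp"
// Assumed header path for the CPU-plugin op that stateful KV-cache
// transformations insert into the model.
#include "transformations/cpu_opset/common/op/sdpa.hpp"

// After the prompt is processed, an LLM generates one token per step, so the
// `M` dimension of the SDPA MatMuls is 1 and Snippets-tokenized MHA kernels
// are not efficient for that shape. Treat the presence of
// ScaledDotProductAttentionWithKVCache as a sign of an LLM and skip MHA
// tokenization for such models.
bool should_skip_mha_tokenization(const std::shared_ptr<const ov::Model>& model) {
    return ov::op::util::has_op_with_type<ov::intel_cpu::ScaledDotProductAttentionWithKVCache>(model);
}
```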
Force-pushed from e432b62 to 8e13f0c