[Snippets][CPU] Disable MHA tokenization in LLM #28601
base: master
Conversation
Force-pushed from 0e62943 to 08aeea7
// Note: the variable `ops` should not exist during `SnippetsTokenization` execution.
// Otherwise, it will extend the lifetime of the ops (since they're stored as shared ptrs) and
// they will remain visible in the model during the tokenization passes even after being removed or replaced.
Minor: this comment looks a bit confusing, to be honest. I had to read it a few times and still don't fully understand it.
It reads as if you're comparing the present check with some other solution that could have been implemented (i.e., storing the ops somewhere and accessing them from the tokenization pass?). I'm not sure we need to do that.
This comment is outdated now since I use `ov::op::util::has_op_with_type<OP>(model)` in cad0554. Thanks!
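For readers of this thread, here is a minimal sketch of the pitfall the quoted comment describes and the on-demand query used instead. Everything in it is illustrative except `ov::op::util::has_op_with_type`, which the reply names; `MatMul` stands in for whatever op the pass actually matches:

```cpp
#include <memory>
#include <vector>

#include "openvino/core/model.hpp"
#include "openvino/core/type.hpp"
#include "openvino/op/matmul.hpp"
#include "transformations/utils/utils.hpp"

// Problematic pattern (what the original comment warns against): caching the
// matched nodes stores extra shared_ptr copies, so the nodes stay alive and
// remain reachable through `ops` even after a tokenization pass removes or
// replaces them in the model.
std::vector<std::shared_ptr<ov::Node>> collect_matmuls(const std::shared_ptr<ov::Model>& model) {
    std::vector<std::shared_ptr<ov::Node>> ops;
    for (const auto& op : model->get_ops()) {
        if (ov::is_type<ov::op::v0::MatMul>(op)) {
            ops.push_back(op);  // extends the node's lifetime beyond the pass
        }
    }
    return ops;
}

// The approach from the reply: query the model on demand, keeping no
// long-lived shared_ptr copies of the nodes.
bool model_has_matmul(const std::shared_ptr<const ov::Model>& model) {
    return ov::op::util::has_op_with_type<ov::op::v0::MatMul>(model);
}
```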
Force-pushed from cad0554 to e432b62
### Details:
- *The second inference in an LLM is usually a single-token inference. This means the `M` dimension of the MatMuls in the SDPA pattern will have the value `1` (during model compilation this dimension is dynamic, i.e. unknown). Snippets cannot provide efficient execution for single-token inference, so we decided to disable MHA tokenization by Snippets in the CPU plugin for LLMs. We consider the presence of a `ScaledDotProductAttentionWithKVCache` op in the model as a sign that the model is an LLM (see the sketch below).*
- *Cherry-picked from #28601*

### Tickets:
- *160634*
- *160978 (contains performance validation results)*
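A hedged sketch of the resulting check: the function name and the header path for the CPU-plugin op are assumptions; the op type, the `has_op_with_type` utility, and the detection idea come from the description above.

```cpp
#include <memory>

#include "openvino/core/model.hpp"
#include "transformations/utils/utils.hpp"
// Assumed header path for the CPU-plugin op that stateful KV-cache
// transformations insert into the model.
#include "transformations/cpu_opset/common/op/sdpa.hpp"

// After the prompt is processed, an LLM generates one token per step, so the
// `M` dimension of the SDPA MatMuls is 1 and Snippets-tokenized MHA kernels
// are not efficient for that shape. Treat the presence of
// ScaledDotProductAttentionWithKVCache as a sign of an LLM and skip MHA
// tokenization for such models.
bool should_skip_mha_tokenization(const std::shared_ptr<const ov::Model>& model) {
    return ov::op::util::has_op_with_type<ov::intel_cpu::ScaledDotProductAttentionWithKVCache>(model);
}
```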
Force-pushed from e432b62 to 8e13f0c