
Fix cache config inconsistency between shortfin and sharktank #406

Closed · wants to merge 2 commits from the cache-config-consistency branch

Conversation

renxida (Contributor) commented on Nov 1, 2024

Fixes #405 and #401

Until now we have been patching the config to reorganize it after exporting from sharktank, but for Llama 3 8B (which uses grouped-query attention) the exported config doesn't specify the grouping, so this has to be fixed in the export itself.
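
To make the mismatch concrete, here is a minimal sketch of the shape the exported config needs so shortfin can size the paged KV cache. The key layout follows the diff further down; the numeric values are the published Llama 3 8B hyperparameters and are purely illustrative, not taken from this PR:

```python
# Illustrative shape only: key names follow the diff reviewed below; the
# numbers are the published Llama 3 8B values, used here as placeholders.
config = {
    "block_seq_stride": 16,
    "paged_kv_cache": {
        "block_seq_stride": 16,
        "attn_head_count_kv": 8,  # the grouping info the old export omitted
    },
}
```

With grouped-query attention the KV-head count differs from the query-head count, so without attn_head_count_kv in the config shortfin cannot size the cache correctly.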

Might be a good idea to rename all attn to attention for consistency. But I want this PR to just fix the issue at hand.

After this fix, the only edit still required after export is setting config.paged_kv_cache.device_block_count (see the sketch below).
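
For illustration, that remaining post-export step could be scripted like this; the file name and block count are placeholders, and only the key path comes from the sentence above:

```python
# Minimal sketch of the one remaining manual edit after export.
# "config.json" and 256 are illustrative placeholders.
import json

with open("config.json") as f:
    config = json.load(f)

config["paged_kv_cache"]["device_block_count"] = 256  # tune for your device

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```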

renxida requested review from rsuderman and stbaione on November 1, 2024 02:56
renxida force-pushed the cache-config-consistency branch from 754ae09 to 93c0d19 on November 1, 2024 03:34
renxida force-pushed the cache-config-consistency branch from 93c0d19 to cd309ac on November 1, 2024 03:47
renxida (Contributor, Author) commented on Nov 1, 2024

[Screenshot: successful run of the test script linked below.]

Proof it works.

Tested with this script: https://gist.github.com/renxida/2023ec7f9931c697879762c9330de001

renxida marked this pull request as ready for review on November 1, 2024 14:35
"block_seq_stride": llama_config.block_seq_stride,
"paged_kv_cache": {
"block_seq_stride": llama_config.block_seq_stride,
"attn_head_count_kv": hp.attention_head_count_kv,
Collaborator:

Just FYI, in sharktank's llm_configs.py, hp.attention_head_count_kv==hp.attention_head_count.
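
To spell out why that equality would matter here: with grouped-query attention the two counts intentionally differ, and the paged KV cache must be sized from the KV count. A minimal sketch using the published Llama 3 8B numbers (illustrative, not taken from this PR):

```python
# Published Llama 3 8B attention hyperparameters (illustrative values).
attention_head_count = 32     # query heads
attention_head_count_kv = 8   # KV heads, each shared by a group of query heads

# The paged KV cache stores only KV heads, so sizing it from
# attention_head_count would over-allocate by the group size.
group_size = attention_head_count // attention_head_count_kv
assert group_size == 4
```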

Collaborator:

If shortfin already requires the vmfb and irpa, is there a reason we rely on the generate_params_json params instead of the irpa params? There are ways to load and access irpa params directly in sharktank. Do we want/have something similar in shortfin?
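
For anyone following along, a hypothetical sketch of the direct-loading approach the comment alludes to. Dataset.load, the .properties mapping, and the key name are assumptions about sharktank's API and the irpa contents, not something this thread confirms:

```python
# Hypothetical sketch: read hyperparameters straight from the .irpa file
# instead of a separately exported config.json. Dataset.load, .properties,
# and the key name below are assumed, not confirmed by this PR.
from sharktank.types import Dataset

dataset = Dataset.load("/path/to/model.irpa")  # placeholder path
hparams = dataset.properties                   # assumed hyperparameter dict
print(hparams.get("attention_head_count_kv"))
```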

renxida (Contributor, Author) replied:

Oh, thank you! I should probably take a deeper look, then, to figure this part out.

renxida marked this pull request as a draft on November 1, 2024 17:07
renxida (Contributor, Author) commented on Dec 12, 2024

Replaced by #487

renxida closed this on Dec 12, 2024
Development

Successfully merging this pull request may close these issues:

Mismatches between config.json exported by export_paged_llm_v1.py and expected by shortfin