The following edits were required to make llama3 8b fp16 work:
config["attn_head_count"] = 8 # 8 instead of 32
config["paged_kv_cache"] = {}
config["paged_kv_cache"]["block_seq_stride"] = config["block_seq_stride"]
del config["block_seq_stride"]
config["paged_kv_cache"]["device_block_count"] = 256
There are two main problems:

1. attn_head_count should be set to attention_head_count_kv from export_paged_llm_v1, not attention_head_count. This should be fixed in sharktank, at least by including both attention head counts in the exported config (see the sketch after this list).
2. The KV cache parameters should be nested under config["paged_kv_cache"] rather than sitting at the top level.
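A rough sketch of the config shape that would let shortfin consume the export without hand edits, assuming sharktank emits both head counts and nests the KV cache parameters; any field or value not discussed above (e.g. the stride) is illustrative only:

    # Illustrative shape only; values correspond to llama3 8b fp16 as described in this issue.
    config = {
        "attn_head_count": 8,        # attention_head_count_kv (GQA KV heads), what shortfin needs
        "attention_head_count": 32,  # query head count, exported alongside for completeness
        "paged_kv_cache": {
            "block_seq_stride": 16,      # assumed stride; take from the exporter's setting
            "device_block_count": 256,
        },
    }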
We really need integration tests between sharktank and shortfin; a sketch of one such check follows.
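As a starting point, a hedged sketch of a test that only validates the layout of the JSON written by export_paged_llm_v1 against what shortfin reads, without running either runtime; the path is an assumption:

    import json

    # Hypothetical location of the exported config; wire this up to the export step in CI.
    CONFIG_PATH = "exported/llama3_8b_fp16_config.json"

    def test_exported_config_matches_shortfin_schema():
        with open(CONFIG_PATH) as f:
            config = json.load(f)

        # shortfin sizes its KV cache from the KV head count, not the query head count.
        assert "attn_head_count" in config

        # KV cache parameters must be nested under "paged_kv_cache", not top-level.
        paged = config.get("paged_kv_cache", {})
        assert "block_seq_stride" in paged
        assert "device_block_count" in paged
        assert "block_seq_stride" not in config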