Feature Request: Larger Context Length #144

Open
c0zaut opened this issue Dec 17, 2024 · 5 comments

@c0zaut

c0zaut commented Dec 17, 2024

Per issue #93:

I would like to open a feature request to support context lengths greater than 4k. While I can initialize with a larger context, submit a prompt, and get output, the runtime still throws matmul errors, and the output is inaccurate at best:

* Running on local URL:  http://0.0.0.0:8080

To create a public link, set `share=True` in `launch()`.
No model loaded! Continuing with initialization...
=========INITIALIZING===========
I rkllm: rkllm-runtime version: 1.1.2, rknpu driver version: 0.9.7, platform: RK3588

RKLLM Model, internlm2_5-1_8b-chat-w8a8_g512-opt has been initialized successfully!
==============================

E RKNN: [00:45:12.110] meet unkown shape, op name: matmul_qkv_rkllm_spilt_1, shape: 64, 4160, 128
2features matmul matmul run failed
E RKNN: [00:45:12.110] meet unkown shape, op name: matmul_qkv_rkllm_spilt_2, shape: 64, 4160, 128
2features matmul matmul run failed
E RKNN: [00:45:12.125] meet unkown shape, op name: matmul_qk_rkllm_spilt_2, shape: 64, 128, 4160
2features matmul matmul run failed
E RKNN: [00:45:12.125] meet unkown shape, op name: matmul_qk_rkllm_spilt_1, shape: 64, 128, 4160

...

E RKNN: [00:45:13.315] meet unkown shape, op name: matmul_qk_rkllm_spilt_0, shape: 64, 128, 4224
2features matmul matmul run failed
E RKNN: [00:45:13.321] meet unkown shape, op name: matmul_qkv_rkllm_spilt_0, shape: 64, 4224, 128
E RKNN: [00:45:13.321] meet unkown shape, op name: matmul_qkv_rkllm_spilt_1, shape: 64, 4224, 128
2features matmul matmul run failed
2features matmul matmul run failed

...

E RKNN: [00:45:13.546] meet unkown shape, op name: matmul_qk_rkllm_spilt_0, shape: 64, 128, 4288
2features matmul matmul run failed
E RKNN: [00:45:13.553] meet unkown shape, op name: matmul_qkv_rkllm_spilt_1, shape: 64, 4288, 128
E RKNN: [00:45:13.553] meet unkown shape, op name: matmul_qkv_rkllm_spilt_2, shape: 64, 4288, 128
2features matmul matmul run failed
2features matmul matmul run failed

...

--------------------------------------------------------------------------------------
 Stage         Total Time (ms)  Tokens    Time per Token (ms)      Tokens per Second       
--------------------------------------------------------------------------------------
 Prefill       48433.63         5052      9.59                     104.31                  
 Generate      3751388.33       8191      458.65                   2.18                    
--------------------------------------------------------------------------------------

I know that llama.cpp lets you configure RoPE scaling for larger context windows, and there are references to rope-scaling debug output inside librkllmrt.so. Is there a parameter we can set to expand the context window? If not, would it be possible to add support for one?
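
For reference, llama.cpp exposes this through its RoPE options (e.g. `--rope-scaling linear` together with `--rope-freq-scale`). Below is a minimal numpy sketch of what linear RoPE scaling ("position interpolation") does conceptually; it is an illustration of the technique, not the RKLLM or llama.cpp implementation:

```python
import numpy as np

def rope_angles(positions, head_dim, scale=1.0, base=10000.0):
    """Rotation angles for rotary position embeddings (RoPE).

    Linear scaling ("position interpolation") divides positions by `scale`,
    squeezing them back into the range the model was trained on; that is
    what lets a model trained at 4k attend over a longer window.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    return np.outer(np.asarray(positions) / scale, inv_freq)

# A model trained for 4096 positions, run with an 8192-token window:
# scale = 8192 / 4096 = 2, so position 8190 is rotated as if it were 4095.
angles = rope_angles(np.arange(8192), head_dim=128, scale=2.0)
assert np.allclose(angles[8190], rope_angles([4095], head_dim=128)[0])
```

In llama.cpp terms, `--rope-freq-scale 0.5` roughly corresponds to `scale = 2` here; the request is for an equivalent knob in rkllm-runtime.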

@imkebe

imkebe commented Dec 17, 2024

Yes, 4k is small. I was also trying to use longer contexts: the initial memory allocation succeeded, but inference failed.

@waydong
Collaborator

waydong commented Dec 23, 2024

Hi, the maximum supported context is 4096. What is your application scenario, and how large a context length do you need?

@c0zaut
Author

c0zaut commented Dec 23, 2024

@waydong - Thank you for confirming that is still the case! I want to incorporate RAG and web-search tool calling, which requires a larger context window for tasks like summarizing legal documents and sorting through a large number of search results. Some models can go up to 132K, but anything over 8K would be greatly appreciated!

I currently have a basic chat implementation of RKLLM, using Gradio and based on the example you provide: https://github.com/c0zaut/RKLLM-Gradio

Another useful case would be inputting larger OBJ meshes into LLaMA-Mesh: c01zaut/LLaMA-Mesh-rk3588-1.1.2

Web search examples:

https://github.com/InternLM/MindSearch

https://github.com/infinigence/InfiniWebSearch

Megrez 3B model here: https://huggingface.co/Infinigence/Megrez-3B-Instruct

^ converts to RKLLM and works like a charm, as long as you go through all of the config files and set the EOS token to 120005 instead of 120025, or make sure you use <|turn_end|> in your prefix and postfix.
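
For anyone reproducing this, here is a minimal sketch of that config edit, assuming the standard Hugging Face file layout (`config.json` / `generation_config.json`; adjust the names to whatever the Megrez export actually ships):

```python
import json
from pathlib import Path

# Sketch only: patch eos_token_id before converting to RKLLM. The file names
# and the presence of a top-level eos_token_id field are assumptions.
model_dir = Path("Megrez-3B-Instruct")
for name in ("config.json", "generation_config.json"):
    path = model_dir / name
    if not path.exists():
        continue
    cfg = json.loads(path.read_text())
    if cfg.get("eos_token_id") == 120025:
        cfg["eos_token_id"] = 120005  # <|turn_end|>, per the note above
        path.write_text(json.dumps(cfg, indent=2))
```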

Thank you!

@imkebe

imkebe commented Dec 23, 2024

For now we are talking about a 128k maximum; that is the ceiling for the known supported models (Phi-3, Qwen2.5, etc.).

@c0zaut
Author

c0zaut commented Dec 25, 2024

@waydong - it would also be nice to have a longer context for chat history.
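
Until the runtime supports longer windows, chat history has to be trimmed to fit the 4096-token budget. Here is a minimal sketch of budget-based truncation; `count_tokens` is a hypothetical stand-in for whatever tokenizer the application uses:

```python
def trim_history(messages, count_tokens, budget=4096, reserve=512):
    """Keep the most recent messages whose combined token count fits in
    budget - reserve, leaving `reserve` tokens for the model's reply.

    messages: list of strings, oldest first.
    count_tokens: hypothetical callable mapping a string to its token count.
    """
    kept, used = [], 0
    for msg in reversed(messages):
        tokens = count_tokens(msg)
        if used + tokens > budget - reserve:
            break
        kept.append(msg)
        used += tokens
    return list(reversed(kept))

# Example with a crude whitespace "tokenizer" as a stand-in:
history = ["hello", "long " * 4000, "most recent question"]
print(trim_history(history, count_tokens=lambda s: len(s.split())))
# -> ['most recent question']
```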
