Feature Request: Larger Context Length #144

Open
c0zaut opened this issue Dec 17, 2024 · 5 comments

@c0zaut

c0zaut commented Dec 17, 2024

Per issue #93:

I would like to open a feature request to support context lengths greater than 4k. While I can initialize with a larger context, submit a prompt, and get output, the runtime still throws matmul errors, and the output is inaccurate at best:

* Running on local URL:  http://0.0.0.0:8080

To create a public link, set `share=True` in `launch()`.
No model loaded! Continuing with initialization...
=========INITIALIZING===========
I rkllm: rkllm-runtime version: 1.1.2, rknpu driver version: 0.9.7, platform: RK3588

RKLLM Model, internlm2_5-1_8b-chat-w8a8_g512-opt has been initialized successfully!
==============================

E RKNN: [00:45:12.110] meet unkown shape, op name: matmul_qkv_rkllm_spilt_1, shape: 64, 4160, 128
2features matmul matmul run failed
E RKNN: [00:45:12.110] meet unkown shape, op name: matmul_qkv_rkllm_spilt_2, shape: 64, 4160, 128
2features matmul matmul run failed
E RKNN: [00:45:12.125] meet unkown shape, op name: matmul_qk_rkllm_spilt_2, shape: 64, 128, 4160
2features matmul matmul run failed
E RKNN: [00:45:12.125] meet unkown shape, op name: matmul_qk_rkllm_spilt_1, shape: 64, 128, 4160

...

E RKNN: [00:45:13.315] meet unkown shape, op name: matmul_qk_rkllm_spilt_0, shape: 64, 128, 4224
2features matmul matmul run failed
E RKNN: [00:45:13.321] meet unkown shape, op name: matmul_qkv_rkllm_spilt_0, shape: 64, 4224, 128
E RKNN: [00:45:13.321] meet unkown shape, op name: matmul_qkv_rkllm_spilt_1, shape: 64, 4224, 128
2features matmul matmul run failed
2features matmul matmul run failed

...

E RKNN: [00:45:13.546] meet unkown shape, op name: matmul_qk_rkllm_spilt_0, shape: 64, 128, 4288
2features matmul matmul run failed
E RKNN: [00:45:13.553] meet unkown shape, op name: matmul_qkv_rkllm_spilt_1, shape: 64, 4288, 128
E RKNN: [00:45:13.553] meet unkown shape, op name: matmul_qkv_rkllm_spilt_2, shape: 64, 4288, 128
2features matmul matmul run failed
2features matmul matmul run failed

...

--------------------------------------------------------------------------------------
 Stage         Total Time (ms)  Tokens    Time per Token (ms)      Tokens per Second       
--------------------------------------------------------------------------------------
 Prefill       48433.63         5052      9.59                     104.31                  
 Generate      3751388.33       8191      458.65                   2.18                    
--------------------------------------------------------------------------------------

I know that llama.cpp lets you configure RoPE scaling for larger context windows, and there are references to rope-scaling debug output inside librkllmrt.so. Is there a parameter we can set to expand the context window? If not, would it be possible to add support for one?
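
For reference, llama.cpp exposes this through its RoPE options (e.g. `--rope-scaling linear` together with `--rope-freq-scale`). Below is a minimal numpy sketch of what linear RoPE scaling ("position interpolation") does conceptually; it is an illustration of the technique, not the RKLLM or llama.cpp implementation:

```python
import numpy as np

def rope_angles(positions, head_dim, scale=1.0, base=10000.0):
    """Rotation angles for rotary position embeddings (RoPE).

    Linear scaling ("position interpolation") divides positions by `scale`,
    squeezing them back into the range the model was trained on; that is
    what lets a model trained at 4k attend over a longer window.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    return np.outer(np.asarray(positions) / scale, inv_freq)

# A model trained for 4096 positions, run with an 8192-token window:
# scale = 8192 / 4096 = 2, so position 8190 is rotated as if it were 4095.
angles = rope_angles(np.arange(8192), head_dim=128, scale=2.0)
assert np.allclose(angles[8190], rope_angles([4095], head_dim=128)[0])
```

In llama.cpp terms, `--rope-freq-scale 0.5` roughly corresponds to `scale = 2` here; the request is for an equivalent knob in rkllm-runtime.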

@imkebe

imkebe commented Dec 17, 2024

Yes, 4k is small. I was also trying to use longer contexts: the initial memory allocation succeeded, but inference failed.

@waydong
Collaborator

waydong commented Dec 23, 2024

Hi, the maximum supported context is 4096. What is your application scenario, and how large a context length do you need?

@c0zaut
Author

c0zaut commented Dec 23, 2024

@waydong - Thank you for confirming that is still the case! I want to incorporate RAG and web-search tool calling, which requires a larger context window for tasks like summarizing legal documents and sorting through a large number of search results. Some models can go up to 132K, but anything over 8K would be greatly appreciated!

I currently have a basic chat implementation of RKLLM, using Gradio and based on the example you provide: https://github.com/c0zaut/RKLLM-Gradio

Another useful case would be inputting larger OBJ meshes into LLaMA-Mesh: c01zaut/LLaMA-Mesh-rk3588-1.1.2

Web search examples:

https://github.com/InternLM/MindSearch

https://github.com/infinigence/InfiniWebSearch

Megrez 3B model here: https://huggingface.co/Infinigence/Megrez-3B-Instruct

^ converts to RKLLM and works like a charm, as long as you go through all of the config files and set the EOS token to 120005 instead of 120025, or make sure you use <|turn_end|> in your prefix and postfix.
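
For anyone reproducing this, here is a minimal sketch of that config edit, assuming the standard Hugging Face file layout (`config.json` / `generation_config.json`; adjust the names to whatever the Megrez export actually ships):

```python
import json
from pathlib import Path

# Sketch only: patch eos_token_id before converting to RKLLM. The file names
# and the presence of a top-level eos_token_id field are assumptions.
model_dir = Path("Megrez-3B-Instruct")
for name in ("config.json", "generation_config.json"):
    path = model_dir / name
    if not path.exists():
        continue
    cfg = json.loads(path.read_text())
    if cfg.get("eos_token_id") == 120025:
        cfg["eos_token_id"] = 120005  # <|turn_end|>, per the note above
        path.write_text(json.dumps(cfg, indent=2))
```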

Thank you!

@imkebe

imkebe commented Dec 23, 2024

For now we are talking about a 128k maximum; that is the ceiling for the known supported models (Phi-3, Qwen2.5, etc.).

@c0zaut
Author

c0zaut commented Dec 25, 2024

@waydong - it would also be nice to have a longer context for chat history.
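
Until the runtime supports longer windows, chat history has to be trimmed to fit the 4096-token budget. Here is a minimal sketch of budget-based truncation; `count_tokens` is a hypothetical stand-in for whatever tokenizer the application uses:

```python
def trim_history(messages, count_tokens, budget=4096, reserve=512):
    """Keep the most recent messages whose combined token count fits in
    budget - reserve, leaving `reserve` tokens for the model's reply.

    messages: list of strings, oldest first.
    count_tokens: hypothetical callable mapping a string to its token count.
    """
    kept, used = [], 0
    for msg in reversed(messages):
        tokens = count_tokens(msg)
        if used + tokens > budget - reserve:
            break
        kept.append(msg)
        used += tokens
    return list(reversed(kept))

# Example with a crude whitespace "tokenizer" as a stand-in:
history = ["hello", "long " * 4000, "most recent question"]
print(trim_history(history, count_tokens=lambda s: len(s.split())))
# -> ['most recent question']
```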
