Provide model info in chat ui & allow multiple models #598
We have just the llama.cpp server and the vLLM server integrated, but are open to the llama-cpp-python server as well.
This is how it should look when we have multiple models:
@sallyom I think we should consider this: https://github.com/BerriAI/litellm. It's compatible with vLLM and llama.cpp and sits in front of either (whereas llama-cpp-python only gives us a llama.cpp solution). vLLM closed the door on this feature, suggesting users just put litellm in front: vllm-project/vllm#299 (comment). Considering we want switchable runtimes, it's ideal.
After reading a bit, I'm gonna propose we write our own proxy; litellm won't provide what we need. With litellm you can proxy/route/bridge (whichever terminology you prefer) requests to pre-spun-up llama-server or vLLM servers, each running a single model. So using something like litellm, we'd need to spin up a model server for every single model on a machine, which would use too much memory. So I'm proposing we write something called ramalama-server (which might be more correctly named ramalama-proxy, but it may grow other features in future). I suggest python3; the HTTP libraries we need are built into its stdlib, and it will only be a couple of hundred lines of python3. So when we call:
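Something like the following, where the exact CLI form is just illustrative (whatever `ramalama serve` already accepts):

```console
$ ramalama serve granite3-moe
```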
There's no functional change, except the python3 script we now call in the container would be "ramalama-server --runtime vllm/llama.cpp ...". We should of course make this run without containers too, for macOS. Alternatively, when we call without specifying a model, like:
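For example (again, the CLI form is illustrative):

```console
$ ramalama serve
```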
It will run a ramalama-server process, which reads the names of all the available models on disk in the ~/.local/share/ramalama/models/ directory, so it's immediately able to respond to the /models request from the example above:
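For illustration, assuming the OpenAI-compatible /v1/models shape that llama-server and vLLM already expose (port, model names, and exact fields here are just examples):

```console
$ curl http://localhost:8080/v1/models
{
  "object": "list",
  "data": [
    {"id": "granite3-moe", "object": "model"},
    {"id": "granite3-dense", "object": "model"}
  ]
}
```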
When the first request comes in for a model (I'm gonna use llama-server as the example, but we would also ensure compatibility with vLLM):
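For example, an OpenAI-style chat completion naming the model (the payload shape follows the convention llama-server serves; port and details are illustrative):

```console
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "granite3-moe", "messages": [{"role": "user", "content": "Hello"}]}'
```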
ramalama-server then forks a llama-server process serving the granite3-moe model:
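Roughly like this (the path, port, and exact llama-server flags are illustrative):

```console
$ pgrep -af llama-server
12345 llama-server -m ~/.local/share/ramalama/models/granite3-moe --port 8081
```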
When a second request comes in for the same model:
We just re-use the running process:
When a third request comes in for a new model, granite3-dense:
We terminate the existing llama-server process and spin up a new one because now we need to serve a new model:
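A minimal sketch of that routing logic, assuming only the python3 stdlib (the model directory layout, backend port, and llama-server flags are illustrative, and a real version would also wait for the backend to finish loading and stream responses):

```python
#!/usr/bin/env python3
# Illustrative sketch of the proposed ramalama-server proxy: one backend
# model server at a time, swapped out when a different model is requested.
import http.server
import json
import os
import subprocess
import urllib.request

MODEL_DIR = os.path.expanduser("~/.local/share/ramalama/models")
BACKEND_PORT = 8081  # port the forked llama-server listens on (illustrative)
current = {"model": None, "proc": None}


def ensure_backend(model):
    """Fork a llama-server for `model`, stopping any server for a different model."""
    if current["model"] == model and current["proc"] and current["proc"].poll() is None:
        return  # same model requested again: re-use the running llama-server
    if current["proc"] and current["proc"].poll() is None:
        current["proc"].terminate()  # new model requested: terminate the old server
        current["proc"].wait()
    current["proc"] = subprocess.Popen(
        ["llama-server", "-m", os.path.join(MODEL_DIR, model), "--port", str(BACKEND_PORT)]
    )
    current["model"] = model


class Proxy(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.endswith("/models"):
            # Answer /models directly from the files on disk.
            models = [{"id": name, "object": "model"} for name in os.listdir(MODEL_DIR)]
            body = json.dumps({"object": "list", "data": models}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def do_POST(self):
        payload = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        model = json.loads(payload).get("model", "")
        ensure_backend(model)
        # Forward the request unchanged to the single running llama-server.
        req = urllib.request.Request(
            f"http://127.0.0.1:{BACKEND_PORT}{self.path}",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    http.server.HTTPServer(("0.0.0.0", 8080), Proxy).serve_forever()
```

The point of this shape is that only one backend model server is alive at a time, so memory use stays bounded to a single loaded model.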
Thoughts @sallyom?
We should probably wrap "llama-run" in a "ramalama-runner" python3 script while we are at it; that will prove useful.
That makes some sense. We might also want to enhance that to handle RAG serving as well.
Yes! I tried to use
It's a team effort 😄
Please overload us.
Rather than serving all models with the generic `model.file` filename, ramalama should provide more information about the currently served or loaded models. It could also allow passing a model-config file, to make it easy to switch between models with a single server instance:
https://llama-cpp-python.readthedocs.io/en/latest/server/#configuration-and-multi-model-support
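From memory of those docs, the multi-model config is a JSON file along these lines (field names and values here are illustrative, so double-check against the linked page):

```json
{
  "host": "0.0.0.0",
  "port": 8080,
  "models": [
    { "model": "models/granite-code.gguf", "model_alias": "granite-code" },
    { "model": "models/granite3-moe.gguf", "model_alias": "granite3-moe" }
  ]
}
```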
Here, I've renamed the file ramalama downloaded to `granite-code` so it shows up at llamacpp:port/models. Currently, any model served by ramalama is listed as `model.file`.