Provide model info in chat ui & allow multiple models #598

Open
sallyom opened this issue Jan 16, 2025 · 11 comments

Comments

@sallyom
Collaborator

sallyom commented Jan 16, 2025

Rather than serving all models with the generic model.file filename, ramalama should provide more information about the currently served or loaded models.

Also, ramalama could allow passing a model-config file to make it easy to switch between models with a single server instance.
https://llama-cpp-python.readthedocs.io/en/latest/server/#configuration-and-multi-model-support
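
For reference, the multi-model config that server accepts looks roughly like this (the paths and aliases below are placeholders, not real files):

{
  "host": "0.0.0.0",
  "port": 8000,
  "models": [
    {
      "model": "models/granite-code.gguf",
      "model_alias": "granite-code"
    },
    {
      "model": "models/smollm-135m.gguf",
      "model_alias": "smollm:135m"
    }
  ]
}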

Here, I've renamed the file ramalama downloaded to granite-code so it shows up at llamacpp:port/models.
[Screenshot: the /models endpoint listing granite-code]

Currently, any model served by ramalama is listed as model.file
[Screenshot: the /models endpoint listing model.file]

@ericcurtin
Collaborator

We currently have just the llama.cpp server and vLLM server integrated, but we are open to the llama-cpp-python server also.

@ericcurtin
Collaborator

This is how it should look when we have multiple models:

{"object":"list","data":[{"id":"smollm:360m","object":"model","created":1737073177,"owned_by":"library"},{"id":"smollm:135m","object":"model","created":1736344249,"owned_by":"library"}]}

@ericcurtin
Collaborator

ericcurtin commented Jan 17, 2025

@sallyom I think we should consider this:

https://github.com/BerriAI/litellm

It's compatible with vLLM and llama.cpp and sits in front of either (whereas llama-cpp-python only gives us a llama.cpp solution).

vLLM closed the door on this feature, suggesting users just put litellm in front:

vllm-project/vllm#299 (comment)

Considering we want switchable runtimes, it's ideal.

@ericcurtin
Collaborator

ericcurtin commented Jan 17, 2025

After reading a bit, I'm gonna propose we write our own proxy; litellm won't provide what we need. With litellm you can proxy/route/bridge (whichever terminology you prefer) requests to pre-spun-up llama-server or vLLM servers, each running a single model. So using something like litellm, we'd need to spin up a model server for every single model on a machine, which would use too much memory.

So I'm proposing we write something called ramalama-server (which might be more correctly named ramalama-proxy, but it may grow to have other features in the future). I suggest python3: the HTTP libraries we need are built into its stdlib, and it will only be a couple of hundred lines of python3.

So when we call:

ramalama serve some-model

there's no functional change, except the python3 script we now call in the container would be "ramalama-server --runtime vllm/llama.cpp ...". Of course we should make this run without containers too, for macOS.

Alternatively, when we call it without specifying a model, like:

ramalama serve

it will run a ramalama-server process, which reads the names of all the available models on disk in the ~/.local/share/ramalama/models/ directory (so it's immediately able to respond to /models requests as in the example above):

ramalama serve -> ramalama-server
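
For illustration only, a rough python3 sketch of that discovery half, using just the stdlib HTTP server (the model directory layout, the port, and the "owned_by" value here are assumptions, not a settled design):

import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed model store location; ramalama's real on-disk layout may differ.
MODEL_DIR = os.path.expanduser("~/.local/share/ramalama/models")

def list_models():
    # Walk the model store and build an OpenAI-style /v1/models payload.
    data = []
    for root, _dirs, files in os.walk(MODEL_DIR):
        for name in files:
            path = os.path.join(root, name)
            data.append({
                "id": os.path.relpath(path, MODEL_DIR),
                "object": "model",
                "created": int(os.path.getmtime(path)),
                "owned_by": "library",
            })
    return {"object": "list", "data": data}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path in ("/models", "/v1/models"):
            body = json.dumps(list_models()).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()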

When the first request comes in for a model (I'm gonna use llama-server as the example, but we would also ensure compatibility with vllm):

request: granite3-moe "Write an email to my boss notifying them about vacation plans"

ramalama-server then forks a llama-server process serving granite3-moe model:

ramalama-server -> llama-server granite3-moe

When a second request comes in for the same model:

request: granite3-moe "Thank my boss for approving my request"

We just re-use the running process:

ramalama-server -> llama-server granite3-moe

When a third request comes in for a new model, granite3-dense:

request: granite3-dense "Write a git commit message for this change"

We terminate the existing llama-server process and spin up a new one because now we need to serve a new model:

ramalama-server -> llama-server granite3-dense
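
For illustration only, a rough python3 sketch of that fork/reuse/replace behaviour (the llama-server flags and the internal backend port shown are assumptions about how ramalama-server would launch the child server):

import subprocess

class Backend:
    # Keep at most one llama-server child process alive at a time.
    def __init__(self, backend_port=8081):
        self.backend_port = backend_port  # assumed internal port for the child server
        self.current_model = None
        self.process = None

    def ensure(self, model_path):
        # Reuse the running llama-server if it already serves this model.
        if self.process and self.current_model == model_path:
            return
        # A different model was requested: terminate the old server first.
        if self.process:
            self.process.terminate()
            self.process.wait()
        # Fork a new llama-server for the requested model.
        self.process = subprocess.Popen(
            ["llama-server", "-m", model_path, "--port", str(self.backend_port)])
        self.current_model = model_path

# backend = Backend()
# backend.ensure("granite3-moe")    # first request: forks llama-server for granite3-moe
# backend.ensure("granite3-moe")    # same model again: the running process is reused
# backend.ensure("granite3-dense")  # new model: the old server is replaced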

Thoughts @sallyom ?

@ericcurtin
Collaborator

We should probably wrap "llama-run" in a "ramalama-runner" python3 script while we are at it; that will prove useful.

@rhatdan
Member

rhatdan commented Jan 17, 2025

That makes some sense. We might also want to enhance it to handle RAG serving as well.

@ericcurtin
Collaborator

ericcurtin commented Jan 21, 2025

@rhatdan @slemeur "ramalama serve"/"ramalama-server" is the endpoint Podman AI Lab would talk to if the projects were to converge.

I would suggest Podman AI Lab ignore commands like "ramalama run" and "ramalama pull"; those are the CLI use case.

@vpavlin

vpavlin commented Jan 23, 2025

Rather than serving all models with the generic model.file filename, ramalama should provide more information about the currently served or loaded models.

Yes! I tried to use ramalama serve yesterday and had the same thought :) But I did not want to overload @ericcurtin with too many new issues :-P

@ericcurtin
Collaborator

It's a team effort 😄

@rhatdan
Member

rhatdan commented Jan 23, 2025

Please overload us.
