Provide model info in chat ui & allow multiple models #598
We have just the llama.cpp server and the vLLM server integrated, but are open to the llama-cpp-python server as well.
This is how it should look when we have multiple models:
@sallyom I think we should consider this: https://github.com/BerriAI/litellm. It's compatible with vLLM and llama.cpp and sits in front of either (whereas llama-cpp-python only gives us a llama.cpp solution). vLLM closed the door on this feature, suggesting users just put litellm in front: vllm-project/vllm#299 (comment). Considering we want switchable runtimes, it's ideal.
After reading a bit, I'm gonna propose we write our own proxy; litellm won't provide what we need. With litellm you can proxy/route/bridge (whichever terminology you prefer) requests to pre-spun-up llama-server or vLLM servers, each running a single model. So using something like litellm, we'd need to spin up a model server for every single model on a machine, which would use too much memory. So I'm proposing we write something called ramalama-server (which might be more correctly named ramalama-proxy, but it may grow other features in future). I suggest python3; the HTTP libraries we need are built into its stdlib, and it will only be a couple of hundred lines of python3. So when we call:
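Something like the following, where the exact CLI form is just illustrative (whatever `ramalama serve` already accepts):

```console
$ ramalama serve granite3-moe
```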
There's no functional change, except the python3 script we now call in the container would be "ramalama-server --runtime vllm/llama.cpp ...". We should of course make this run without containers too, for macOS. Alternatively, when we call without specifying a model, like:
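For example (again, the CLI form is illustrative):

```console
$ ramalama serve
```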
It will run a ramalama-server process, which reads the names of all the available models on disk in the ~/.local/share/ramalama/models/ directory, so it's immediately able to respond to the /models request from the example above:
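For illustration, assuming the OpenAI-compatible /v1/models shape that llama-server and vLLM already expose (port, model names, and exact fields here are just examples):

```console
$ curl http://localhost:8080/v1/models
{
  "object": "list",
  "data": [
    {"id": "granite3-moe", "object": "model"},
    {"id": "granite3-dense", "object": "model"}
  ]
}
```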
When the first request comes in for a model (I'm gonna use llama-server as the example, but we would also ensure compatibility with vLLM):
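For example, an OpenAI-style chat completion naming the model (the payload shape follows the convention llama-server serves; port and details are illustrative):

```console
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "granite3-moe", "messages": [{"role": "user", "content": "Hello"}]}'
```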
ramalama-server then forks a llama-server process serving the granite3-moe model:
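Roughly like this (the path, port, and exact llama-server flags are illustrative):

```console
$ pgrep -af llama-server
12345 llama-server -m ~/.local/share/ramalama/models/granite3-moe --port 8081
```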
When a second request comes in for the same model:
We just re-use the running process:
When a third request comes in for a new model, granite3-dense:
We terminate the existing llama-server process and spin up a new one because now we need to serve a new model:
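A minimal sketch of that routing logic, assuming only the python3 stdlib (the model directory layout, backend port, and llama-server flags are illustrative, and a real version would also wait for the backend to finish loading and stream responses):

```python
#!/usr/bin/env python3
# Illustrative sketch of the proposed ramalama-server proxy: one backend
# model server at a time, swapped out when a different model is requested.
import http.server
import json
import os
import subprocess
import urllib.request

MODEL_DIR = os.path.expanduser("~/.local/share/ramalama/models")
BACKEND_PORT = 8081  # port the forked llama-server listens on (illustrative)
current = {"model": None, "proc": None}


def ensure_backend(model):
    """Fork a llama-server for `model`, stopping any server for a different model."""
    if current["model"] == model and current["proc"] and current["proc"].poll() is None:
        return  # same model requested again: re-use the running llama-server
    if current["proc"] and current["proc"].poll() is None:
        current["proc"].terminate()  # new model requested: terminate the old server
        current["proc"].wait()
    current["proc"] = subprocess.Popen(
        ["llama-server", "-m", os.path.join(MODEL_DIR, model), "--port", str(BACKEND_PORT)]
    )
    current["model"] = model


class Proxy(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.endswith("/models"):
            # Answer /models directly from the files on disk.
            models = [{"id": name, "object": "model"} for name in os.listdir(MODEL_DIR)]
            body = json.dumps({"object": "list", "data": models}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def do_POST(self):
        payload = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        model = json.loads(payload).get("model", "")
        ensure_backend(model)
        # Forward the request unchanged to the single running llama-server.
        req = urllib.request.Request(
            f"http://127.0.0.1:{BACKEND_PORT}{self.path}",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    http.server.HTTPServer(("0.0.0.0", 8080), Proxy).serve_forever()
```

The point of this shape is that only one backend model server is alive at a time, so memory use stays bounded to a single loaded model.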
Thoughts @sallyom?
We should probably wrap "llama-run" in a "ramalama-runner" python3 script while we are at it; that will prove useful.
That makes some sense. We might also want to enhance that to handle RAG serving as well.
Yes! I tried to use
It's a team effort 😄
Please overload us.
Rather than serving all models with the generic `model.file` filename, ramalama should provide more information about the currently served or loaded models. It could also allow passing a model-config file, to make it easy to switch between models with a single server instance:
https://llama-cpp-python.readthedocs.io/en/latest/server/#configuration-and-multi-model-support
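From memory of those docs, the multi-model config is a JSON file along these lines (field names and values here are illustrative, so double-check against the linked page):

```json
{
  "host": "0.0.0.0",
  "port": 8080,
  "models": [
    { "model": "models/granite-code.gguf", "model_alias": "granite-code" },
    { "model": "models/granite3-moe.gguf", "model_alias": "granite3-moe" }
  ]
}
```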
Here, I've renamed the file ramalama downloaded to `granite-code` so it shows up at llamacpp:port/models. Currently, any model served by ramalama is listed as `model.file`.