Local LLM performance issue #59
-
What Stanford Spezi module is your challenge related to?
SpeziML

Description
Hi, my team is interested in your framework, and I am currently exploring it, especially the LLM features. I have uploaded a sample project to GitHub; please find the repository at the link below or via my profile.

Reproduction
https://github.com/dg6546/spezillm_demo

Expected behavior
Is faster generation possible?

Additional context
N/A
-
@dg6546 Thank you for sharing this challenge! llama.cpp might not be ideal for a lot of on-device use cases, and we don't do a great job of optimizing for this in the current state; it would be great to explore the new improvements that Apple shipped with iOS 18 and use Core ML and other Apple technologies for local LLM execution. You can find documentation on how to load a model file from anywhere within your app and without a UI using the […]. Feel free to create issues in the SpeziLLM repo for anything you encounter. PRs and contributions to support and extend SpeziLLM are always more than welcome!
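As a rough illustration of the "load a model without a UI" path mentioned above, here is a minimal sketch: it registers the local LLM platform in the Spezi app delegate and then drives a session purely from code. The type and initializer names (`LLMRunner`, `LLMLocalPlatform`, `LLMLocalSchema`, `LLMLocalSession`, and especially the `modelPath`-based schema initializer from the llama.cpp era discussed in this thread) are assumptions on my side and may differ between SpeziLLM releases, so please check them against the current SpeziLLM documentation.

```swift
import Foundation
import Spezi
import SpeziLLM
import SpeziLLMLocal

// 1. Register the local LLM execution platform with the Spezi LLM runner.
class LLMDemoAppDelegate: SpeziAppDelegate {
    override var configuration: Configuration {
        Configuration {
            LLMRunner {
                LLMLocalPlatform()
            }
        }
    }
}

// 2. Drive a session from code, without any SwiftUI chat view.
//    The runner turns a schema describing the model into an executable session.
func runLocalPrompt(runner: LLMRunner) async throws -> String {
    // NOTE: the modelPath-based initializer shown here reflects the llama.cpp-era
    // API referenced in this thread; newer releases may expect a model identifier.
    let session: LLMLocalSession = runner(
        with: LLMLocalSchema(
            modelPath: URL.applicationSupportDirectory.appending(path: "llm.gguf")
        )
    )

    // Append the prompt to the session's chat context.
    await MainActor.run {
        session.context.append(userInput: "Hello, how fast can you answer?")
    }

    var output = ""
    // generate() streams tokens asynchronously as they are produced.
    for try await token in try await session.generate() {
        output += token
    }
    return output
}
```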
We want to point out that SpeziLLM now uses a much more efficient LLM inference backend built on top of MLX; it should resolve most of the performance constraints we have seen before.
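For completeness, a rough sketch of how model selection might look with the MLX-based backend; the model identifier below is purely illustrative, and the exact `LLMLocalSchema` API for your SpeziLLM version should be taken from its documentation.

```swift
import SpeziLLM
import SpeziLLMLocal

// With the MLX-based backend, the local schema typically references a model by
// identifier (downloaded and managed by SpeziLLM) instead of a manually
// provided GGUF file path.
// NOTE: `.llama3_8B_4bit` is a hypothetical identifier used only for illustration.
let mlxSchema = LLMLocalSchema(model: .llama3_8B_4bit)
```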