chore: switch images of llama.cpp to the RamaLama images #2708
Conversation
There was an image pull error mentioned regarding libkrun, but when I tested it on our infra, it seems to be working. I haven't tried with this PR yet; I'll have a look tomorrow. #2712
Note: I tested on macOS with a libkrun podman machine; I haven't tested on Windows.
Linux native:
- Documentation is accessible
- Playground is working as expected
- /v1/chat/completion works well from Postman (a minimal request sketch follows this list)
- Model refuses to start: MaziyarPanahi/Mistral-7B-Instruct-v0.3.Q4_K_M

Also, I am not getting model metrics with bartowski/granite-3.1-8b-instruct-GGUF; it works with TheBloke/Mistral-7B-Instruct-v0.2-GGUF.
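For reference, the kind of request used for that check — a minimal sketch assuming the playground's inference server listens on localhost:8080 (the port and model name here are assumptions, not taken from this PR):

```typescript
// Hedged sketch: POST to the OpenAI-compatible chat completions endpoint exposed by llama-server.
const response = await fetch('http://localhost:8080/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'mistral-7b-instruct', // model name is illustrative
    messages: [{ role: 'user', content: 'Say hello in one sentence.' }],
  }),
});
const data = await response.json();
console.log(data.choices[0].message.content);
```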
For the crash: the chat template is not available. The current list is chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, exaone3, gemma, granite, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, monarch, openchat, orion, phi3, rwkv-world, vicuna, vicuna-orca, zephyr; the wanted template is not in it.
The template for function calling was updated in the latest commit.
We might need some enhancement in RamaLama to start the server with that template (see the sketch below).
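For context, llama.cpp's llama-server already exposes flags for this; a minimal sketch of how the server arguments could carry an explicit template (the argument list and model path are assumptions, only the flag names come from llama.cpp's CLI):

```typescript
// Sketch only: extra llama-server arguments selecting a built-in chat template.
// '--chat-template' picks one of the templates listed above; '--jinja' enables
// Jinja template rendering, which the function-calling template relies on.
// Whether/how RamaLama forwards these arguments is an assumption here.
const serverArgs: string[] = [
  '--model', '/models/mistral-7b-instruct-v0.3.Q4_K_M.gguf', // hypothetical path
  '--chat-template', 'mistral-v3',
  '--jinja',
];
```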
LGTM on Windows without a GPU
Rebased, with temporary images for now (until the next release of RamaLama is out), just to try the PR. It includes this change from RamaLama: containers/ramalama#1053. I'm able to run the functional programming recipe.
It should be OK now (except that I'll replace the custom image with the official one on the next RamaLama release).
On Windows, running the model MaziyarPanahi/Mistral-7B-Instruct-v0.3.Q4_K_M, the server is not starting.
Logs:
{"msg":"exec container process `/usr/bin/llama-server.sh`: Exec format error","level":"error","time":"2025-03-28T14:30:48.439048Z"}
Image used: quay.io/fbenoit/ramalama-llama-server:jinja-2025-03-27
@axel7083 I think this is expected for now, as my image is arm64 only. I'm waiting for the official image (next release).
Good to know, tag me when I need to test again 👍
@axel7083 the 0.7.2 images of RamaLama became available a few minutes ago, so I switched to these images.
The CPU image is working; however, the CUDA image does not start and gives the following error:
chmod: cannot access './run.sh': No such file or directory
Sorry it took so long to review, the image is 6.75 GB and my internet is not great.
Forgot to mention here that I switched to the respin of the images yesterday evening.
@axel7083 looks like run.sh comes from AI Lab itself: podman-desktop-extension-ai-lab/packages/backend/src/workers/provider/LlamaCppPython.ts, lines 158 to 162 in 9fa9843.
But the image has no run.sh script anyway.
I updated the RamaLama images. I also updated the entrypoint to be used and removed the chmod operation, as the script already has the correct permissions. It should now work on Windows and Linux (a rough sketch of the idea is below).
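A hypothetical illustration of that change, not the extension's actual code; the option shape and function name are assumptions:

```typescript
// Sketch: container create options for the playground server container.
// Before the change, the command chmod-ed and ran a run.sh shipped by the old images,
// which fails on the RamaLama images since they contain no run.sh.
// After the change, no Cmd/Entrypoint override is set, so the image's own
// llama-server entrypoint is used directly.
interface ContainerCreateOptions {
  Image: string;
  Cmd?: string[];
  Entrypoint?: string[];
}

function createServerContainerOptions(image: string): ContainerCreateOptions {
  return { Image: image }; // rely on the image entrypoint; no chmod, no run.sh
}
```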
"cuda": "ghcr.io/containers/podman-desktop-extension-ai-lab-playground-images/ai-lab-playground-chat-cuda@sha256:e4b57e52c31b379b4a73f8e9536bc130fdea665d88dbd05643350295b3402a2f", | ||
"vulkan": "ghcr.io/containers/podman-desktop-extension-ai-lab-playground-images/ai-lab-playground-chat-vulkan@sha256:6a93b247099643f4f8c78ee9896c2ce4e9a455af114a69be09c16ad36aa51fd2" | ||
"default": "quay.io/ramalama/ramalama-llama-server@sha256:cbadb36fbbc2abf9867a33e6dfe3f2df4a76774259b5d4d24d50f4fc7e525406", | ||
"cuda": "quay.io/ramalama/cuda-llama-server@sha256:cbadb36fbbc2abf9867a33e6dfe3f2df4a76774259b5d4d24d50f4fc7e525406" |
Can't pull this image?
> podman pull quay.io/ramalama/cuda-llama-server@sha256:cbadb36fbbc2abf9867a33e6dfe3f2df4a76774259b5d4d24d50f4fc7e525406
Trying to pull quay.io/ramalama/cuda-llama-server@sha256:cbadb36fbbc2abf9867a33e6dfe3f2df4a76774259b5d4d24d50f4fc7e525406...
Error: initializing source docker://quay.io/ramalama/cuda-llama-server@sha256:cbadb36fbbc2abf9867a33e6dfe3f2df4a76774259b5d4d24d50f4fc7e525406: reading manifest sha256:cbadb36fbbc2abf9867a33e6dfe3f2df4a76774259b5d4d24d50f4fc7e525406 in quay.io/ramalama/cuda-llama-server: manifest unknown
🤦 the tag has been overridden, so the sha is no longer there; updating...
Seems that the CUDA image is a 404?
@axel7083 updated the sha, hoping the tag won't be overwritten today.
Okay, after testing with the CUDA image on Windows 11 (WSL2), the current configuration is not able to use the GPU, due to the GPU config.
Deep dive
We need to change the default value for the GPU layers (currently the default is -1 when undefined). We need to update the following to replace -1 with 99:
gpuLayers: options.gpuLayers ?? -1,
We also need to update the following comment:
podman-desktop-extension-ai-lab/packages/shared/src/models/InferenceServerConfig.ts, line 52 in d302152:
* -1 to offload all the layers
By changing the above, I was able to make the inference go fast fast fast.
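A minimal sketch of the default change discussed above (the option shape is taken from the quoted snippet; the helper name and the idea that 99 is "large enough to offload all layers" are assumptions):

```typescript
// Hedged sketch: with the RamaLama llama-server images, -1 no longer means
// "offload everything", so a large explicit default is used instead.
const GPU_LAYERS_OFFLOAD_ALL = 99; // assumption: >= the model's layer count

function resolveGpuLayers(options: { gpuLayers?: number }): number {
  return options.gpuLayers ?? GPU_LAYERS_OFFLOAD_ALL; // previously `?? -1`
}
```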
OK, so basically I had to apply the env for libkrun, but we need that for Windows/NVIDIA as well.
Yes, before it was working because -1 maxed out the offload, but now -1 gives 0 offloaded layers.
@axel7083 PR amended.
LGTM
LGTM.
Tested some recipes and models (OK, not all combinations) without issues on Windows.
LGTM, RamaLama inference starts.
What does this PR do?
Switch to the RamaLama images.
Use the default images for macOS with gpuLayers=999 (as RamaLama does for libkrun/macOS) rather than the Vulkan images (the Vulkan images are failing with libkrun, which is unstable on my end); a rough sketch of this selection is below.
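A minimal sketch of the image/offload selection described above; the function name, the GPU detection flag, and the non-macOS defaults are assumptions — only the macOS behaviour (default image, gpuLayers=999) and the image keys come from this PR:

```typescript
// Hedged sketch of choosing an image and GPU-layer count per platform.
const images = {
  default: 'quay.io/ramalama/ramalama-llama-server@sha256:…', // digests as in the config shown above
  cuda: 'quay.io/ramalama/cuda-llama-server@sha256:…',
};

function pickPlaygroundImage(platform: NodeJS.Platform, hasNvidiaGpu: boolean): { image: string; gpuLayers: number } {
  if (platform === 'darwin') {
    // macOS/libkrun: default image, offload all layers (as RamaLama does).
    return { image: images.default, gpuLayers: 999 };
  }
  // Elsewhere: CUDA image when an NVIDIA GPU is available (gpuLayers values are assumptions).
  return hasNvidiaGpu
    ? { image: images.cuda, gpuLayers: 99 }
    : { image: images.default, gpuLayers: 0 };
}
```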
Screenshot / video of UI
What issues does this PR fix or reference?
fixes #2630
How to test this PR?
Try to start a playground/service for a model.