
chore: switch images of llama.cpp to the RamaLama images #2708

Merged
merged 7 commits into containers:main
Apr 9, 2025

Conversation

benoitf
Collaborator

@benoitf benoitf commented Mar 14, 2025

What does this PR do?

switch to the RamaLama images

Use the default images for macOS with gpuLayers=999 (as RamaLama does for libkrun/macOS) rather than the Vulkan images
(the Vulkan images are failing with libkrun; unstable on my end)
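
For illustration, this is roughly the effect in plain podman terms (hypothetical tag and paths; MODEL_PATH, PORT and GPU_LAYERS are the environment variables read by the image's llama-server.sh, as visible in the logs further down):

# assumption: tag 0.7.2 and the /models mount point are placeholders
podman run --rm -p 8001:8001 \
  -v /path/to/models:/models \
  -e MODEL_PATH=/models/model.gguf \
  -e GPU_LAYERS=999 \
  quay.io/ramalama/ramalama-llama-server:0.7.2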

Screenshot / video of UI

What issues does this PR fix or reference?

fixes #2630

How to test this PR?

try to start a playground/service for a model

@benoitf benoitf requested review from jeffmaury and a team as code owners March 14, 2025 08:37
@benoitf benoitf requested review from cdrage and gastoner March 14, 2025 08:37
@benoitf benoitf marked this pull request as draft March 14, 2025 08:38
@benoitf benoitf marked this pull request as ready for review March 17, 2025 09:49
@ScrewTSW
Member

There was an image pull error mentioned regarding libkrun, but when I tested it on our infra, it seems to be working. I haven't tried with this PR yet; I'll have a look tomorrow #2712

@benoitf
Collaborator Author

benoitf commented Mar 17, 2025

note: I tested on macOS with a libkrun podman machine; I haven't tested on Windows

Contributor

@axel7083 axel7083 left a comment

Linux native

  • Documentation is accessible
  • Playground is working as expected
  • /v1/chat/completions works well from Postman

@jeffmaury
Contributor

Model refuses to start: MaziyarPanahi/Mistral-7B-Instruct-v0.3.Q4_K_M


warning: no usable GPU found, --gpu-layers option will be ignored
warning: one possible reason is that llama.cpp was compiled without GPU support
warning: consult docs/build.md for compilation instructions
terminate called after throwing an instance of 'std::runtime_error'
  what():  error: the supplied chat template is not supported: chatml-function-calling
note: llama.cpp was started without --jinja, we only support commonly used templates

/usr/bin/llama-server.sh: line 14:     2 Aborted                 (core dumped) llama-server --model ${MODEL_PATH} --host ${HOST:=0.0.0.0} --port ${PORT:=8001} --gpu_layers ${GPU_LAYERS:=0} ${CHAT_FORMAT}

@jeffmaury
Contributor

Also, I am not getting model metrics with bartowski/granite-3.1-8b-instruct-GGUF; it works with TheBloke/Mistral-7B-Instruct-v0.2-GGUF

@benoitf
Collaborator Author

benoitf commented Mar 17, 2025

For the crash: the requested chat template is not available.

The current list of built-in templates is chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, exaone3, gemma, granite, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, monarch, openchat, orion, phi3, rwkv-world, vicuna, vicuna-orca, zephyr

The model wanted chatml-function-calling, which is not in that list.
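
For reference, overriding the template looks roughly like this when invoking llama-server directly (hypothetical model path; --chat-template is the upstream llama.cpp flag that accepts one of the built-in names above):

# assumption: model path is a placeholder
llama-server --model /models/model.gguf --chat-template chatml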

@benoitf
Collaborator Author

benoitf commented Mar 17, 2025

The function-calling template has been updated to chatml in the latest commit.

@benoitf
Collaborator Author

benoitf commented Mar 17, 2025

We might need an enhancement in RamaLama to start the server with the --jinja flag.
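
For context, a minimal sketch of what that would look like (hypothetical model path; --jinja is the upstream llama.cpp server flag that enables Jinja chat templates, such as the function-calling ones):

# assumption: model path is a placeholder
llama-server --model /models/model.gguf --jinja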

Contributor

@gastoner gastoner left a comment

LGTM on Windows without a GPU

@benoitf
Collaborator Author

benoitf commented Mar 27, 2025

Rebased, and using temporary images for now (until the next release of RamaLama is out), just so the PR can be tried.

It includes this change from RamaLama: containers/ramalama#1053

I'm able to run the functional programming recipe.

@benoitf
Collaborator Author

benoitf commented Mar 27, 2025

It should be OK now (except that I'll replace the custom image with the official one on the next RamaLama release).

Contributor

@axel7083 axel7083 left a comment

On Windows, running the model MaziyarPanahi/Mistral-7B-Instruct-v0.3.Q4_K_M, the server is not starting.

Logs

{"msg":"exec container process `/usr/bin/llama-server.sh`: Exec format error","level":"error","time":"2025-03-28T14:30:48.439048Z"}

Image used quay.io/fbenoit/ramalama-llama-server:jinja-2025-03-27

@benoitf
Collaborator Author

benoitf commented Mar 28, 2025

@axel7083 I think this is expected for now, as my image is arm64-only. I'm waiting for the official image (next release)
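
For anyone hitting the "Exec format error" above, one quick way to check which architecture an image was published for (skopeo used as an example here; jq just extracts the field):

skopeo inspect docker://quay.io/fbenoit/ramalama-llama-server:jinja-2025-03-27 | jq -r '.Architecture'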

@axel7083
Contributor

@axel7083 I think this is expected for now, as my image is arm64-only. I'm waiting for the official image (next release)

Good to know, tag me when I need to test again 👍

@benoitf
Collaborator Author

benoitf commented Mar 31, 2025

@axel7083 the 0.7.2 images of RamaLama became available a few minutes ago, so I switched to these images

@benoitf benoitf requested a review from axel7083 March 31, 2025 14:27
Contributor

@axel7083 axel7083 left a comment

The CPU image is working; however, the CUDA image does not start and gives the following error:

chmod: cannot access './run.sh': No such file or directory

Sorry it took so long to review; the image is 6.75 GB and my internet connection is not great.

@benoitf
Collaborator Author

benoitf commented Mar 31, 2025

@benoitf
Collaborator Author

benoitf commented Apr 1, 2025

Forgot to mention here that I switched to the respin of the images yesterday evening.

@ScrewTSW
Member

ScrewTSW commented Apr 1, 2025

Screenshot_20250401_112510
Currently cannot start projects on GPU (Linux, tgz, NVIDIA RTX 3090)

Succeeds with GPU support disabled
Screenshot_20250401_113025

@benoitf
Collaborator Author

benoitf commented Apr 1, 2025

@axel7083 it looks like run.sh comes from AI Lab:

user = '0';
entrypoint = '/usr/bin/sh';
cmd = ['-c', 'chmod 755 ./run.sh && ./run.sh'];

but the image has no run.sh script anyway

@benoitf
Collaborator Author

benoitf commented Apr 7, 2025

I updated the images of RamaLama

I also updated the entrypoint to be used and removed the chmod operation, as the script already has the correct permissions

it should now work on Windows and Linux
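
As a quick local sanity check that the RamaLama image ships its own server script with the right permissions (hypothetical tag; assumes the image provides /bin/sh; the script path is the one from the logs above):

podman run --rm --entrypoint /bin/sh quay.io/ramalama/ramalama-llama-server:0.7.2 -c 'ls -l /usr/bin/llama-server.sh'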

@benoitf benoitf requested a review from axel7083 April 8, 2025 13:03
"cuda": "ghcr.io/containers/podman-desktop-extension-ai-lab-playground-images/ai-lab-playground-chat-cuda@sha256:e4b57e52c31b379b4a73f8e9536bc130fdea665d88dbd05643350295b3402a2f",
"vulkan": "ghcr.io/containers/podman-desktop-extension-ai-lab-playground-images/ai-lab-playground-chat-vulkan@sha256:6a93b247099643f4f8c78ee9896c2ce4e9a455af114a69be09c16ad36aa51fd2"
"default": "quay.io/ramalama/ramalama-llama-server@sha256:cbadb36fbbc2abf9867a33e6dfe3f2df4a76774259b5d4d24d50f4fc7e525406",
"cuda": "quay.io/ramalama/cuda-llama-server@sha256:cbadb36fbbc2abf9867a33e6dfe3f2df4a76774259b5d4d24d50f4fc7e525406"
Contributor

Can't pull this image?

> podman pull quay.io/ramalama/cuda-llama-server@sha256:cbadb36fbbc2abf9867a33e6dfe3f2df4a76774259b5d4d24d50f4fc7e525406
Trying to pull quay.io/ramalama/cuda-llama-server@sha256:cbadb36fbbc2abf9867a33e6dfe3f2df4a76774259b5d4d24d50f4fc7e525406...
Error: initializing source docker://quay.io/ramalama/cuda-llama-server@sha256:cbadb36fbbc2abf9867a33e6dfe3f2df4a76774259b5d4d24d50f4fc7e525406: reading manifest sha256:cbadb36fbbc2abf9867a33e6dfe3f2df4a76774259b5d4d24d50f4fc7e525406 in quay.io/ramalama/cuda-llama-server: manifest unknown


Collaborator Author

🤦 the tag has been overridden so the SHA is no longer there, updating...
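
For reference, one way to grab the digest a tag currently points at before pinning it (hypothetical tag name; skopeo's JSON output includes the Digest field):

skopeo inspect docker://quay.io/ramalama/cuda-llama-server:0.7.2 | jq -r '.Digest'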

Contributor

@axel7083 axel7083 left a comment

Seems that the CUDA image is a 404?

benoitf added 6 commits April 8, 2025 16:36
fixes containers#2630
Signed-off-by: Florent Benoit <fbenoit@redhat.com>
Signed-off-by: Florent Benoit <fbenoit@redhat.com>
Signed-off-by: Florent Benoit <fbenoit@redhat.com>
…entrypoint

Signed-off-by: Florent Benoit <fbenoit@redhat.com>
Signed-off-by: Florent Benoit <fbenoit@redhat.com>
Signed-off-by: Florent Benoit <fbenoit@redhat.com>
@benoitf
Collaborator Author

benoitf commented Apr 8, 2025

@axel7083 updated the SHA, hoping the tag won't be overwritten today

Contributor

@axel7083 axel7083 left a comment

Okay, after testing with the CUDA image on Windows 11 (WSL2), the current configuration is not able to use the GPU, due to the GPU-layers configuration.

Deep dive

We need to change the default value for the GPU layers (currently the default is -1 when undefined).

We need to update the following to replace -1 with 99.

We also need to update the following comments.

By changing the above, I was able to make the inference go fast fast fast
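
For illustration, a rough equivalent of that working GPU run outside the extension (hypothetical tag, model path, and CDI device name; GPU_LAYERS is the variable read by llama-server.sh in the logs above):

# assumption: the NVIDIA CDI device is configured for podman; tag and paths are placeholders
podman run --rm --device nvidia.com/gpu=all -p 8001:8001 \
  -v /path/to/models:/models \
  -e MODEL_PATH=/models/model.gguf \
  -e GPU_LAYERS=99 \
  quay.io/ramalama/cuda-llama-server:0.7.2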

@benoitf
Collaborator Author

benoitf commented Apr 8, 2025

ok so basically I had to apply the env for libkrun, but we need that for Windows/NVIDIA as well

@axel7083
Contributor

axel7083 commented Apr 8, 2025

ok so basically I had to apply the env for libkrun, but we need that for Windows/NVIDIA as well

Yes, before it was working because -1 maxed out the layers, but now -1 gives 0 layers offloaded.

…ound

Signed-off-by: Florent Benoit <fbenoit@redhat.com>
@benoitf
Collaborator Author

benoitf commented Apr 8, 2025

@axel7083 PR amended

Contributor

@axel7083 axel7083 left a comment

LGTM

Contributor

@jeffmaury jeffmaury left a comment

LGTM.

tested some recipes and models (ok not all combinations) without issues on Windows

Member

@ScrewTSW ScrewTSW left a comment

LGTM, ramalama inference starts

@benoitf benoitf merged commit e34d59f into containers:main Apr 9, 2025
7 of 8 checks passed

Successfully merging this pull request may close these issues.

Use ramalama images for llama-cpp models
5 participants