However, when I switch to the ONNX model, I get the following error:
k logs -f gembed-predictor-00002-deployment-7d4b8d6f67-8g9xf -n kserve
2025-03-27T11:58:40.694775Z INFO text_embeddings_router: router/src/main.rs:185: Args { model_id: "/****", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: None, hostname: "gembed-predictor-00002-deployment-7d4b8d6f67-8g9xf", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2025-03-27T11:58:40.698252Z WARN text_embeddings_router: router/src/lib.rs:403: The --pooling arg is not set and we could not find a pooling configuration (1_Pooling/config.json) for this model but the model is a BERT variant. Defaulting to CLS pooling.
2025-03-27T11:58:41.365892Z WARN text_embeddings_router: router/src/lib.rs:188: Could not find a Sentence Transformers config
2025-03-27T11:58:41.365911Z INFO text_embeddings_router: router/src/lib.rs:192: Maximum number of tokens per request: 8192
2025-03-27T11:58:41.366116Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:28: Starting 1 tokenization workers
2025-03-27T11:58:41.864311Z INFO text_embeddings_router: router/src/lib.rs:234: Starting model backend
2025-03-27T11:58:41.865066Z ERROR text_embeddings_backend: backends/src/lib.rs:388: Could not start Candle backend: Could not start backend: No such file or directory (os error 2)
Error: Could not create backend
3. When I change the image to cpu-1.6 and test it, it works normally:
2025-03-27T14:11:33.231208Z INFO text_embeddings_router: router/src/main.rs:175: Args { model_id: "/****", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "gembed-predictor-00001-deployment-56ccb599cf-gzjp8", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2025-03-27T14:11:33.237362Z WARN text_embeddings_router: router/src/lib.rs:392: The --pooling arg is not set and we could not find a pooling configuration (1_Pooling/config.json) for this model but the model is a BERT variant. Defaulting to CLS pooling.
2025-03-27T14:11:33.897769Z WARN text_embeddings_router: router/src/lib.rs:184: Could not find a Sentence Transformers config
2025-03-27T14:11:33.897784Z INFO text_embeddings_router: router/src/lib.rs:188: Maximum number of tokens per request: 8192
2025-03-27T14:11:33.898748Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:28: Starting 1 tokenization workers
2025-03-27T14:11:34.405665Z INFO text_embeddings_router: router/src/lib.rs:230: Starting model backend
2025-03-27T14:11:40.755400Z WARN text_embeddings_router: router/src/lib.rs:258: Backend does not support a batch size > 8
2025-03-27T14:11:40.755416Z WARN text_embeddings_router: router/src/lib.rs:259: forcing max_batch_requests=8
2025-03-27T14:11:40.755519Z WARN text_embeddings_router: router/src/lib.rs:310: Invalid hostname, defaulting to 0.0.0.0
2025-03-27T14:11:40.757444Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1812: Starting HTTP server: 0.0.0.0:8080
2025-03-27T14:11:40.757456Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1813: Ready
When I test with the turing-latest image, I get the same error as with turing-1.6.
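For reference, the failing and working runs appear to differ only in the predictor container's image tag; a minimal sketch of that swap, using the image path from the manifest in the Reproduction section below:

# Failing runs:
image: ghcr.io/huggingface/text-embeddings-inference:turing-1.6   # also turing-latest
# Working run (step 3):
image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.6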
Expected behavior
I'm not sure if the issue is with the Turing image or my configuration.
Hey @gogomasaru, thanks for reporting! I'm afraid this is indeed expected: the ONNX backend is enabled via the ort feature, which is not included in the CUDA images because it's not enabled when compiling Text Embeddings Inference (TEI) for those images.
So the general recommendation is: use a CUDA-compatible image when you have a CUDA device, so you benefit from the accelerator's speed-ups; if you only have CPUs, use a CPU-compatible image, which comes with both the ONNX and Safetensors backends.
Thanks in advance 🤗 Let us know if there's anything else we can help with!
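To make the file expectations concrete, here is a hedged sketch of what each backend looks for under the directory passed via --model-id. The backend/image pairing comes from the reply above, but the exact file names (model.safetensors, onnx/model.onnx) are assumptions based on common Hugging Face exports, not something verified against this volume:

# Candle backend (CUDA images such as turing-1.6) expects Safetensors weights:
#   /data/model.safetensors      # if absent -> "No such file or directory (os error 2)"
#
# ort backend (CPU images such as cpu-1.6) can also load an ONNX export:
#   /data/onnx/model.onnx        # or /data/model.onnx, depending on the export layout
#
# Both backends additionally need config.json and tokenizer.json in /data.
args:
  - '--model-id'
  - /data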
System Info
Offline and air-gapped environment
OS version: rhel8.19
Model: bge-m3
Hardware: NVIDIA T4 GPU
Deployment: Kubernetes (KServe)
Current version: turing-1.6
Reproduction
image: turing-1.6
pytorch_model.bin
YAML:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gembed
  namespace: kserve
spec:
  predictor:
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
    containers:
      - args:
          - '--model-id'
          - /data
        env:
          - name: HUGGINGFACE_HUB_CACHE
            value: /data
        image: ghcr.io/huggingface/text-embeddings-inference:turing-latest
        imagePullPolicy: IfNotPresent
        name: kserve-container
        ports:
          - containerPort: 8080
            protocol: TCP
        resources:
          limits:
            cpu: '1'
            memory: 4Gi
            nvidia.com/gpu: '1'
          requests:
            cpu: '1'
            memory: 1Gi
            nvidia.com/gpu: '1'
        volumeMounts:
          - name: gembed-onnx-volume
            mountPath: /data
    maxReplicas: 1
    minReplicas: 1
    volumes:
      - name: gembed-onnx-volume
        persistentVolumeClaim:
          claimName: gembed-onnx-pv-claim
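Following the reply above, a minimal sketch of the same InferenceService adjusted for the CPU image: the image tag is swapped to cpu-1.6, and the GPU toleration and nvidia.com/gpu resources are dropped since the CPU image doesn't use the accelerator. All names, the PVC, and ports are carried over from the manifest above rather than verified independently:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gembed
  namespace: kserve
spec:
  predictor:
    containers:
      - name: kserve-container
        # CPU image bundles the ort (ONNX) backend
        image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.6
        imagePullPolicy: IfNotPresent
        args:
          - '--model-id'
          - /data
        env:
          - name: HUGGINGFACE_HUB_CACHE
            value: /data
        ports:
          - containerPort: 8080
            protocol: TCP
        resources:
          limits:
            cpu: '1'
            memory: 4Gi
          requests:
            cpu: '1'
            memory: 1Gi
        volumeMounts:
          - name: gembed-onnx-volume
            mountPath: /data
    maxReplicas: 1
    minReplicas: 1
    volumes:
      - name: gembed-onnx-volume
        persistentVolumeClaim:
          claimName: gembed-onnx-pv-claim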