Model description
I have a problem running Gemma 3 12B-it on my server. I have two Quadro RTX 8000 GPUs. When I try to run the model on the server with Docker, I get this error: "window_size_left is only available with flash attn v2".
This is the command I use to run the model:
docker run -itd --gpus all -p 8090:80 -v /MODEL_PATH/models--google--gemma-3-12b-it:/models ghcr.io/huggingface/text-generation-inference:3.2.0 --model-id /models --trust-remote-code
MODEL_PATH is my local path.
The full stdout/error output is here:
2025-03-17T07:49:06.934488Z INFO text_generation_launcher: Args {
model_id: "/models",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: None,
speculate: None,
dtype: None,
kv_cache_dtype: None,
trust_remote_code: true,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: None,
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "66b3e932e479",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: None,
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
usage_stats: On,
payload_limit: 2000000,
enable_prefill_logprobs: false,
}
2025-03-17T07:49:08.661952Z INFO text_generation_launcher: Disabling prefix caching because of VLM model
2025-03-17T07:49:08.661984Z INFO text_generation_launcher: Forcing attention to 'paged' because head dim is not supported by flashinfer, also disabling prefix caching
2025-03-17T07:49:08.661996Z INFO text_generation_launcher: Using attention paged - Prefix caching 0
2025-03-17T07:49:08.685972Z WARN text_generation_launcher: Unkown compute for card quadro-rtx-8000
2025-03-17T07:49:08.708377Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 8000
2025-03-17T07:49:08.708391Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-03-17T07:49:08.708397Z WARN text_generation_launcher: trust_remote_code is set. Trusting that model /models do not contain malicious code.
2025-03-17T07:49:08.708523Z INFO download: text_generation_launcher: Starting check and download process for /models
2025-03-17T07:49:13.550626Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-03-17T07:49:14.431533Z INFO download: text_generation_launcher: Successfully downloaded weights for /models
2025-03-17T07:49:14.431782Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-03-17T07:49:19.487534Z INFO text_generation_launcher: Using prefix caching = False
2025-03-17T07:49:19.487608Z INFO text_generation_launcher: Using Attention = paged
2025-03-17T07:49:24.455842Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-17T07:49:34.464077Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-17T07:49:44.536315Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-17T07:49:54.609719Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-17T07:49:55.903453Z INFO text_generation_launcher: Using prefill chunking = False
2025-03-17T07:49:56.840458Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2025-03-17T07:49:56.911755Z INFO shard-manager: text_generation_launcher: Shard ready in 42.466748423s rank=0
2025-03-17T07:49:56.978093Z INFO text_generation_launcher: Starting Webserver
2025-03-17T07:49:57.029434Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2025-03-17T07:49:57.433717Z INFO text_generation_launcher: Using optimized Triton indexing kernels.
2025-03-17T07:50:33.808299Z INFO text_generation_launcher: KV-cache blocks: 1262, size: 16
2025-03-17T07:50:33.912370Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2025-03-17T07:50:37.271253Z INFO text_generation_router_v3: backends/v3/src/lib.rs:137: Setting max batch total tokens to 20192
2025-03-17T07:50:37.271292Z INFO text_generation_router_v3: backends/v3/src/lib.rs:166: Using backend V3
2025-03-17T07:50:37.271298Z INFO text_generation_router: backends/v3/src/main.rs:162: Maximum input tokens defaulted to 7999
2025-03-17T07:50:37.271303Z INFO text_generation_router: backends/v3/src/main.rs:168: Maximum total tokens defaulted to 8000
2025-03-17T07:50:37.271414Z WARN text_generation_router::server: router/src/server.rs:1648: Tokenizer_config None - Some("/models/tokenizer_config.json")
2025-03-17T07:50:37.273899Z INFO text_generation_router::server: router/src/server.rs:1661: Using chat template from chat_template.json
2025-03-17T07:50:45.068606Z INFO text_generation_router::server: router/src/server.rs:1716: Using config Some(Gemma3(Gemma3 { vision_config: Gemma3VisionConfig { image_size: 896, patch_size: 14 } }))
2025-03-17T07:50:45.068693Z WARN text_generation_router::server: router/src/server.rs:1776: no pipeline tag found for model /models
2025-03-17T07:50:45.068700Z WARN text_generation_router::server: router/src/server.rs:1879: Invalid hostname, defaulting to 0.0.0.0
2025-03-17T07:50:45.310356Z INFO text_generation_router::server: router/src/server.rs:2266: Connected
2025-03-17T07:51:38.198370Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
File "/usr/src/.venv/bin/text-generation-server", line 10, in
sys.exit(app())
File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 323, in call
return get_command(self)(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in call
return self.main(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 743, in main
return _main(
File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 198, in _main
rv = self.invoke(ctx)
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke
return __callback(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
return callback(**use_params)
File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
server.serve(
File "/usr/src/server/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/usr/src/.venv/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept
return await response
File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/usr/src/server/text_generation_server/server.py", line 183, in Prefill
generations, next_batch, timings = self.model.generate_token(batch)
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1971, in generate_token
out, speculative_logits = self.forward(batch, adapter_data)
File "/usr/src/server/text_generation_server/models/vlm_causal_lm.py", line 482, in forward
logits, speculative_logits = self.model.forward(
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 888, in forward
hidden_states = self.text_model.model(
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 547, in forward
hidden_states, residual = layer(
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 449, in forward
attn_output = self.self_attn(
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 253, in forward
attn_output = attention(
File "/usr/src/server/text_generation_server/layers/attention/cuda.py", line 295, in attention
raise NotImplementedError(
NotImplementedError: window_size_left is only available with flash attn v2
2025-03-17T07:51:38.199268Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: window_size_left is only available with flash attn v2
2025-03-17T07:51:38.200715Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: None, return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }}:async_stream:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:546: Request failed during generation: Server error: window_size_left is only available with flash attn v2
2025-03-17T07:51:40.101659Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2025-03-17 07:49:16.391 | INFO | text_generation_server.utils.import_utils:<module>:80 - Detected system cuda
/usr/src/server/text_generation_server/layers/gptq/triton.py:242: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd(cast_inputs=torch.float16)
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
@custom_bwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
@custom_bwd
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Some kwargs in processor config are unused and will not have any effect: image_seq_length.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py:312: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
lengths_tensor = torch.tensor( rank=0
2025-03-17T07:51:40.192591Z ERROR text_generation_launcher: Shard 0 crashed
2025-03-17T07:51:40.192620Z INFO text_generation_launcher: Terminating webserver
2025-03-17T07:51:40.192640Z INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
2025-03-17T07:51:40.192752Z INFO text_generation_router::server: router/src/server.rs:2363: signal received, starting graceful shutdown
2025-03-17T07:51:40.492975Z INFO text_generation_launcher: webserver terminated
2025-03-17T07:51:40.493004Z INFO text_generation_launcher: Shutting down shards
Error: ShardFailed
Open source status
The model implementation is available
The model weights are available
Provide useful links for the implementation
No response
Only flashinfer and flash attention v2 support context windows (sliding-window attention). On an RTX 8000 the following happens:
1. FlashInfer (the preferred default) is not used, because 12B-it has a head dimensionality of 240, which flashinfer does not support (it only handles 64, 128, 256, and 192/128).
2. We therefore fall back to flash attention for prefill, which does support this dimensionality. Unfortunately, the RTX 8000 has compute capability 7.5, which is too old for flash attention v2, so we fall back to flash attention v1, and that fails with Gemma 3 because v1 does not support context windows.
This particular Gemma 3 variant can currently only be run on a newer GPU. Another option is to use a different Gemma 3 model; the 27B model, for example, has a head size of 128.
If flashinfer supports more head sizes in the future, it may become possible to run the 12B model.
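For anyone who wants to check a checkpoint and GPU against these constraints before launching the container, here is a minimal, illustrative Python sketch. It is not part of TGI: the supported head-dim list, the way the head dim is estimated from config.json, and the model path are assumptions taken from this thread, and TGI's internal logic may differ.

```python
# Illustrative sketch only -- not TGI code. It mirrors the reasoning above:
# flashinfer needs a supported head dim, flash attention v2 needs compute
# capability >= 8.0, and Gemma 3's sliding windows need one of the two.
import json
from pathlib import Path

import torch

# Head dims this thread lists as supported by flashinfer (the special
# 192/128 split-dim case is ignored here for simplicity).
FLASHINFER_HEAD_DIMS = {64, 128, 256}


def estimate_head_dim(model_dir: str) -> int:
    """Estimate the attention head dim from config.json.

    Assumptions: config.json sits directly in model_dir (in a Hugging Face
    cache layout it lives under snapshots/<hash>/ instead), and the head dim
    is hidden_size / num_attention_heads. Some configs carry an explicit
    "head_dim" field, and which value TGI actually uses is an implementation
    detail, so treat this as a rough estimate.
    """
    cfg = json.loads(Path(model_dir, "config.json").read_text())
    text_cfg = cfg.get("text_config", cfg)  # Gemma 3 nests the LM config
    return int(text_cfg["hidden_size"]) // int(text_cfg["num_attention_heads"])


def check(model_dir: str) -> None:
    head_dim = estimate_head_dim(model_dir)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"estimated head_dim={head_dim}, compute capability=sm_{major}{minor}")
    if head_dim in FLASHINFER_HEAD_DIMS:
        print("flashinfer should handle this head dim")
    elif (major, minor) >= (8, 0):  # Ampere or newer
        print("flashinfer cannot take this head dim, but flash attention v2 can")
    else:
        print("neither flashinfer nor flash attention v2 applies; "
              "sliding-window models such as Gemma 3 will fail here")


if __name__ == "__main__":
    # Path taken from this issue; adjust it to wherever config.json actually lives.
    check("/MODEL_PATH/models--google--gemma-3-12b-it")
```

On the setup from this issue it should report sm_75 and an estimated head dim of 240, which matches the fallback chain described above.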