2025-03-28T14:39:27.430620Z INFO text_generation_launcher: Runtime environment: Target: x86_64-unknown-linux-gnu Cargo version: 1.85.0 Commit sha: 4d28897b4e345f4dfdd93d3434e50ac8afcdf9e1 Docker label: sha-4d28897 nvidia-smi: Fri Mar 28 14:39:27 2025 +-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | |-----------------------------------------+------------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+========================+======================| | 0 NVIDIA GeForce RTX 3060 Off | 00000000:08:00.0 On | N/A | | 0% 51C P3 32W / 170W | 6618MiB / 12288MiB | 39% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ +-----------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=========================================================================================| +-----------------------------------------------------------------------------------------+ xpu-smi: N/A hpu-smi: N/A 2025-03-28T14:39:27.430665Z INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, kv_cache_dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "46c8c5d5f669", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], api_key: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true, max_client_batch_size: 4, lora_adapters: None, usage_stats: On, payload_limit: 2000000, enable_prefill_logprobs: false, }
Just running Granite-Vision-3.2-2B causes a crash on start-up:
docker run --gpus all --shm-size 1g -p 8080:80 -v ./models:/data ghcr.io/huggingface/text-generation-inference:3.2.1 --model-id ibm-granite/granite-vision-3.2-2b
2025-03-28T14:33:21.511700Z INFO text_generation_launcher: Using Attention = flashdecoding 2025-03-28T14:33:24.392413Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0 2025-03-28T14:33:34.400967Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0 2025-03-28T14:33:44.409181Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0 2025-03-28T14:33:49.270771Z ERROR text_generation_launcher: Error when initializing model Traceback (most recent call last): File "/usr/src/.venv/bin/text-generation-server", line 10, in <module> sys.exit(app()) File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 323, in __call__ return get_command(self)(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in __call__ return self.main(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 743, in main return _main( File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 198, in _main rv = self.invoke(ctx) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1697, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke return ctx.invoke(self.callback, **ctx.params) File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke return __callback(*args, **kwargs) File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper return callback(**use_params) File "/usr/src/server/text_generation_server/cli.py", line 119, in serve server.serve( File "/usr/src/server/text_generation_server/server.py", line 315, in serve asyncio.run( File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run return runner.run(main) File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run return self._loop.run_until_complete(task) File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete self.run_forever() File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 608, in run_forever self._run_once() File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once handle._run() File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/events.py", line 84, in _run self._context.run(self._callback, *self._args) > File "/usr/src/server/text_generation_server/server.py", line 268, in serve_inner model = get_model_with_lora_adapters( File "/usr/src/server/text_generation_server/models/__init__.py", line 1690, in get_model_with_lora_adapters model = get_model( File "/usr/src/server/text_generation_server/models/__init__.py", line 1586, in get_model return VlmCausalLM( File "/usr/src/server/text_generation_server/models/vlm_causal_lm.py", line 362, in __init__ super().__init__( File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1269, in __init__ model = model_class(prefix, config, weights) File "/usr/src/server/text_generation_server/models/custom_modeling/llava_next.py", line 120, in __init__ if config.vision_feature_layer < 0: TypeError: '<' not supported between instances of 'list' and 'int' 
2025-03-28T14:33:51.727103Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output: 2025-03-28 14:33:17.195 | INFO | text_generation_server.utils.import_utils:<module>:76 - Detected system cuda /usr/src/server/text_generation_server/layers/gptq/triton.py:242: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. @custom_fwd(cast_inputs=torch.float16) /usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. @custom_fwd /usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. @custom_bwd /usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. @custom_fwd /usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. @custom_bwd Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. ╭───────────────────── Traceback (most recent call last) ──────────────────────╮ │ │ │ /usr/src/server/text_generation_server/models/custom_modeling/llava_next.py: │ │ 120 in __init__ │ │ │ │ 117 │ │ vision_config = config.vision_config │ │ 118 │ │ # Instead of selecting in hidden_states[-2]. 
│ │ 119 │ │ # Instead compute only the n -2 + 1 layers and don't pool │ │ ❱ 120 │ │ if config.vision_feature_layer < 0: │ │ 121 │ │ │ vision_config.num_hidden_layers += config.vision_feature_l │ │ 122 │ │ else: │ │ 123 │ │ │ vision_config.num_hidden_layers = config.vision_feature_la │ │ │ │ ╭───────────────────────────────── locals ─────────────────────────────────╮ │ │ │ config = LlavaNextConfig { │ │ │ │ "_name_or_path": "ibm-granite/granite-vision-3.2-2b", │ │ │ │ "architectures": [ │ │ │ │ │ "LlavaNextForConditionalGeneration" │ │ │ │ ], │ │ │ │ "image_grid_pinpoints": [ │ │ │ │ │ [ │ │ │ │ │ 384, │ │ │ │ │ 384 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 384, │ │ │ │ │ 768 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 384, │ │ │ │ │ 1152 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 384, │ │ │ │ │ 1536 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 384, │ │ │ │ │ 1920 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 384, │ │ │ │ │ 2304 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 384, │ │ │ │ │ 2688 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 384, │ │ │ │ │ 3072 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 384, │ │ │ │ │ 3456 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 384, │ │ │ │ │ 3840 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 768, │ │ │ │ │ 384 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 768, │ │ │ │ │ 768 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 768, │ │ │ │ │ 1152 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 768, │ │ │ │ │ 1536 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 768, │ │ │ │ │ 1920 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 1152, │ │ │ │ │ 384 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 1152, │ │ │ │ │ 768 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 1152, │ │ │ │ │ 1152 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 1536, │ │ │ │ │ 384 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 1536, │ │ │ │ │ 768 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 1920, │ │ │ │ │ 384 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 1920, │ │ │ │ │ 768 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 2304, │ │ │ │ │ 384 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 2688, │ │ │ │ │ 384 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 3072, │ │ │ │ │ 384 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 3456, │ │ │ │ │ 384 │ │ │ │ │ ], │ │ │ │ │ [ │ │ │ │ │ 3840, │ │ │ │ │ 384 │ │ │ │ │ ] │ │ │ │ ], │ │ │ │ "image_seq_length": 576, │ │ │ │ "image_token_index": 49155, │ │ │ │ "model_type": "llava_next", │ │ │ │ "multimodal_projector_bias": true, │ │ │ │ "projector_hidden_act": "gelu", │ │ │ │ "quantize": null, │ │ │ │ "speculator": null, │ │ │ │ "text_config": { │ │ │ │ │ "architectures": [ │ │ │ │ │ "GraniteForCausalLM" │ │ │ │ │ ], │ │ │ │ │ "attention_dropout": 0.1, │ │ │ │ │ "attention_multiplier": 0.015625, │ │ │ │ │ "bos_token_id": 0, │ │ │ │ │ "embedding_multiplier": 12.0, │ │ │ │ │ "eos_token_id": 0, │ │ │ │ │ "hidden_size": 2048, │ │ │ │ │ "intermediate_size": 8192, │ │ │ │ │ "logits_scaling": 8.0, │ │ │ │ │ "max_position_embeddings": 131072, │ │ │ │ │ "model_type": "granite", │ │ │ │ │ "num_hidden_layers": 40, │ │ │ │ │ "num_key_value_heads": 8, │ │ │ │ │ "pad_token_id": 0, │ │ │ │ │ "residual_multiplier": 0.22, │ │ │ │ │ "rms_norm_eps": 1e-05, │ │ │ │ │ "rope_theta": 300000, │ │ │ │ │ "tie_word_embeddings": true, │ │ │ │ │ "torch_dtype": "bfloat16", │ │ │ │ │ "vocab_size": 49156 │ │ │ │ }, │ │ │ │ "tie_word_embeddings": true, │ │ │ │ "transformers_version": "4.49.0", │ │ │ │ "use_image_newline_parameter": true, │ │ │ │ "vision_config": { │ │ │ │ │ "hidden_act": "gelu_pytorch_tanh", │ │ │ │ │ "hidden_size": 1152, │ │ │ │ │ "image_size": 384, │ │ │ │ │ "intermediate_size": 4304, │ │ │ │ │ "layer_norm_eps": 1e-06, │ │ │ │ │ "model_type": "siglip_vision_model", │ │ │ │ │ "num_attention_heads": 16, │ │ │ │ │ "num_hidden_layers": 27, │ │ │ │ │ 
"patch_size": 14, │ │ │ │ │ "quantize": null │ │ │ │ }, │ │ │ │ "vision_feature_layer": [ │ │ │ │ │ -24, │ │ │ │ │ -20, │ │ │ │ │ -12, │ │ │ │ │ -1 │ │ │ │ ], │ │ │ │ "vision_feature_select_strategy": "full" │ │ │ │ } │ │ │ │ prefix = None │ │ │ │ self = LlavaNextForConditionalGeneration() │ │ │ │ vision_config = SiglipVisionConfig { │ │ │ │ "attention_dropout": 0.0, │ │ │ │ "hidden_act": "gelu_pytorch_tanh", │ │ │ │ "hidden_size": 1152, │ │ │ │ "image_size": 384, │ │ │ │ "intermediate_size": 4304, │ │ │ │ "layer_norm_eps": 1e-06, │ │ │ │ "model_type": "siglip_vision_model", │ │ │ │ "num_attention_heads": 16, │ │ │ │ "num_channels": 3, │ │ │ │ "num_hidden_layers": 27, │ │ │ │ "patch_size": 14, │ │ │ │ "quantize": null, │ │ │ │ "transformers_version": "4.49.0" │ │ │ │ } │ │ │ │ weights = <text_generation_server.utils.weights.Weights object at │ │ │ │ 0x71fc54eaaa10> │ │ │ ╰──────────────────────────────────────────────────────────────────────────╯ │ ╰──────────────────────────────────────────────────────────────────────────────╯ TypeError: '<' not supported between instances of 'list' and 'int' rank=0 2025-03-28T14:33:51.791963Z ERROR text_generation_launcher: Shard 0 failed to start 2025-03-28T14:33:51.791989Z INFO text_generation_launcher: Shutting down shards Error: ShardCannotStart
It's caused by this check:
text-generation-inference/server/text_generation_server/models/custom_modeling/llava_next.py, line 120 at commit 0142550:
The code expects vision_feature_layer to be a number, but in Granite's config it's a list of values:
"vision_feature_layer": [ -24, -20, -12, -1 ],
I don't know whether Granite doesn't follow the intended schema or whether this is purely a TGI issue.
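One possible direction for a fix, purely as a sketch and not the actual TGI code: accept either an int or a list for vision_feature_layer and keep as many vision layers as the deepest requested feature needs, which is what the existing int-only branch effectively does. Variable names below are illustrative.

```python
# Sketch: tolerate Granite's list-valued "vision_feature_layer"
# in addition to the single-int form LlavaNext normally uses.
feature_layers = config.vision_feature_layer
if isinstance(feature_layers, int):
    feature_layers = [feature_layers]

num_layers = vision_config.num_hidden_layers
# Convert negative indices (counted from the end) to absolute positions,
# then keep only as many vision layers as the deepest feature requires.
deepest = max(l if l >= 0 else num_layers + l for l in feature_layers)
vision_config.num_hidden_layers = deepest + 1
```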
I would have expected the model to load without any issues, as it uses the LlavaNextForConditionalGeneration architecture.