
When using --quantize fp8, model hangs and does not respond at all, with "ERROR: Arch conditional MMA instruction used without targeting appropriate compute capability. Aborting." #3135

Open
1 of 4 tasks
tamastarjanyi opened this issue Mar 24, 2025 · 1 comment

Comments

@tamastarjanyi

System Info

The environment is an OpenShift Kubernetes container; see details below.
Any version > 3.1.0 produces this error (3.1.0 still works fine).
Always using official, unmodified Docker images.

2025-03-24T13:02:19.134113Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.85.0
Commit sha: 4d28897b4e345f4dfdd93d3434e50ac8afcdf9e1
Docker label: sha-4d28897
nvidia-smi:
Mon Mar 24 13:02:19 2025       
   +-----------------------------------------------------------------------------------------+
   | NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
   |-----------------------------------------+------------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
   |                                         |                        |               MIG M. |
   |=========================================+========================+======================|
   |   0  NVIDIA H100 80GB HBM3          On  |   00000000:18:00.0 Off |                    0 |
   | N/A   34C    P0            113W /  700W |   63090MiB /  81559MiB |      0%      Default |
   |                                         |                        |             Disabled |
   +-----------------------------------------+------------------------+----------------------+
                                                                                            
   +-----------------------------------------------------------------------------------------+
   | Processes:                                                                              |
   |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
   |        ID   ID                                                               Usage      |
   |=========================================================================================|
   +-----------------------------------------------------------------------------------------+
xpu-smi:
N/A
hpu-smi:
N/A

2025-03-24T13:02:19.134137Z  INFO text_generation_launcher: Args {
    model_id: "bigscience/bloom-560m",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    kv_cache_dtype: None,
    trust_remote_code: true,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "ecdollm-tgi-px-5bc7b9fb77-z8gtm",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data/hf/hub/",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: true,
    max_client_batch_size: 4,
    lora_adapters: None,
    usage_stats: On,
    payload_limit: 2000000,
    enable_prefill_logprobs: false,
}

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Enabling --quantize fp8 produces the error log below in DEBUG mode; the model appears to hang and does not respond at all.
No error with the following arguments, and the model works:
--model-id meta-llama/Llama-3.1-8B-Instruct --cuda-memory-fraction 0.95 --max-input-tokens 80000 --max-total-tokens 80100

With the following arguments the model does not respond and raises an error in DEBUG mode:
--model-id meta-llama/Llama-3.1-8B-Instruct --quantize fp8 --max-input-tokens 79990 --max-total-tokens 80000
ERROR: Arch conditional MMA instruction used without targeting appropriate compute capability. Aborting.
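The error suggests an FP8 kernel compiled with "arch conditional" instructions (on Hopper, these require the sm_90a compile target) is running on a build that does not target that compute capability. As background, a minimal sketch of the capability check involved, assuming NVIDIA's published compute-capability values (the helper name here is ours, not part of TGI):

```python
# FP8 tensor-core MMA first appeared with compute capability 8.9 (Ada
# Lovelace, e.g. L40S); Hopper (H100) is 9.0. Hopper's wgmma instructions
# are "arch conditional": they additionally require code compiled for the
# sm_90a target, which is what the error message refers to.
FP8_MIN_SM = (8, 9)

def supports_fp8_mma(sm: tuple) -> bool:
    """Return True if the compute capability has FP8 tensor cores (>= 8.9)."""
    return sm >= FP8_MIN_SM

print(supports_fp8_mma((9, 0)))  # H100 -> True
print(supports_fp8_mma((8, 6)))  # A10/A40 (Ampere) -> False
```

Both the H100 (9.0) and the L40S (8.9) support FP8 in hardware, so a hang on H100 but not L40S points at the sm_90a-specific kernel path rather than at FP8 support itself.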

Expected behavior

The model should work with both argument sets, but using fp8 causes the error and the model does not respond.

@Trofleb

Trofleb commented Apr 3, 2025

We're having the same issue. Here is some additional information:

  • It seems very specific to the H100, as I'm not seeing the same problem on L40S GPUs.
  • When I use a non-quantized model, everything works fine.
  • I get the massive slowdown with either a pre-quantized model or the --quantize argument.

Model: Qwen/Qwen2.5-Coder-14B-Instruct

Hope this helps; I'm available to test different configs if that would be useful.
