
When using --quantize fp8, model hangs and does not respond at all, with "ERROR: Arch conditional MMA instruction used without targeting appropriate compute capability. Aborting." #3135

Open
1 of 4 tasks
tamastarjanyi opened this issue Mar 24, 2025 · 1 comment

Comments

@tamastarjanyi

System Info

The environment is an OpenShift Kubernetes container; see details below.
Any version > 3.1.0 produces this error (3.1.0 still works fine).
Always using official, unmodified Docker images.

2025-03-24T13:02:19.134113Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.85.0
Commit sha: 4d28897b4e345f4dfdd93d3434e50ac8afcdf9e1
Docker label: sha-4d28897
nvidia-smi:
Mon Mar 24 13:02:19 2025       
   +-----------------------------------------------------------------------------------------+
   | NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
   |-----------------------------------------+------------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
   |                                         |                        |               MIG M. |
   |=========================================+========================+======================|
   |   0  NVIDIA H100 80GB HBM3          On  |   00000000:18:00.0 Off |                    0 |
   | N/A   34C    P0            113W /  700W |   63090MiB /  81559MiB |      0%      Default |
   |                                         |                        |             Disabled |
   +-----------------------------------------+------------------------+----------------------+
                                                                                            
   +-----------------------------------------------------------------------------------------+
   | Processes:                                                                              |
   |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
   |        ID   ID                                                               Usage      |
   |=========================================================================================|
   +-----------------------------------------------------------------------------------------+
xpu-smi:
N/A
hpu-smi:
N/A

2025-03-24T13:02:19.134137Z  INFO text_generation_launcher: Args {
    model_id: "bigscience/bloom-560m",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    kv_cache_dtype: None,
    trust_remote_code: true,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "ecdollm-tgi-px-5bc7b9fb77-z8gtm",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data/hf/hub/",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: true,
    max_client_batch_size: 4,
    lora_adapters: None,
    usage_stats: On,
    payload_limit: 2000000,
    enable_prefill_logprobs: false,
}

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Enabling --quantize fp8 produces the error log below in DEBUG mode; the model appears to hang and does not respond at all.
No error with the following arguments, and the model works:
--model-id meta-llama/Llama-3.1-8B-Instruct --cuda-memory-fraction 0.95 --max-input-tokens 80000 --max-total-tokens 80100

With the following arguments the model does not respond and raises an error in DEBUG mode:
--model-id meta-llama/Llama-3.1-8B-Instruct --quantize fp8 --max-input-tokens 79990 --max-total-tokens 80000
ERROR: Arch conditional MMA instruction used without targeting appropriate compute capability. Aborting.
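The error suggests an FP8 kernel compiled with "arch conditional" instructions (on Hopper, these require the sm_90a compile target) is running on a build that does not target that compute capability. As background, a minimal sketch of the capability check involved, assuming NVIDIA's published compute-capability values (the helper name here is ours, not part of TGI):

```python
# FP8 tensor-core MMA first appeared with compute capability 8.9 (Ada
# Lovelace, e.g. L40S); Hopper (H100) is 9.0. Hopper's wgmma instructions
# are "arch conditional": they additionally require code compiled for the
# sm_90a target, which is what the error message refers to.
FP8_MIN_SM = (8, 9)

def supports_fp8_mma(sm: tuple) -> bool:
    """Return True if the compute capability has FP8 tensor cores (>= 8.9)."""
    return sm >= FP8_MIN_SM

print(supports_fp8_mma((9, 0)))  # H100 -> True
print(supports_fp8_mma((8, 6)))  # A10/A40 (Ampere) -> False
```

Both the H100 (9.0) and the L40S (8.9) support FP8 in hardware, so a hang on H100 but not L40S points at the sm_90a-specific kernel path rather than at FP8 support itself.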

Expected behavior

The model should work with both argument sets, but using fp8 causes the error and the model does not respond.

@Trofleb

Trofleb commented Apr 3, 2025

We're having the same issue. Here is some additional information:

  • It seems very specific to the H100, as I'm not seeing the same problem on L40S GPUs.
  • When I use a non-quantized model, everything works fine.
  • I get the massive slowdown with either a pre-quantized model or the --quantize argument.

Model: Qwen/Qwen2.5-Coder-14B-Instruct

Hope this helps; I'm available to test different configs if that would be useful.
