When using --quantize fp8, the model hangs and does not respond at all: "ERROR: Arch conditional MMA instruction used without targeting appropriate compute capability. Aborting."
#3135
Open
tamastarjanyi opened this issue
Mar 24, 2025
· 1 comment
System Info
Environment is an OpenShift Kubernetes container. See details below.
Any version > 3.1.0 produces this error (3.1.0 still works fine).
We always use the official, unmodified Docker images.
Reproduction
Enabling --quantize fp8 produces the error log below in DEBUG mode; the model appears to hang and does not respond at all.
No error with the following arguments; the model works:
--model-id meta-llama/Llama-3.1-8B-Instruct --cuda-memory-fraction 0.95 --max-input-tokens 80000 --max-total-tokens 80100
With the following arguments, the model does not respond and raises an error in DEBUG mode:
--model-id meta-llama/Llama-3.1-8B-Instruct --quantize fp8 --max-input-tokens 79990 --max-total-tokens 80000
ERROR: Arch conditional MMA instruction used without targeting appropriate compute capability. Aborting.
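For reference, a minimal reproduction sketch using the official text-generation-inference Docker image. The image tag, GPU flags, and port mapping are assumptions for illustration; adjust them for your OpenShift/Kubernetes setup (the original deployment runs as a Kubernetes container, not via `docker run`).

```shell
# Hypothetical standalone reproduction; the image tag and runtime flags
# are assumptions, not taken from the original report.
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --quantize fp8 \
  --max-input-tokens 79990 \
  --max-total-tokens 80000
```

Dropping `--quantize fp8` (and optionally adding `--cuda-memory-fraction 0.95`) corresponds to the working argument set above.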
Expected behavior
The model should work with both argument sets, but enabling fp8 causes the error above and the model stops responding.