Name and Version
PS D:\llama\latest> .\llama-cli.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from D:\llama\latest\ggml-cuda.dll
load_backend: loaded RPC backend from D:\llama\latest\ggml-rpc.dll
load_backend: loaded CPU backend from D:\llama\latest\ggml-cpu-sapphirerapids.dll
version: 7300 (2960eb2)
built with Clang 19.1.5 for Windows x86_64
Operating systems
Windows
GGML backends
CUDA
Hardware
RTX PRO 6000 and RTX A6000
Models
Qwen3-Next, tested with
unsloth\Qwen3-Next-80B-A3B-Instruct-GGUF\Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL
Problem description & steps to reproduce
This doesn't seem to affect other models, but when running inference with Qwen3-Next it looks like some operations are still executed on the CPU, which results in a significant performance loss.
The screenshot was taken during the tg stage of the Qwen3-Next bench run.
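One way to confirm which operations fall back to the CPU is the ggml scheduler debug output. Below is a minimal sketch, assuming the GGML_SCHED_DEBUG environment variable is honored by this build (it makes the ggml backend scheduler print graph splits and the backend each node is assigned to; the exact value semantics may differ between versions):

```bat
REM hedged sketch: dump per-node backend assignments during a short run
REM (assumption: GGML_SCHED_DEBUG=2 prints the full graph and release builds honor it)
set CUDA_VISIBLE_DEVICES=0
set GGML_SCHED_DEBUG=2
d:/llama/latest/llama-cli.exe -m d:\models\unsloth\Qwen3-Next-80B-A3B-Instruct-GGUF\Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL.gguf -ngl 99 -n 32 -p "test" 2> sched_debug.txt
```

Any nodes listed under the CPU backend in sched_debug.txt would identify exactly which ops lack CUDA kernels for this architecture.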

First Bad Commit
No response
Relevant log output
PS D:\llama> .\benchq3nxt.bat
D:\llama>set CUDA_VISIBLE_DEVICES=0
D:\llama>d:/llama/latest/llama-bench.exe -m d:\models\unsloth\Qwen3-Next-80B-A3B-Instruct-GGUF\Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL.gguf
-p 512 -n 512 -b 1024 -ub 512 -ngl 99 -mmp 0 -fa 1 -o md -r 3 -d 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from d:\llama\latest\ggml-cuda.dll
load_backend: loaded RPC backend from d:\llama\latest\ggml-rpc.dll
load_backend: loaded CPU backend from d:\llama\latest\ggml-cpu-sapphirerapids.dll
| model | size | params | backend | ngl | n_batch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ---: | --------------: | -------------------: |
| qwen3next ?B Q4_K - Medium | 42.01 GiB | 79.67 B | CUDA | 99 | 1024 | 1 | 0 | pp512 | 1211.32 ± 29.80 |
| qwen3next ?B Q4_K - Medium | 42.01 GiB | 79.67 B | CUDA | 99 | 1024 | 1 | 0 | tg512 | 21.16 ± 0.20 |
build: 2960eb297 (7300)
For comparison, here is gpt-oss on the same build and hardware; CPU usage stays at background level during both pp and tg.
PS D:\llama> .\bench.bat
D:\llama>set CUDA_VISIBLE_DEVICES=0
D:\llama>d:/llama/latest/llama-bench.exe -m d:\models\lmstudio-community\gpt-oss-120b-GGUF\gpt-oss-120b-MXFP4-00001-of-00002.gguf
-p 512 -n 512 -b 1024 -ub 512 -ngl 99 -mmp 0 -fa 1 -o md -r 3 -d 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from d:\llama\latest\ggml-cuda.dll
load_backend: loaded RPC backend from d:\llama\latest\ggml-rpc.dll
load_backend: loaded CPU backend from d:\llama\latest\ggml-cpu-sapphirerapids.dll
| model | size | params | backend | ngl | n_batch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 1024 | 1 | 0 | pp512 | 4716.71 ± 175.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 1024 | 1 | 0 | tg512 | 241.51 ± 1.02 |
build: 2960eb297 (7300)
PS D:\llama>
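If a scheduler dump like the one above points at specific ops, llama.cpp's test-backend-ops tool (built from tests/test-backend-ops.cpp in the source tree; it is not necessarily shipped in the release zips, so the invocation below is an assumption) can check whether the CUDA backend reports support for a given op:

```bat
REM hedged sketch: probe CUDA support for a single op; the op name here is
REM illustrative, not a confirmed list of what Qwen3-Next actually uses
test-backend-ops.exe test -o SSM_SCAN -b CUDA0
```

Ops reported as unsupported there would be the ones the scheduler silently moves to the CPU backend, which would match the tg slowdown above.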