Eval bug: qwen 3 next CUDA still uses a lot of CPU during inference #17822

@fuutott

Name and Version

PS D:\llama\latest> .\llama-cli.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from D:\llama\latest\ggml-cuda.dll
load_backend: loaded RPC backend from D:\llama\latest\ggml-rpc.dll
load_backend: loaded CPU backend from D:\llama\latest\ggml-cpu-sapphirerapids.dll
version: 7300 (2960eb2)
built with Clang 19.1.5 for Windows x86_64

Operating systems

Windows

GGML backends

CUDA

Hardware

RTX PRO 6000 and RTX A6000

Models

Qwen3 Next, testing with:
unsloth\Qwen3-Next-80B-A3B-Instruct-GGUF\Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL

Problem description & steps to reproduce

This doesn't seem to affect other models, but when running inference with Qwen3 Next it looks like some operations still execute on the CPU, which causes a significant performance loss.
The screenshot below was taken during the tg stage of the Qwen3 Next bench run.
[screenshot: CPU utilization during the Qwen3 Next tg stage]
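
One way to narrow down which ops fall back to the CPU (a sketch, assuming the upstream ggml scheduler debug switch GGML_SCHED_DEBUG behaves the same in this build): set the variable before running the bench, and the backend scheduler prints the compute-graph splits and the backend each node was assigned to. Any qwen3next-specific ops listed under the CPU backend would confirm the fallback.

PS D:\llama> $env:GGML_SCHED_DEBUG = "2"
PS D:\llama> d:/llama/latest/llama-bench.exe -m d:\models\unsloth\Qwen3-Next-80B-A3B-Instruct-GGUF\Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL.gguf -p 64 -n 32 -ngl 99 -fa 1

The assignment dump is very verbose, so small -p/-n values keep it readable.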

First Bad Commit

No response

Relevant log output

PS D:\llama> .\benchq3nxt.bat

D:\llama>set CUDA_VISIBLE_DEVICES=0

D:\llama>d:/llama/latest/llama-bench.exe   -m d:\models\unsloth\Qwen3-Next-80B-A3B-Instruct-GGUF\Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL.gguf   
-p 512   -n 512   -b 1024   -ub 512   -ngl 99  -mmp 0   -fa 1   -o md   -r 3   -d 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from d:\llama\latest\ggml-cuda.dll
load_backend: loaded RPC backend from d:\llama\latest\ggml-rpc.dll
load_backend: loaded CPU backend from d:\llama\latest\ggml-cpu-sapphirerapids.dll
| model                          |       size |     params | backend    | ngl | n_batch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ---: | --------------: | -------------------: |
| qwen3next ?B Q4_K - Medium     |  42.01 GiB |    79.67 B | CUDA       |  99 |    1024 |  1 |    0 |           pp512 |      1211.32 ± 29.80 |
| qwen3next ?B Q4_K - Medium     |  42.01 GiB |    79.67 B | CUDA       |  99 |    1024 |  1 |    0 |           tg512 |         21.16 ± 0.20 |

build: 2960eb297 (7300)

For comparison, here is gpt-oss on the same build and hardware; CPU usage stays at background level during both pp and tg.

PS D:\llama> .\bench.bat

D:\llama>set CUDA_VISIBLE_DEVICES=0

D:\llama>d:/llama/latest/llama-bench.exe   -m d:\models\lmstudio-community\gpt-oss-120b-GGUF\gpt-oss-120b-MXFP4-00001-of-00002.gguf   
-p 512   -n 512   -b 1024   -ub 512   -ngl 99  -mmp 0   -fa 1   -o md   -r 3   -d 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from d:\llama\latest\ggml-cuda.dll
load_backend: loaded RPC backend from d:\llama\latest\ggml-rpc.dll
load_backend: loaded CPU backend from d:\llama\latest\ggml-cpu-sapphirerapids.dll
| model                          |       size |     params | backend    | ngl | n_batch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |    1024 |  1 |    0 |           pp512 |     4716.71 ± 175.03 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |    1024 |  1 |    0 |           tg512 |        241.51 ± 1.02 |

build: 2960eb297 (7300)
PS D:\llama>
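
A lighter-weight check (assuming the graph split count that recent llama.cpp builds log at context creation): load the model in llama-cli and look for the "graph splits" line during startup. With -ngl 99 a fully offloaded model should report a count of 1 (or close to it), while a much larger number means nodes are being routed back to the CPU backend.

PS D:\llama\latest> .\llama-cli.exe -m d:\models\unsloth\Qwen3-Next-80B-A3B-Instruct-GGUF\Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL.gguf -ngl 99 -n 16 -p "test" -no-cnv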

Metadata

Labels

CUDA (Related to the CUDA backend), bug (Something isn't working), performance (Speed related topics)
