Eval bug: qwen 3 next CUDA still uses a lot of CPU during inference #17822

@fuutott

Name and Version

PS D:\llama\latest> .\llama-cli.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from D:\llama\latest\ggml-cuda.dll
load_backend: loaded RPC backend from D:\llama\latest\ggml-rpc.dll
load_backend: loaded CPU backend from D:\llama\latest\ggml-cpu-sapphirerapids.dll
version: 7300 (2960eb2)
built with Clang 19.1.5 for Windows x86_64

Operating systems

Windows

GGML backends

CUDA

Hardware

RTX PRO 6000 and RTX A6000

Models

Qwen3 Next, testing with:
unsloth\Qwen3-Next-80B-A3B-Instruct-GGUF\Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL

Problem description & steps to reproduce

This doesn't seem to affect other models, but when running inference with Qwen3 Next it looks like some operations still execute on the CPU, which causes a significant performance loss.
The screenshot below was taken during the tg stage of the Qwen3 Next bench run.
[screenshot: CPU utilization during the Qwen3 Next tg stage]
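
One way to narrow down which ops fall back to the CPU (a sketch, assuming the upstream ggml scheduler debug switch GGML_SCHED_DEBUG behaves the same in this build): set the variable before running the bench, and the backend scheduler prints the compute-graph splits and the backend each node was assigned to. Any qwen3next-specific ops listed under the CPU backend would confirm the fallback.

PS D:\llama> $env:GGML_SCHED_DEBUG = "2"
PS D:\llama> d:/llama/latest/llama-bench.exe -m d:\models\unsloth\Qwen3-Next-80B-A3B-Instruct-GGUF\Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL.gguf -p 64 -n 32 -ngl 99 -fa 1

The assignment dump is very verbose, so small -p/-n values keep it readable.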

First Bad Commit

No response

Relevant log output

PS D:\llama> .\benchq3nxt.bat

D:\llama>set CUDA_VISIBLE_DEVICES=0

D:\llama>d:/llama/latest/llama-bench.exe   -m d:\models\unsloth\Qwen3-Next-80B-A3B-Instruct-GGUF\Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL.gguf   
-p 512   -n 512   -b 1024   -ub 512   -ngl 99  -mmp 0   -fa 1   -o md   -r 3   -d 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from d:\llama\latest\ggml-cuda.dll
load_backend: loaded RPC backend from d:\llama\latest\ggml-rpc.dll
load_backend: loaded CPU backend from d:\llama\latest\ggml-cpu-sapphirerapids.dll
| model                          |       size |     params | backend    | ngl | n_batch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ---: | --------------: | -------------------: |
| qwen3next ?B Q4_K - Medium     |  42.01 GiB |    79.67 B | CUDA       |  99 |    1024 |  1 |    0 |           pp512 |      1211.32 ± 29.80 |
| qwen3next ?B Q4_K - Medium     |  42.01 GiB |    79.67 B | CUDA       |  99 |    1024 |  1 |    0 |           tg512 |         21.16 ± 0.20 |

build: 2960eb297 (7300)

For comparison, here is gpt-oss on the same build and hardware; CPU usage stays at background level during both pp and tg.

PS D:\llama> .\bench.bat

D:\llama>set CUDA_VISIBLE_DEVICES=0

D:\llama>d:/llama/latest/llama-bench.exe   -m d:\models\lmstudio-community\gpt-oss-120b-GGUF\gpt-oss-120b-MXFP4-00001-of-00002.gguf   
-p 512   -n 512   -b 1024   -ub 512   -ngl 99  -mmp 0   -fa 1   -o md   -r 3   -d 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from d:\llama\latest\ggml-cuda.dll
load_backend: loaded RPC backend from d:\llama\latest\ggml-rpc.dll
load_backend: loaded CPU backend from d:\llama\latest\ggml-cpu-sapphirerapids.dll
| model                          |       size |     params | backend    | ngl | n_batch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |    1024 |  1 |    0 |           pp512 |     4716.71 ± 175.03 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |    1024 |  1 |    0 |           tg512 |        241.51 ± 1.02 |

build: 2960eb297 (7300)
PS D:\llama>
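
A lighter-weight check (assuming the graph split count that recent llama.cpp builds log at context creation): load the model in llama-cli and look for the "graph splits" line during startup. With -ngl 99 a fully offloaded model should report a count of 1 (or close to it), while a much larger number means nodes are being routed back to the CPU backend.

PS D:\llama\latest> .\llama-cli.exe -m d:\models\unsloth\Qwen3-Next-80B-A3B-Instruct-GGUF\Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL.gguf -ngl 99 -n 16 -p "test" -no-cnv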

Metadata

Labels

CUDA (Related to the CUDA backend), bug (Something isn't working), performance (Speed related topics)
