Conversation

@ikawrakow
Owner

CUDA graphs cannot be used with split mode "graph". Capturing CUDA graphs does get disabled after a few failed attempts, but on at least one occasion we got an actual error (manifesting as a crash) while capturing a CUDA graph, so it is better to just disable them a priori.

While at it, I also added the ability to disable CUDA graphs via a command line argument

-cuda graphs=0

Just in case someone wants to see the performance impact of CUDA graphs without rebuilding or fooling around with environment variables.
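
For anyone curious what the change amounts to conceptually, here is a minimal, hypothetical C++ sketch of the decision (SplitMode, CudaOptions, and should_use_cuda_graphs are illustrative names, not the actual ik_llama.cpp symbols): CUDA graph capture is only attempted when the split mode is compatible and the user has not passed -cuda graphs=0.

```cpp
// Hypothetical sketch of the decision logic; not the actual ik_llama.cpp code.
#include <cstdio>

enum class SplitMode { None, Layer, Row, Graph };

struct CudaOptions {
    bool graphs = true;  // corresponds to "-cuda graphs=1" (the default)
};

// CUDA graph capture is attempted only when (a) the split mode is compatible
// and (b) the user has not disabled it on the command line.
static bool should_use_cuda_graphs(SplitMode sm, const CudaOptions & opts) {
    if (sm == SplitMode::Graph) {
        // split mode "graph" cannot be captured reliably, so disable
        // a priori instead of failing (or crashing) at runtime
        return false;
    }
    return opts.graphs;
}

int main() {
    CudaOptions opts;
    opts.graphs = false;  // as if "-cuda graphs=0" had been passed
    std::printf("use_cuda_graph = %d\n",
                should_use_cuda_graphs(SplitMode::Layer, opts) ? 1 : 0);
    return 0;
}
```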

@ubergarm
Contributor

ubergarm commented Dec 5, 2025

> but on at least one occasion we got an actual error (manifesting as a crash) while capturing a CUDA graph

I know the -sm graph implementation is mainly aimed at the full GPU offload case, especially with 2x GPUs, so far. But while working with Geechan over on the BeaverAI Discord to test -sm graph on their rig, they keep getting an error, even with this PR.

I wanted to at least relay the details from the Discord in case anyone else bumps into it as well.

They are using 2x RTX 8000 (Turing, sm75) on a Debian Linux system and updated from the older 550 driver / CUDA 12.4 to the newer driver 580.105 / CUDA 13.0, but still no luck.

$ llama-server --version
version: 4047 (912b8dcd)
built with cc (GCC) 15.2.1 20251112 for x86_64-pc-linux-gnu

$ llama-sweep-bench \
    --n-gpu-layers 999 --threads 64 --threads-batch 64 --batch-size 4096 --ubatch-size 4096 --no-mmap \
    --override-tensor "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15)\.ffn_.*=CUDA0" \
    --override-tensor "blk\.(16|17|18|19|20|21|22|23|24|25|26|27|28)\.ffn_.*=CUDA1" \
    --override-tensor "blk\..*_exps\.=CPU" \
    --ctx-size 33000 -fa on -sm graph -cuda graphs=0 --host 0.0.0.0 \
    --model "/mnt/GLM/4.6/Q8-Q4-Q4-Q5/GLM-4.6-Q8_0-Q4_K-Q4_K-Q5_K-00001-of-00005.gguf" --alias GLM-Q8-Q4-Q4-Q5 \
    --chat-template chatglm4

CUDA error: an illegal memory access was encountered
  current device: 1, in function ggml_backend_cuda_synchronize at /mnt/builds/ik-llama.cpp-cuda/src/ik_llama.cpp/ggml/src/ggml-cuda.cu:3511
  cudaStreamSynchronize(cuda_ctx->stream())
/mnt/builds/ik-llama.cpp-cuda/src/ik_llama.cpp/ggml/src/ggml-cuda.cu:124: CUDA error

I'm still doing some of my own testing with GLM-4.6 hybrid CPU + 2x GPU, and -sm graph seems to be working, including with -ger, giving maybe 20% better PP. I'll update about that later. Thanks!

@ubergarm
Contributor

ubergarm commented Dec 6, 2025

As I too was curious how much benefit one might see from hybrid CPU + 2x GPU with -sm graph, I ran a comparison with GLM-4.6-IQ5_K on the Threadripper Pro 24-core + 2x RTX A6000. Turns out -sm graph wins again!

For fun I also added Kimi-K2-Thinking-smol-IQ2_KS for comparison. It makes the graph harder to read, but it's fun. (I forget if I actually used --merge-qkv on this or not, but honestly it is just a rough comparison since the models are different sizes.) Kimi-K2-Thinking has quite a bit faster TG even without -sm graph support yet. It is also bigger, but I didn't estimate the relative active parameter sizes.

  • IQ5_K: 249.099 GiB (5.997 BPW) (some of this is unused MTP tensors, maybe ~5 GiB)
  • smol-IQ2_KS: 270.133 GiB (2.261 BPW)

Finally, I also tried -cuda graphs=0, which made no measurable difference compared to leaving CUDA graphs on.
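
As a quick sanity check of the "maybe 20% better PP" observation, using the first rows of the two GLM-4.6 tables below: at N_KV = 0 the baseline run does 293.79 t/s PP while the -sm graph run does 356.66 t/s, i.e. roughly a 21% gain (and the gap widens at deeper context). The snippet below is just that arithmetic on numbers copied from the tables, nothing more.

```cpp
// Relative PP speedup from the first sweep-bench rows below
// (numbers copied from the tables; purely illustrative arithmetic).
#include <cstdio>

int main() {
    const double pp_base  = 293.79;  // S_PP t/s, --merge-qkv baseline, N_KV = 0
    const double pp_graph = 356.66;  // S_PP t/s, -sm graph,            N_KV = 0
    std::printf("PP speedup: %.1f%%\n", (pp_graph / pp_base - 1.0) * 100.0);  // ~21.4%
    return 0;
}
```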

[plot: sweep-bench-GLM-4.6-IQ5_K]
👈 Details

ik_llama.cpp main@a3737f42 --merge-qkv

model=/mnt/raid/hf/GLM-4.6-GGUF/IQ5_K/GLM-4.6-IQ5_K-00001-of-00006.gguf
$ ./build/bin/llama-sweep-bench \
    --model "$model" \
    --ctx-size 53248 \
    -ger \
    --merge-qkv \
    -ngl 99 \
    -ot "blk\.(3|4|5|6|7|8|9|10|11)\.ffn_.*=CUDA0" \
    -ot "blk\.(12|13|14|15|16|17|18|19|20)\.ffn_.*=CUDA1" \
    --cpu-moe \
    -ub 4096 -b 4096 \
    --threads 24 \
    --no-mmap \
    --warmup-batch
| PP   | TG   | N_KV  | T_PP s | S_PP t/s | T_TG s  | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|--------:|---------:|
| 4096 | 1024 |     0 | 13.942 |   293.79 |  92.893 |    11.02 |
| 4096 | 1024 |  4096 | 14.872 |   275.41 |  97.536 |    10.50 |
| 4096 | 1024 |  8192 | 15.899 |   257.62 | 102.288 |    10.01 |
| 4096 | 1024 | 12288 | 17.000 |   240.94 | 107.196 |     9.55 |
| 4096 | 1024 | 16384 | 18.095 |   226.36 | 111.520 |     9.18 |
| 4096 | 1024 | 20480 | 19.159 |   213.79 | 115.670 |     8.85 |
| 4096 | 1024 | 24576 | 20.198 |   202.79 | 120.120 |     8.52 |
| 4096 | 1024 | 28672 | 21.276 |   192.52 | 124.717 |     8.21 |
| 4096 | 1024 | 32768 | 22.340 |   183.35 | 129.605 |     7.90 |
| 4096 | 1024 | 36864 | 23.409 |   174.98 | 134.510 |     7.61 |
| 4096 | 1024 | 40960 | 24.477 |   167.34 | 138.683 |     7.38 |
| 4096 | 1024 | 45056 | 25.573 |   160.17 | 142.969 |     7.16 |
| 4096 | 1024 | 49152 | 26.641 |   153.75 | 147.557 |     6.94 |

ik_llama.cpp main@a3737f42 -sm graph

model=/mnt/raid/hf/GLM-4.6-GGUF/IQ5_K/GLM-4.6-IQ5_K-00001-of-00006.gguf
$ ./build/bin/llama-sweep-bench \
    --model "$model" \
    --ctx-size 53248 \
    -ger \
    -sm graph \
    -ngl 99 \
    -ot "blk\.(3|4|5|6|7|8|9|10|11)\.ffn_.*=CUDA0" \
    -ot "blk\.(12|13|14|15|16|17|18|19|20)\.ffn_.*=CUDA1" \
    --cpu-moe \
    -ub 4096 -b 4096 \
    --threads 24 \
    --no-mmap \
    --warmup-batch
| PP   | TG   | N_KV  | T_PP s | S_PP t/s | T_TG s  | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|--------:|---------:|
| 4096 | 1024 |     0 | 11.484 |   356.66 |  95.390 |    10.73 |
| 4096 | 1024 |  4096 | 11.959 |   342.50 |  97.919 |    10.46 |
| 4096 | 1024 |  8192 | 12.471 |   328.45 |  99.796 |    10.26 |
| 4096 | 1024 | 12288 | 13.016 |   314.70 | 102.225 |    10.02 |
| 4096 | 1024 | 16384 | 13.560 |   302.06 | 103.547 |     9.89 |
| 4096 | 1024 | 20480 | 14.064 |   291.25 | 105.741 |     9.68 |
| 4096 | 1024 | 24576 | 14.594 |   280.67 | 106.967 |     9.57 |
| 4096 | 1024 | 28672 | 15.144 |   270.47 | 109.300 |     9.37 |
| 4096 | 1024 | 32768 | 15.630 |   262.06 | 110.923 |     9.23 |
| 4096 | 1024 | 36864 | 16.191 |   252.97 | 112.633 |     9.09 |
| 4096 | 1024 | 40960 | 16.699 |   245.28 | 114.815 |     8.92 |
| 4096 | 1024 | 45056 | 17.248 |   237.48 | 116.806 |     8.77 |
| 4096 | 1024 | 49152 | 17.773 |   230.46 | 119.310 |     8.58 |

ik_llama.cpp main@912b8dcd --merge-qkv -cuda graphs=0

model=/mnt/raid/hf/GLM-4.6-GGUF/IQ5_K/GLM-4.6-IQ5_K-00001-of-00006.gguf
$ ./build/bin/llama-sweep-bench \
    --model "$model" \
    --ctx-size 53248 \
    -ger \
    --merge-qkv \
    -cuda graphs=0 \
    -ngl 99 \
    -ot "blk\.(3|4|5|6|7|8|9|10|11)\.ffn_.*=CUDA0" \
    -ot "blk\.(12|13|14|15|16|17|18|19|20)\.ffn_.*=CUDA1" \
    --cpu-moe \
    -ub 4096 -b 4096 \
    --threads 24 \
    --no-mmap \
    --warmup-batch

 =========================== ggml_backend_cuda_init: setting use_cuda_graph to 0
 =========================== ggml_backend_cuda_init: setting use_cuda_graph to 0
| PP   | TG   | N_KV  | T_PP s | S_PP t/s | T_TG s  | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|--------:|---------:|
| 4096 | 1024 |     0 | 13.915 |   294.35 |  92.326 |    11.09 |
| 4096 | 1024 |  4096 | 14.888 |   275.13 |  97.263 |    10.53 |
| 4096 | 1024 |  8192 | 15.897 |   257.66 | 101.856 |    10.05 |
| 4096 | 1024 | 12288 | 17.007 |   240.84 | 106.339 |     9.63 |
| 4096 | 1024 | 16384 | 18.063 |   226.76 | 110.962 |     9.23 |
| 4096 | 1024 | 20480 | 19.160 |   213.78 | 115.357 |     8.88 |
| 4096 | 1024 | 24576 | 20.181 |   202.97 | 120.070 |     8.53 |
| 4096 | 1024 | 28672 | 21.255 |   192.70 | 124.050 |     8.25 |
| 4096 | 1024 | 32768 | 22.312 |   183.58 | 128.810 |     7.95 |
| 4096 | 1024 | 36864 | 23.402 |   175.03 | 133.776 |     7.65 |
| 4096 | 1024 | 40960 | 24.482 |   167.31 | 138.454 |     7.40 |
| 4096 | 1024 | 45056 | 25.544 |   160.35 | 142.666 |     7.18 |
| 4096 | 1024 | 49152 | 26.630 |   153.81 | 147.041 |     6.96 |

ik_llama.cpp main@a3737f42 Kimi-K2-Thinking-smol-IQ2_KS

#!/usr/bin/env bash
model=/mnt/raid/hf/Kimi-K2-Thinking-GGUF/smol-IQ2_KS/Kimi-K2-Thinking-smol-IQ2_KS-00001-of-00006.gguf
$ ./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/Kimi-K2-Thinking-smol-IQ2_KS \
    -mla 3 -amb 512 \
    --ctx-size 65536 \
    -ctk q8_0 \
    -ngl 99 \
    -ot "blk\.(1|2|3|4|5|6|7)\.ffn_.*=CUDA0" \
    -ot "blk\.(8|9|10|11|12|13|14)\.ffn_.*=CUDA1" \
    --cpu-moe \
    -ub 4096 -b 4096 \
    --threads 24 \
    --no-mmap \
    --host 127.0.0.1 \
    --port 8080
| PP   | TG   | N_KV  | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|-------:|---------:|
| 4096 | 1024 |     0 | 13.494 |   303.54 | 56.963 |    17.98 |
| 4096 | 1024 |  4096 | 14.368 |   285.08 | 59.220 |    17.29 |
| 4096 | 1024 |  8192 | 15.328 |   267.22 | 60.872 |    16.82 |
| 4096 | 1024 | 12288 | 16.253 |   252.01 | 62.214 |    16.46 |
| 4096 | 1024 | 16384 | 17.316 |   236.55 | 64.172 |    15.96 |
| 4096 | 1024 | 20480 | 18.277 |   224.11 | 65.409 |    15.66 |
| 4096 | 1024 | 24576 | 19.172 |   213.64 | 67.083 |    15.26 |
| 4096 | 1024 | 28672 | 20.067 |   204.12 | 68.434 |    14.96 |
| 4096 | 1024 | 32768 | 21.449 |   190.97 | 70.045 |    14.62 |
| 4096 | 1024 | 36864 | 22.425 |   182.65 | 71.167 |    14.39 |
| 4096 | 1024 | 40960 | 23.414 |   174.94 | 72.679 |    14.09 |
| 4096 | 1024 | 45056 | 24.438 |   167.61 | 74.270 |    13.79 |
| 4096 | 1024 | 49152 | 25.275 |   162.06 | 76.089 |    13.46 |
| 4096 | 1024 | 53248 | 26.283 |   155.84 | 77.337 |    13.24 |
| 4096 | 1024 | 57344 | 27.351 |   149.76 | 78.800 |    12.99 |
| 4096 | 1024 | 61440 | 28.394 |   144.26 | 80.541 |    12.71 |
| 4096 | 1024 | 65536 | 29.623 |   138.27 | 81.933 |    12.50 |
