Conversation

@ikawrakow
Owner

CUDA graphs cannot be used with split mode "graph". Capturing CUDA graphs does get disabled after a few failed attempts, but on at least one occasion we got an actual error (manifesting as a crash) while capturing a CUDA graph, so it is better to just disable them a priori.

While at it, I also added the ability to disable CUDA graphs via a command line argument

-cuda graphs=0

Just in case someone wants to see the performance impact of CUDA graphs without rebuilding or fooling around with environment variables.
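
For anyone curious what the change amounts to conceptually, here is a minimal, hypothetical C++ sketch of the decision (SplitMode, CudaOptions, and should_use_cuda_graphs are illustrative names, not the actual ik_llama.cpp symbols): CUDA graph capture is only attempted when the split mode is compatible and the user has not passed -cuda graphs=0.

```cpp
// Hypothetical sketch of the decision logic; not the actual ik_llama.cpp code.
#include <cstdio>

enum class SplitMode { None, Layer, Row, Graph };

struct CudaOptions {
    bool graphs = true;  // corresponds to "-cuda graphs=1" (the default)
};

// CUDA graph capture is attempted only when (a) the split mode is compatible
// and (b) the user has not disabled it on the command line.
static bool should_use_cuda_graphs(SplitMode sm, const CudaOptions & opts) {
    if (sm == SplitMode::Graph) {
        // split mode "graph" cannot be captured reliably, so disable
        // a priori instead of failing (or crashing) at runtime
        return false;
    }
    return opts.graphs;
}

int main() {
    CudaOptions opts;
    opts.graphs = false;  // as if "-cuda graphs=0" had been passed
    std::printf("use_cuda_graph = %d\n",
                should_use_cuda_graphs(SplitMode::Layer, opts) ? 1 : 0);
    return 0;
}
```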

@ubergarm
Contributor

ubergarm commented Dec 5, 2025

> but on at least one occasion we got an actual error (manifesting as a crash) while capturing a CUDA graph

I know the -sm graph implementation is mainly aimed at the full GPU offload case, especially with 2x GPUs, so far. But while working with Geechan over on the BeaverAI Discord to test -sm graph on their rig, they keep getting an error, even with this PR.

I wanted to at least relay the details from the Discord in case anyone else bumps into it as well.

They are using 2x RTX 8000 (Turing, sm75) on a Debian Linux system and updated from the older 550 driver / CUDA 12.4 to the newer driver 580.105 / CUDA 13.0, but still no luck.

$ llama-server --version
version: 4047 (912b8dcd)
built with cc (GCC) 15.2.1 20251112 for x86_64-pc-linux-gnu

$ llama-sweep-bench \
    --n-gpu-layers 999 --threads 64 --threads-batch 64 --batch-size 4096 --ubatch-size 4096 --no-mmap \
    --override-tensor "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15)\.ffn_.*=CUDA0" \
    --override-tensor "blk\.(16|17|18|19|20|21|22|23|24|25|26|27|28)\.ffn_.*=CUDA1" \
    --override-tensor "blk\..*_exps\.=CPU" \
    --ctx-size 33000 -fa on -sm graph -cuda graphs=0 --host 0.0.0.0 \
    --model "/mnt/GLM/4.6/Q8-Q4-Q4-Q5/GLM-4.6-Q8_0-Q4_K-Q4_K-Q5_K-00001-of-00005.gguf" --alias GLM-Q8-Q4-Q4-Q5 \
    --chat-template chatglm4

CUDA error: an illegal memory access was encountered
  current device: 1, in function ggml_backend_cuda_synchronize at /mnt/builds/ik-llama.cpp-cuda/src/ik_llama.cpp/ggml/src/ggml-cuda.cu:3511
  cudaStreamSynchronize(cuda_ctx->stream())
/mnt/builds/ik-llama.cpp-cuda/src/ik_llama.cpp/ggml/src/ggml-cuda.cu:124: CUDA error

I'm still doing some of my own testing with GLM-4.6 hybrid CPU + 2x GPU, and -sm graph seems to be working, including with -ger, giving maybe 20% better PP. I'll update about that later. Thanks!

@ubergarm
Contributor

ubergarm commented Dec 6, 2025

As I too was curious how much benefit one might see from hybrid CPU + 2x GPU with -sm graph, I ran a comparison with GLM-4.6-IQ5_K on the Threadripper Pro 24-core + 2x RTX A6000. Turns out -sm graph wins again!

For fun I also added Kimi-K2-Thinking-smol-IQ2_KS for comparison. It makes the graph harder to read, but it's fun. (I forget if I actually used --merge-qkv on this or not, but honestly it is just a rough comparison since the models are different sizes.) Kimi-K2-Thinking has quite a bit faster TG even without -sm graph support yet. It is also bigger, but I didn't estimate the relative active parameter sizes.

  • IQ5_K: 249.099 GiB (5.997 BPW) (some of this is unused MTP tensors, maybe ~5 GiB)
  • smol-IQ2_KS: 270.133 GiB (2.261 BPW)

Finally, I also tried -cuda graphs=0, which made no measurable difference compared to leaving CUDA graphs on.
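
As a quick sanity check of the "maybe 20% better PP" observation, using the first rows of the two GLM-4.6 tables below: at N_KV = 0 the baseline run does 293.79 t/s PP while the -sm graph run does 356.66 t/s, i.e. roughly a 21% gain (and the gap widens at deeper context). The snippet below is just that arithmetic on numbers copied from the tables, nothing more.

```cpp
// Relative PP speedup from the first sweep-bench rows below
// (numbers copied from the tables; purely illustrative arithmetic).
#include <cstdio>

int main() {
    const double pp_base  = 293.79;  // S_PP t/s, --merge-qkv baseline, N_KV = 0
    const double pp_graph = 356.66;  // S_PP t/s, -sm graph,            N_KV = 0
    std::printf("PP speedup: %.1f%%\n", (pp_graph / pp_base - 1.0) * 100.0);  // ~21.4%
    return 0;
}
```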

[plot: sweep-bench-GLM-4.6-IQ5_K]
👈 Details

ik_llama.cpp main@a3737f42 --merge-qkv

model=/mnt/raid/hf/GLM-4.6-GGUF/IQ5_K/GLM-4.6-IQ5_K-00001-of-00006.gguf
$ ./build/bin/llama-sweep-bench \
    --model "$model" \
    --ctx-size 53248 \
    -ger \
    --merge-qkv \
    -ngl 99 \
    -ot "blk\.(3|4|5|6|7|8|9|10|11)\.ffn_.*=CUDA0" \
    -ot "blk\.(12|13|14|15|16|17|18|19|20)\.ffn_.*=CUDA1" \
    --cpu-moe \
    -ub 4096 -b 4096 \
    --threads 24 \
    --no-mmap \
    --warmup-batch
| PP   | TG   | N_KV  | T_PP s | S_PP t/s | T_TG s  | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|--------:|---------:|
| 4096 | 1024 |     0 | 13.942 |   293.79 |  92.893 |    11.02 |
| 4096 | 1024 |  4096 | 14.872 |   275.41 |  97.536 |    10.50 |
| 4096 | 1024 |  8192 | 15.899 |   257.62 | 102.288 |    10.01 |
| 4096 | 1024 | 12288 | 17.000 |   240.94 | 107.196 |     9.55 |
| 4096 | 1024 | 16384 | 18.095 |   226.36 | 111.520 |     9.18 |
| 4096 | 1024 | 20480 | 19.159 |   213.79 | 115.670 |     8.85 |
| 4096 | 1024 | 24576 | 20.198 |   202.79 | 120.120 |     8.52 |
| 4096 | 1024 | 28672 | 21.276 |   192.52 | 124.717 |     8.21 |
| 4096 | 1024 | 32768 | 22.340 |   183.35 | 129.605 |     7.90 |
| 4096 | 1024 | 36864 | 23.409 |   174.98 | 134.510 |     7.61 |
| 4096 | 1024 | 40960 | 24.477 |   167.34 | 138.683 |     7.38 |
| 4096 | 1024 | 45056 | 25.573 |   160.17 | 142.969 |     7.16 |
| 4096 | 1024 | 49152 | 26.641 |   153.75 | 147.557 |     6.94 |

ik_llama.cpp main@a3737f42 -sm graph

model=/mnt/raid/hf/GLM-4.6-GGUF/IQ5_K/GLM-4.6-IQ5_K-00001-of-00006.gguf
$ ./build/bin/llama-sweep-bench \
    --model "$model" \
    --ctx-size 53248 \
    -ger \
    -sm graph \
    -ngl 99 \
    -ot "blk\.(3|4|5|6|7|8|9|10|11)\.ffn_.*=CUDA0" \
    -ot "blk\.(12|13|14|15|16|17|18|19|20)\.ffn_.*=CUDA1" \
    --cpu-moe \
    -ub 4096 -b 4096 \
    --threads 24 \
    --no-mmap \
    --warmup-batch
| PP   | TG   | N_KV  | T_PP s | S_PP t/s | T_TG s  | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|--------:|---------:|
| 4096 | 1024 |     0 | 11.484 |   356.66 |  95.390 |    10.73 |
| 4096 | 1024 |  4096 | 11.959 |   342.50 |  97.919 |    10.46 |
| 4096 | 1024 |  8192 | 12.471 |   328.45 |  99.796 |    10.26 |
| 4096 | 1024 | 12288 | 13.016 |   314.70 | 102.225 |    10.02 |
| 4096 | 1024 | 16384 | 13.560 |   302.06 | 103.547 |     9.89 |
| 4096 | 1024 | 20480 | 14.064 |   291.25 | 105.741 |     9.68 |
| 4096 | 1024 | 24576 | 14.594 |   280.67 | 106.967 |     9.57 |
| 4096 | 1024 | 28672 | 15.144 |   270.47 | 109.300 |     9.37 |
| 4096 | 1024 | 32768 | 15.630 |   262.06 | 110.923 |     9.23 |
| 4096 | 1024 | 36864 | 16.191 |   252.97 | 112.633 |     9.09 |
| 4096 | 1024 | 40960 | 16.699 |   245.28 | 114.815 |     8.92 |
| 4096 | 1024 | 45056 | 17.248 |   237.48 | 116.806 |     8.77 |
| 4096 | 1024 | 49152 | 17.773 |   230.46 | 119.310 |     8.58 |

ik_llama.cpp main@912b8dcd --merge-qkv -cuda graphs=0

model=/mnt/raid/hf/GLM-4.6-GGUF/IQ5_K/GLM-4.6-IQ5_K-00001-of-00006.gguf
$ ./build/bin/llama-sweep-bench \
    --model "$model" \
    --ctx-size 53248 \
    -ger \
    --merge-qkv \
    -cuda graphs=0 \
    -ngl 99 \
    -ot "blk\.(3|4|5|6|7|8|9|10|11)\.ffn_.*=CUDA0" \
    -ot "blk\.(12|13|14|15|16|17|18|19|20)\.ffn_.*=CUDA1" \
    --cpu-moe \
    -ub 4096 -b 4096 \
    --threads 24 \
    --no-mmap \
    --warmup-batch

 =========================== ggml_backend_cuda_init: setting use_cuda_graph to 0
 =========================== ggml_backend_cuda_init: setting use_cuda_graph to 0
| PP   | TG   | N_KV  | T_PP s | S_PP t/s | T_TG s  | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|--------:|---------:|
| 4096 | 1024 |     0 | 13.915 |   294.35 |  92.326 |    11.09 |
| 4096 | 1024 |  4096 | 14.888 |   275.13 |  97.263 |    10.53 |
| 4096 | 1024 |  8192 | 15.897 |   257.66 | 101.856 |    10.05 |
| 4096 | 1024 | 12288 | 17.007 |   240.84 | 106.339 |     9.63 |
| 4096 | 1024 | 16384 | 18.063 |   226.76 | 110.962 |     9.23 |
| 4096 | 1024 | 20480 | 19.160 |   213.78 | 115.357 |     8.88 |
| 4096 | 1024 | 24576 | 20.181 |   202.97 | 120.070 |     8.53 |
| 4096 | 1024 | 28672 | 21.255 |   192.70 | 124.050 |     8.25 |
| 4096 | 1024 | 32768 | 22.312 |   183.58 | 128.810 |     7.95 |
| 4096 | 1024 | 36864 | 23.402 |   175.03 | 133.776 |     7.65 |
| 4096 | 1024 | 40960 | 24.482 |   167.31 | 138.454 |     7.40 |
| 4096 | 1024 | 45056 | 25.544 |   160.35 | 142.666 |     7.18 |
| 4096 | 1024 | 49152 | 26.630 |   153.81 | 147.041 |     6.96 |

ik_llama.cpp main@a3737f42 Kimi-K2-Thinking-smol-IQ2_KS

#!/usr/bin/env bash
model=/mnt/raid/hf/Kimi-K2-Thinking-GGUF/smol-IQ2_KS/Kimi-K2-Thinking-smol-IQ2_KS-00001-of-00006.gguf
$ ./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/Kimi-K2-Thinking-smol-IQ2_KS \
    -mla 3 -amb 512 \
    --ctx-size 65536 \
    -ctk q8_0 \
    -ngl 99 \
    -ot "blk\.(1|2|3|4|5|6|7)\.ffn_.*=CUDA0" \
    -ot "blk\.(8|9|10|11|12|13|14)\.ffn_.*=CUDA1" \
    --cpu-moe \
    -ub 4096 -b 4096 \
    --threads 24 \
    --no-mmap \
    --host 127.0.0.1 \
    --port 8080
| PP   | TG   | N_KV  | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|-------:|---------:|
| 4096 | 1024 |     0 | 13.494 |   303.54 | 56.963 |    17.98 |
| 4096 | 1024 |  4096 | 14.368 |   285.08 | 59.220 |    17.29 |
| 4096 | 1024 |  8192 | 15.328 |   267.22 | 60.872 |    16.82 |
| 4096 | 1024 | 12288 | 16.253 |   252.01 | 62.214 |    16.46 |
| 4096 | 1024 | 16384 | 17.316 |   236.55 | 64.172 |    15.96 |
| 4096 | 1024 | 20480 | 18.277 |   224.11 | 65.409 |    15.66 |
| 4096 | 1024 | 24576 | 19.172 |   213.64 | 67.083 |    15.26 |
| 4096 | 1024 | 28672 | 20.067 |   204.12 | 68.434 |    14.96 |
| 4096 | 1024 | 32768 | 21.449 |   190.97 | 70.045 |    14.62 |
| 4096 | 1024 | 36864 | 22.425 |   182.65 | 71.167 |    14.39 |
| 4096 | 1024 | 40960 | 23.414 |   174.94 | 72.679 |    14.09 |
| 4096 | 1024 | 45056 | 24.438 |   167.61 | 74.270 |    13.79 |
| 4096 | 1024 | 49152 | 25.275 |   162.06 | 76.089 |    13.46 |
| 4096 | 1024 | 53248 | 26.283 |   155.84 | 77.337 |    13.24 |
| 4096 | 1024 | 57344 | 27.351 |   149.76 | 78.800 |    12.99 |
| 4096 | 1024 | 61440 | 28.394 |   144.26 | 80.541 |    12.71 |
| 4096 | 1024 | 65536 | 29.623 |   138.27 | 81.933 |    12.50 |
