Automatically disable CUDA graphs for split mode "graph" #1040
Conversation
Still doing some of my own testing with GLM-4.6 hybrid CPU + 2x GPUs, but I wanted to at least relay the details from the Discord in case anyone else bumps into this as well. They are using 2x RTX 8000 (Turing, sm75) and updated from the older 550 driver / CUDA 12.4 stack to NVIDIA 580.105 / CUDA 13.0, but still no luck. It is a Debian Linux system.

```
$ llama-server --version
version: 4047 (912b8dcd)
built with cc (GCC) 15.2.1 20251112 for x86_64-pc-linux-gnu
```
```
$ llama-sweep-bench \
    --n-gpu-layers 999 --threads 64 --threads-batch 64 --batch-size 4096 --ubatch-size 4096 --no-mmap \
    --override-tensor "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15)\.ffn_.*=CUDA0" \
    --override-tensor "blk\.(16|17|18|19|20|21|22|23|24|25|26|27|28)\.ffn_.*=CUDA1" \
    --override-tensor "blk\..*_exps\.=CPU" \
    --ctx-size 33000 -fa on -sm graph -cuda graphs=0 --host 0.0.0.0 \
    --model "/mnt/GLM/4.6/Q8-Q4-Q4-Q5/GLM-4.6-Q8_0-Q4_K-Q4_K-Q5_K-00001-of-00005.gguf" --alias GLM-Q8-Q4-Q4-Q5 \
    --chat-template chatglm4
```
It crashes with:

```
CUDA error: an illegal memory access was encountered
  current device: 1, in function ggml_backend_cuda_synchronize at /mnt/builds/ik-llama.cpp-cuda/src/ik_llama.cpp/ggml/src/ggml-cuda.cu:3511
  cudaStreamSynchronize(cuda_ctx->stream())
/mnt/builds/ik-llama.cpp-cuda/src/ik_llama.cpp/ggml/src/ggml-cuda.cu:124: CUDA error
```

CUDA graphs cannot be used with split mode "graph". CUDA graph capture does get disabled automatically after a few failed attempts, but on at least one occasion we hit an actual error (manifesting as a crash) while capturing a CUDA graph, so it is better to disable it a priori.

While at it, I also added the ability to disable CUDA graphs via a command-line argument, in case someone wants to see the performance impact of CUDA graphs without rebuilding or fiddling with environment variables.