Commit 465569d
Faster DeepSeek FA on CUDA (#408)
* New DeepSeek FlashMLA
Does not work because the RoPE portion is stored at the end
in our case, while in mainline it is stored at the beginning,
and the FA kernel assumes that.
* Rearrange MLA K cache so it first new CUDA FA implementation
* constexpr and minor changes
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>1 parent 8669c3d commit 465569d
File tree
4 files changed
+356
-130
lines changed- ggml/src
- ggml-cuda
- src
4 files changed
+356
-130
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
986 | 986 | | |
987 | 987 | | |
988 | 988 | | |
989 | | - | |
| 989 | + | |
990 | 990 | | |
991 | 991 | | |
992 | 992 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
2 | 2 | | |
3 | 3 | | |
4 | 4 | | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
5 | 15 | | |
6 | 16 | | |
7 | 17 | | |
| |||
0 commit comments