llama.cpp/ggml/src/ggml-cuda
Latest commit: Gaurav Garg (c262beddf2)
CUDA: Prefer vector flash decoding kernel for Gemma models (#12738)
* Prefer vector flash decoding kernel for Gemma models

The vector flash decoding kernel was not being picked for models with head dimension 256, a category that includes the Gemma models.
Removing this limit improves end-to-end generation-phase throughput by up to 12% for Gemma models.

* Update ggml/src/ggml-cuda/fattn.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-04-03 18:20:29 +02:00
template-instances CUDA: optimize FA for GQA + large batches (#12014) 2025-02-22 12:20:17 +01:00
vendors HIP: Add support for RDNA4 targets (#12372) 2025-03-26 23:46:30 +01:00
acc.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
acc.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
arange.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
arange.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
argmax.cu cuda : optimize argmax (#10441) 2024-11-21 18:18:50 +01:00
argmax.cuh ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980) 2024-10-03 21:17:26 +03:00
argsort.cu ggml : reduce hash table reset cost (#8698) 2024-07-27 04:41:55 +02:00
argsort.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
binbcast.cu Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (ggml/1121) 2025-03-03 18:18:11 +02:00
binbcast.cuh ggml/examples: add backend support for numerical optimization (ggml/949) 2024-09-20 21:15:05 +03:00
clamp.cu cuda: unary ops as float + de-duplicate (ggml/1130) 2025-03-03 18:18:11 +02:00
clamp.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
CMakeLists.txt CUDA: compress mode option and default to size (#12029) 2025-03-01 12:57:22 +01:00
common.cuh Simplify and improve CUDA graphs through use of indirect copy pointers (#9017) 2025-04-03 03:31:15 +02:00
concat.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
concat.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
conv-transpose-1d.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
conv-transpose-1d.cuh feat: cuda implementation for ggml_conv_transpose_1d (ggml/854) 2024-07-08 12:23:00 +03:00
convert.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
convert.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
count-equal.cu ggml: fix zero division in ‘dne’ calculation in CUDA COUNT_EQUAL operator when ‘ne’ is small (#10213) 2024-11-09 08:35:46 +01:00
count-equal.cuh ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980) 2024-10-03 21:17:26 +03:00
cp-async.cuh CUDA: optimize FA for GQA + large batches (#12014) 2025-02-22 12:20:17 +01:00
cpy.cu Simplify and improve CUDA graphs through use of indirect copy pointers (#9017) 2025-04-03 03:31:15 +02:00
cpy.cuh Simplify and improve CUDA graphs through use of indirect copy pointers (#9017) 2025-04-03 03:31:15 +02:00
cross-entropy-loss.cu MUSA: support ARM64 and enable dp4a .etc (#11843) 2025-02-21 09:46:23 +02:00
cross-entropy-loss.cuh ggml/examples: add backend support for numerical optimization (ggml/949) 2024-09-20 21:15:05 +03:00
dequantize.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
diagmask.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
diagmask.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
fattn-common.cuh musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
fattn-mma-f16.cuh musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
fattn-tile-f16.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
fattn-tile-f16.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
fattn-tile-f32.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
fattn-tile-f32.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
fattn-vec-f16.cuh musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
fattn-vec-f32.cuh musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
fattn-wmma-f16.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
fattn-wmma-f16.cuh CUDA: use mma PTX instructions for FlashAttention (#11583) 2025-02-02 19:31:09 +01:00
fattn.cu CUDA: Prefer vector flash decoding kernel for Gemma models (#12738) 2025-04-03 18:20:29 +02:00
fattn.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
getrows.cu CUDA: backwards pass for misc. ops, add tests (#11257) 2025-01-16 16:43:38 +01:00
getrows.cuh CUDA: backwards pass for misc. ops, add tests (#11257) 2025-01-16 16:43:38 +01:00
ggml-cuda.cu Simplify and improve CUDA graphs through use of indirect copy pointers (#9017) 2025-04-03 03:31:15 +02:00
gla.cu llama: add support for QRWKV6 model architecture (#11001) 2025-01-10 09:58:08 +08:00
gla.cuh llama: add support for QRWKV6 model architecture (#11001) 2025-01-10 09:58:08 +08:00
im2col.cu CUDA: fix 1D im2col, add tests (ggml/993) 2024-10-23 16:50:02 +03:00
im2col.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
mma.cuh musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
mmq.cu HIP: Add support for RDNA4 targets (#12372) 2025-03-26 23:46:30 +01:00
mmq.cuh musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
mmv.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
mmv.cuh CUDA: remove DMMV, consolidate F16 mult mat vec (#10318) 2024-11-17 09:09:55 +01:00
mmvq.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
mmvq.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
norm.cu llama: Add support for RWKV v7 architecture (#12412) 2025-03-18 07:27:50 +08:00
norm.cuh llama: Add support for RWKV v7 architecture (#12412) 2025-03-18 07:27:50 +08:00
opt-step-adamw.cu ggml: new optimization interface (ggml/988) 2024-11-17 08:30:29 +02:00
opt-step-adamw.cuh ggml/examples: add backend support for numerical optimization (ggml/949) 2024-09-20 21:15:05 +03:00
out-prod.cu CPU/CUDA: fix (GQA) mul mat back, add CUDA support (#11380) 2025-01-24 12:38:31 +01:00
out-prod.cuh ggml/examples: add backend support for numerical optimization (ggml/949) 2024-09-20 21:15:05 +03:00
pad.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
pad.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
pool2d.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
pool2d.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
quantize.cu cuda : optimize argmax (#10441) 2024-11-21 18:18:50 +01:00
quantize.cuh CUDA: optimize and refactor MMQ (#8416) 2024-07-11 16:47:47 +02:00
rope.cu CUDA: backwards pass for misc. ops, add tests (#11257) 2025-01-16 16:43:38 +01:00
rope.cuh RoPE: fix back, CUDA support for back + noncont. (#11240) 2025-01-15 12:51:37 +01:00
scale.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
scale.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
softmax.cu HIP: fix flash_attn_stream_k_fixup warning (#11604) 2025-02-02 23:48:29 +01:00
softmax.cuh CUDA: backwards pass for misc. ops, add tests (#11257) 2025-01-16 16:43:38 +01:00
ssm-conv.cu fix MUSA compiler warning (#12704) 2025-04-03 09:32:55 +02:00
ssm-conv.cuh ggml : faster ssm scan (#10558) 2025-03-31 18:05:13 +02:00
ssm-scan.cu fix MUSA compiler warning (#12704) 2025-04-03 09:32:55 +02:00
ssm-scan.cuh ggml : faster ssm scan (#10558) 2025-03-31 18:05:13 +02:00
sum.cu CUDA: fix CUDART_VERSION checks (#11821) 2025-02-12 13:16:39 +01:00
sum.cuh tests: add gradient tests for all backends (ggml/932) 2024-09-08 11:05:55 +03:00
sumrows.cu sync : ggml 2024-08-27 22:41:27 +03:00
sumrows.cuh sync : ggml 2024-08-27 22:41:27 +03:00
tsembd.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
tsembd.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
unary.cu cuda: unary ops as float + de-duplicate (ggml/1130) 2025-03-03 18:18:11 +02:00
unary.cuh cuda/cpu: Increase support for fp16 unary operations (ggml/1125) 2025-03-03 18:18:11 +02:00
upscale.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
upscale.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
vecdotq.cuh CUDA: MMQ code deduplication + iquant support (#8495) 2024-07-20 22:25:26 +02:00
wkv.cu llama: Add support for RWKV v7 architecture (#12412) 2025-03-18 07:27:50 +08:00
wkv.cuh llama: Add support for RWKV v7 architecture (#12412) 2025-03-18 07:27:50 +08:00