llama.cpp/ggml/src/ggml-cuda
Latest commit: Gaurav Garg (c262beddf2)
CUDA: Prefer vector flash decoding kernel for Gemma models (#12738)
* Prefer vector flash decoding kernel for Gemma models

The vector flash decoding kernel was not being picked for models with head dimension 256, a category that includes the Gemma models.
Removing this limit improves end-to-end generation-phase throughput by up to 12% for Gemma models.

* Update ggml/src/ggml-cuda/fattn.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-04-03 18:20:29 +02:00
template-instances CUDA: optimize FA for GQA + large batches (#12014) 2025-02-22 12:20:17 +01:00
vendors HIP: Add support for RDNA4 targets (#12372) 2025-03-26 23:46:30 +01:00
acc.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
acc.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
arange.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
arange.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
argmax.cu cuda : optimize argmax (#10441) 2024-11-21 18:18:50 +01:00
argmax.cuh ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980) 2024-10-03 21:17:26 +03:00
argsort.cu ggml : reduce hash table reset cost (#8698) 2024-07-27 04:41:55 +02:00
argsort.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
binbcast.cu Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (ggml/1121) 2025-03-03 18:18:11 +02:00
binbcast.cuh ggml/examples: add backend support for numerical optimization (ggml/949) 2024-09-20 21:15:05 +03:00
clamp.cu cuda: unary ops as float + de-duplicate (ggml/1130) 2025-03-03 18:18:11 +02:00
clamp.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
CMakeLists.txt CUDA: compress mode option and default to size (#12029) 2025-03-01 12:57:22 +01:00
common.cuh Simplify and improve CUDA graphs through use of indirect copy pointers (#9017) 2025-04-03 03:31:15 +02:00
concat.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
concat.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
conv-transpose-1d.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
conv-transpose-1d.cuh feat: cuda implementation for ggml_conv_transpose_1d (ggml/854) 2024-07-08 12:23:00 +03:00
convert.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
convert.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
count-equal.cu ggml: fix zero division in ‘dne’ calculation in CUDA COUNT_EQUAL operator when ‘ne’ is small (#10213) 2024-11-09 08:35:46 +01:00
count-equal.cuh ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980) 2024-10-03 21:17:26 +03:00
cp-async.cuh CUDA: optimize FA for GQA + large batches (#12014) 2025-02-22 12:20:17 +01:00
cpy.cu Simplify and improve CUDA graphs through use of indirect copy pointers (#9017) 2025-04-03 03:31:15 +02:00
cpy.cuh Simplify and improve CUDA graphs through use of indirect copy pointers (#9017) 2025-04-03 03:31:15 +02:00
cross-entropy-loss.cu MUSA: support ARM64 and enable dp4a .etc (#11843) 2025-02-21 09:46:23 +02:00
cross-entropy-loss.cuh ggml/examples: add backend support for numerical optimization (ggml/949) 2024-09-20 21:15:05 +03:00
dequantize.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
diagmask.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
diagmask.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
fattn-common.cuh musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
fattn-mma-f16.cuh musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
fattn-tile-f16.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
fattn-tile-f16.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
fattn-tile-f32.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
fattn-tile-f32.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
fattn-vec-f16.cuh musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
fattn-vec-f32.cuh musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
fattn-wmma-f16.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
fattn-wmma-f16.cuh CUDA: use mma PTX instructions for FlashAttention (#11583) 2025-02-02 19:31:09 +01:00
fattn.cu CUDA: Prefer vector flash decoding kernel for Gemma models (#12738) 2025-04-03 18:20:29 +02:00
fattn.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
getrows.cu CUDA: backwards pass for misc. ops, add tests (#11257) 2025-01-16 16:43:38 +01:00
getrows.cuh CUDA: backwards pass for misc. ops, add tests (#11257) 2025-01-16 16:43:38 +01:00
ggml-cuda.cu Simplify and improve CUDA graphs through use of indirect copy pointers (#9017) 2025-04-03 03:31:15 +02:00
gla.cu llama: add support for QRWKV6 model architecture (#11001) 2025-01-10 09:58:08 +08:00
gla.cuh llama: add support for QRWKV6 model architecture (#11001) 2025-01-10 09:58:08 +08:00
im2col.cu CUDA: fix 1D im2col, add tests (ggml/993) 2024-10-23 16:50:02 +03:00
im2col.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
mma.cuh musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
mmq.cu HIP: Add support for RDNA4 targets (#12372) 2025-03-26 23:46:30 +01:00
mmq.cuh musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
mmv.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
mmv.cuh CUDA: remove DMMV, consolidate F16 mult mat vec (#10318) 2024-11-17 09:09:55 +01:00
mmvq.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
mmvq.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
norm.cu llama: Add support for RWKV v7 architecture (#12412) 2025-03-18 07:27:50 +08:00
norm.cuh llama: Add support for RWKV v7 architecture (#12412) 2025-03-18 07:27:50 +08:00
opt-step-adamw.cu ggml: new optimization interface (ggml/988) 2024-11-17 08:30:29 +02:00
opt-step-adamw.cuh ggml/examples: add backend support for numerical optimization (ggml/949) 2024-09-20 21:15:05 +03:00
out-prod.cu CPU/CUDA: fix (GQA) mul mat back, add CUDA support (#11380) 2025-01-24 12:38:31 +01:00
out-prod.cuh ggml/examples: add backend support for numerical optimization (ggml/949) 2024-09-20 21:15:05 +03:00
pad.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
pad.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
pool2d.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
pool2d.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
quantize.cu cuda : optimize argmax (#10441) 2024-11-21 18:18:50 +01:00
quantize.cuh CUDA: optimize and refactor MMQ (#8416) 2024-07-11 16:47:47 +02:00
rope.cu CUDA: backwards pass for misc. ops, add tests (#11257) 2025-01-16 16:43:38 +01:00
rope.cuh RoPE: fix back, CUDA support for back + noncont. (#11240) 2025-01-15 12:51:37 +01:00
scale.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
scale.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
softmax.cu HIP: fix flash_attn_stream_k_fixup warning (#11604) 2025-02-02 23:48:29 +01:00
softmax.cuh CUDA: backwards pass for misc. ops, add tests (#11257) 2025-01-16 16:43:38 +01:00
ssm-conv.cu fix MUSA compiler warning (#12704) 2025-04-03 09:32:55 +02:00
ssm-conv.cuh ggml : faster ssm scan (#10558) 2025-03-31 18:05:13 +02:00
ssm-scan.cu fix MUSA compiler warning (#12704) 2025-04-03 09:32:55 +02:00
ssm-scan.cuh ggml : faster ssm scan (#10558) 2025-03-31 18:05:13 +02:00
sum.cu CUDA: fix CUDART_VERSION checks (#11821) 2025-02-12 13:16:39 +01:00
sum.cuh tests: add gradient tests for all backends (ggml/932) 2024-09-08 11:05:55 +03:00
sumrows.cu sync : ggml 2024-08-27 22:41:27 +03:00
sumrows.cuh sync : ggml 2024-08-27 22:41:27 +03:00
tsembd.cu llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
tsembd.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
unary.cu cuda: unary ops as float + de-duplicate (ggml/1130) 2025-03-03 18:18:11 +02:00
unary.cuh cuda/cpu: Increase support for fp16 unary operations (ggml/1125) 2025-03-03 18:18:11 +02:00
upscale.cu musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
upscale.cuh llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
vecdotq.cuh CUDA: MMQ code deduplication + iquant support (#8495) 2024-07-20 22:25:26 +02:00
wkv.cu llama: Add support for RWKV v7 architecture (#12412) 2025-03-18 07:27:50 +08:00
wkv.cuh llama: Add support for RWKV v7 architecture (#12412) 2025-03-18 07:27:50 +08:00