llama.cpp/ggml
Gaurav Garg 517b5ddbf0
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)
- Determine the number of active blocks per SM via the cudaOccupancyMaxActiveBlocksPerMultiprocessor API and use it to pick the optimal parallel_blocks value (see the sketch below).
- Prefer the vector flash attention kernels over the MMA kernel for BS=1

Fixes Issue: #12182
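
The occupancy-driven choice of parallel_blocks can be illustrated with a small CUDA sketch. This is not the actual ggml-cuda code: the kernel name flash_decode_vec_kernel, the helper pick_parallel_blocks, and its parameters are hypothetical stand-ins, shown only to demonstrate how cudaOccupancyMaxActiveBlocksPerMultiprocessor can feed into the split factor along the KV dimension.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>

// Hypothetical stand-in for the BS=1 vector flash-attention kernel;
// the real kernel in ggml/src/ggml-cuda has a different name and signature.
__global__ void flash_decode_vec_kernel(const float * Q, const float * K,
                                        const float * V, float * dst,
                                        int kv_len, int head_dim) {
    // attention math elided
}

// Pick how many blocks to split the KV sequence into ("parallel_blocks"),
// so the grid is large enough to keep every SM busy when the batch size
// is 1 and only a few blocks (roughly one per head) would otherwise run.
static int pick_parallel_blocks(int blocks_in_grid, int block_size,
                                size_t dyn_smem_bytes) {
    int device = 0;
    cudaGetDevice(&device);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // Ask the runtime how many blocks of this kernel one SM can hold.
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, flash_decode_vec_kernel, block_size, dyn_smem_bytes);

    // Total blocks the whole GPU can keep resident in a single wave.
    const int max_resident = blocks_per_sm * prop.multiProcessorCount;

    // Split the KV dimension just enough to fill that wave, but never
    // return less than 1.
    return std::max(1, max_resident / std::max(1, blocks_in_grid));
}

int main() {
    // Example: 32 attention heads -> 32 blocks in the base grid,
    // 128 threads per block, no dynamic shared memory.
    const int parallel_blocks = pick_parallel_blocks(/*blocks_in_grid=*/32,
                                                     /*block_size=*/128,
                                                     /*dyn_smem_bytes=*/0);
    std::printf("parallel_blocks = %d\n", parallel_blocks);
    return 0;
}
```

The point mirrors the commit description: query the achievable occupancy at launch time and derive parallel_blocks from it, rather than hard-coding a split factor.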
---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-19 20:52:06 +01:00
| Path | Last commit | Date |
|---|---|---|
| cmake | cmake : enable building llama.cpp using system libggml (#12321) | 2025-03-17 11:05:23 +02:00 |
| include | llama: Add support for RWKV v7 architecture (#12412) | 2025-03-18 07:27:50 +08:00 |
| src | CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183) | 2025-03-19 20:52:06 +01:00 |
| .gitignore | vulkan : cmake integration (#8119) | 2024-07-13 18:12:39 +02:00 |
| CMakeLists.txt | SYCL: using graphs is configurable by environment variable and compile option (#12371) | 2025-03-18 11:16:31 +01:00 |