llama.cpp/ggml/src/ggml-cuda/vendors
Gaurav Garg 517b5ddbf0
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)
- Determine the number of active blocks per SM using the cudaOccupancyMaxActiveBlocksPerMultiprocessor API, and use this value to pick the optimal parallel_blocks value (see the sketch after this list).
- Prefer the vector flash attention kernels over the MMA kernel for BS=1
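A minimal sketch of how an occupancy query can drive the parallel_blocks choice, assuming a hypothetical flash_decode_vec kernel and choose_parallel_blocks helper (neither name is from the commit; the real logic lives in ggml-cuda's flash attention code). cudaOccupancyMaxActiveBlocksPerMultiprocessor and cudaDeviceGetAttribute are real CUDA runtime calls:

```cpp
#include <cuda_runtime.h>

// Hypothetical stand-in for the vector flash attention decoding kernel.
__global__ void flash_decode_vec(const float * Q, const float * K,
                                 const float * V, float * dst) {
    // ... attention math elided ...
}

// Pick parallel_blocks so the grid saturates the GPU: query how many
// blocks of this kernel fit per SM, multiply by the SM count, and
// divide by the number of blocks the launch already produces
// (e.g. heads * sequences, which is small for BS=1 decoding).
static int choose_parallel_blocks(int blocks_per_grid_base,
                                  int block_size, size_t smem_bytes) {
    int device = 0;
    cudaGetDevice(&device);

    int num_sms = 0;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);

    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, flash_decode_vec, block_size, smem_bytes);

    // Total blocks the device can keep resident at once.
    const int max_resident = blocks_per_sm * num_sms;

    // Split the KV dimension across enough extra blocks to fill the
    // device, but never fewer than one.
    const int parallel_blocks = max_resident / blocks_per_grid_base;
    return parallel_blocks > 1 ? parallel_blocks : 1;
}
```

The idea behind this flash-decoding style split: with BS=1 the batch and head dimensions alone yield too few blocks to occupy all SMs, so the KV sequence is partitioned across parallel_blocks partial results that are combined in a later reduction pass.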

Fixes Issue: #12182
---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-19 20:52:06 +01:00
File    Last commit                                                            Date
cuda.h  CUDA: add BF16 support (#11093)                                        2025-01-06 02:33:52 +01:00
hip.h   CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)  2025-03-19 20:52:06 +01:00
musa.h  CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)  2025-03-19 20:52:06 +01:00