llama.cpp/ggml
Gaurav Garg 517b5ddbf0
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)
- Determine the number of active blocks per SM via the cudaOccupancyMaxActiveBlocksPerMultiprocessor API and use it to pick the optimal parallel_blocks value (see the sketch below).
- Prefer the vector flash attention kernels over the MMA kernel for BS=1

Fixes Issue: #12182
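
The occupancy-driven choice of parallel_blocks can be illustrated with a small CUDA sketch. This is not the actual ggml-cuda code: the kernel name flash_decode_vec_kernel, the helper pick_parallel_blocks, and its parameters are hypothetical stand-ins, shown only to demonstrate how cudaOccupancyMaxActiveBlocksPerMultiprocessor can feed into the split factor along the KV dimension.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>

// Hypothetical stand-in for the BS=1 vector flash-attention kernel;
// the real kernel in ggml/src/ggml-cuda has a different name and signature.
__global__ void flash_decode_vec_kernel(const float * Q, const float * K,
                                        const float * V, float * dst,
                                        int kv_len, int head_dim) {
    // attention math elided
}

// Pick how many blocks to split the KV sequence into ("parallel_blocks"),
// so the grid is large enough to keep every SM busy when the batch size
// is 1 and only a few blocks (roughly one per head) would otherwise run.
static int pick_parallel_blocks(int blocks_in_grid, int block_size,
                                size_t dyn_smem_bytes) {
    int device = 0;
    cudaGetDevice(&device);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // Ask the runtime how many blocks of this kernel one SM can hold.
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, flash_decode_vec_kernel, block_size, dyn_smem_bytes);

    // Total blocks the whole GPU can keep resident in a single wave.
    const int max_resident = blocks_per_sm * prop.multiProcessorCount;

    // Split the KV dimension just enough to fill that wave, but never
    // return less than 1.
    return std::max(1, max_resident / std::max(1, blocks_in_grid));
}

int main() {
    // Example: 32 attention heads -> 32 blocks in the base grid,
    // 128 threads per block, no dynamic shared memory.
    const int parallel_blocks = pick_parallel_blocks(/*blocks_in_grid=*/32,
                                                     /*block_size=*/128,
                                                     /*dyn_smem_bytes=*/0);
    std::printf("parallel_blocks = %d\n", parallel_blocks);
    return 0;
}
```

The point mirrors the commit description: query the achievable occupancy at launch time and derive parallel_blocks from it, rather than hard-coding a split factor.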
---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-19 20:52:06 +01:00
| Path | Last commit | Date |
|---|---|---|
| cmake | cmake : enable building llama.cpp using system libggml (#12321) | 2025-03-17 11:05:23 +02:00 |
| include | llama: Add support for RWKV v7 architecture (#12412) | 2025-03-18 07:27:50 +08:00 |
| src | CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183) | 2025-03-19 20:52:06 +01:00 |
| .gitignore | vulkan : cmake integration (#8119) | 2024-07-13 18:12:39 +02:00 |
| CMakeLists.txt | SYCL: using graphs is configurable by environment variable and compile option (#12371) | 2025-03-18 11:16:31 +01:00 |