llama.cpp/ggml/src/ggml-cuda/vendors
Gaurav Garg 517b5ddbf0
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)
- Determine the number of active blocks per SM using the cudaOccupancyMaxActiveBlocksPerMultiprocessor API, and use this value to pick the optimal parallel_blocks value (see the sketch after this list).
- Prefer the vector flash attention kernels over the MMA kernel for BS=1
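A minimal sketch of how an occupancy query can drive the parallel_blocks choice, assuming a hypothetical flash_decode_vec kernel and choose_parallel_blocks helper (neither name is from the commit; the real logic lives in ggml-cuda's flash attention code). cudaOccupancyMaxActiveBlocksPerMultiprocessor and cudaDeviceGetAttribute are real CUDA runtime calls:

```cpp
#include <cuda_runtime.h>

// Hypothetical stand-in for the vector flash attention decoding kernel.
__global__ void flash_decode_vec(const float * Q, const float * K,
                                 const float * V, float * dst) {
    // ... attention math elided ...
}

// Pick parallel_blocks so the grid saturates the GPU: query how many
// blocks of this kernel fit per SM, multiply by the SM count, and
// divide by the number of blocks the launch already produces
// (e.g. heads * sequences, which is small for BS=1 decoding).
static int choose_parallel_blocks(int blocks_per_grid_base,
                                  int block_size, size_t smem_bytes) {
    int device = 0;
    cudaGetDevice(&device);

    int num_sms = 0;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);

    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, flash_decode_vec, block_size, smem_bytes);

    // Total blocks the device can keep resident at once.
    const int max_resident = blocks_per_sm * num_sms;

    // Split the KV dimension across enough extra blocks to fill the
    // device, but never fewer than one.
    const int parallel_blocks = max_resident / blocks_per_grid_base;
    return parallel_blocks > 1 ? parallel_blocks : 1;
}
```

The idea behind this flash-decoding style split: with BS=1 the batch and head dimensions alone yield too few blocks to occupy all SMs, so the KV sequence is partitioned across parallel_blocks partial results that are combined in a later reduction pass.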

Fixes Issue: #12182
---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-19 20:52:06 +01:00
File    Last commit                                                            Date
cuda.h  CUDA: add BF16 support (#11093)                                        2025-01-06 02:33:52 +01:00
hip.h   CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)  2025-03-19 20:52:06 +01:00
musa.h  CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)  2025-03-19 20:52:06 +01:00