llama.cpp/ggml/src/ggml-vulkan
Jeff Bolz dc1d2adfc0
vulkan: scalar flash attention implementation (#13324)
* vulkan: scalar flash attention implementation

* vulkan: always use fp32 for scalar flash attention

* vulkan: use vector loads in scalar flash attention shader

* vulkan: remove PV matrix, helps with register usage

* vulkan: reduce register usage in scalar FA, but perf may be slightly worse

* vulkan: load each Q value once. optimize O reduction. more tuning

* vulkan: support q4_0/q8_0 KV in scalar FA

* CI: increase timeout to accommodate newly-supported tests

* vulkan: for scalar FA, select between 1 and 8 rows

* vulkan: avoid using Float16 capability in scalar FA
2025-05-10 08:07:07 +02:00
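For orientation, the idea behind a scalar flash attention path can be sketched on the CPU. This is not the shader from the commit. It is a minimal illustration, assuming a single query row, no mask, and fp32 math throughout, of the streaming ("online softmax") accumulation that flash attention relies on. The function name `flash_attn_row` and its parameters are hypothetical and do not come from ggml.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not part of ggml): attention output for one query row,
// computed in a single streaming pass over the keys and values.
// k and v are row-major [n_kv x head_dim]; all arithmetic is fp32.
std::vector<float> flash_attn_row(const std::vector<float>& q,
                                  const std::vector<float>& k,
                                  const std::vector<float>& v,
                                  std::size_t n_kv, std::size_t head_dim,
                                  float scale) {
    assert(n_kv > 0);
    assert(q.size() == head_dim);
    assert(k.size() == n_kv * head_dim && v.size() == n_kv * head_dim);

    float m = -std::numeric_limits<float>::infinity(); // running max of the scores
    float l = 0.0f;                                     // running softmax denominator
    std::vector<float> o(head_dim, 0.0f);               // unnormalized output accumulator

    for (std::size_t j = 0; j < n_kv; ++j) {
        // Scalar dot product q . k_j (no matrix hardware involved).
        float s = 0.0f;
        for (std::size_t d = 0; d < head_dim; ++d) {
            s += q[d] * k[j * head_dim + d];
        }
        s *= scale;

        // Rescale what has been accumulated so far to the new running max.
        const float m_new = std::max(m, s);
        const float corr  = std::exp(m - m_new); // 0 on the first iteration
        const float p     = std::exp(s - m_new);

        l = l * corr + p;
        for (std::size_t d = 0; d < head_dim; ++d) {
            o[d] = o[d] * corr + p * v[j * head_dim + d];
        }
        m = m_new;
    }

    // Final normalization by the softmax denominator.
    for (std::size_t d = 0; d < head_dim; ++d) {
        o[d] /= l;
    }
    return o;
}
```

Calling `flash_attn_row(q, k, v, n_kv, head_dim, scale)` gives the same result as computing softmax(scale * q K^T) V for that row in two passes, up to fp32 rounding, without ever storing the full score vector.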
cmake            cmake: fix ggml-shaders-gen compiler paths containing spaces (#12747)  2025-04-04 10:12:40 -03:00
vulkan-shaders   vulkan: scalar flash attention implementation (#13324)                 2025-05-10 08:07:07 +02:00
CMakeLists.txt   vulkan: Add bfloat16 support (#12554)                                  2025-05-01 20:49:39 +02:00
ggml-vulkan.cpp  vulkan: scalar flash attention implementation (#13324)                 2025-05-10 08:07:07 +02:00