llama.cpp/ggml
Jeff Bolz c446b2edd2
vulkan: Submit once enough matmul work has been recorded (#12406)
I've been seeing significantly worse token generation (tg) performance with flash
attention enabled than with it disabled, and it seems to be related to the submit heuristic.
Change the heuristic to track how many bytes' worth of weight matrices have been
recorded, submit every 100 MB, and ramp up to that threshold over the first few submits.
This seems to resolve the issue, and also improves non-FA performance a bit.
2025-03-19 08:26:26 +01:00
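As a rough illustration of the heuristic the commit describes, here is a minimal C++ sketch. All names here (submit_state, should_submit, the exact ramp-up schedule) are hypothetical and not taken from the actual ggml Vulkan backend; only the 100 MB threshold and the idea of ramping up over the first few submits come from the commit message.

```cpp
// Minimal sketch of a byte-based submit heuristic (hypothetical; the real
// logic lives in ggml's Vulkan backend and differs in detail).
#include <cstddef>

struct submit_state {
    size_t bytes_since_submit = 0; // weight-matrix bytes recorded since last submit
    int    submit_count       = 0; // how many submits have happened so far
};

// Threshold from the commit message: submit roughly every 100 MB of weights.
static constexpr size_t SUBMIT_THRESHOLD = 100u * 1024u * 1024u;

// "Ramp up after the first few submits": use a smaller threshold early on so
// the GPU starts working sooner, then grow toward the full threshold.
// The 1/8 -> 1/4 -> 1/2 -> full schedule is an assumption for illustration.
static size_t current_threshold(int submit_count) {
    int shift = submit_count < 3 ? 3 - submit_count : 0;
    return SUBMIT_THRESHOLD >> shift;
}

// Called after recording each matmul; returns true when the command buffer
// should be submitted to the queue.
static bool should_submit(submit_state & st, size_t weight_bytes) {
    st.bytes_since_submit += weight_bytes;
    if (st.bytes_since_submit >= current_threshold(st.submit_count)) {
        st.bytes_since_submit = 0;
        st.submit_count++;
        return true;
    }
    return false;
}
```

Submitting early and often at the start hides latency before the first results come back, while the larger steady-state threshold keeps per-submit overhead low once the queue is saturated.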
cmake cmake : enable building llama.cpp using system libggml (#12321) 2025-03-17 11:05:23 +02:00
include llama: Add support for RWKV v7 architecture (#12412) 2025-03-18 07:27:50 +08:00
src vulkan: Submit once enough matmul work has been recorded (#12406) 2025-03-19 08:26:26 +01:00
.gitignore vulkan : cmake integration (#8119) 2024-07-13 18:12:39 +02:00
CMakeLists.txt SYCL: using graphs is configurable by environment variable and compile option (#12371) 2025-03-18 11:16:31 +01:00