llama.cpp

Author	SHA1	Message	Date
Atharva Dubey	02cdd2d8b0	sycl: simplify bin_bcast_kernel (#13383 )	2025-05-15 17:39:52 +02:00
Svetlozar Georgiev	64bb51cf90	sycl: reordered Q4_K MMVQ (#13109 )	2025-05-15 17:35:44 +02:00
Łukasz Ślusarczyk	9c404ed54c	sycl: use oneDNN for matrices multiplication (#12972 )	2025-05-15 16:53:41 +02:00
Yibo Cai	5ab5d5fb25	arm64: optimize q6_k_q8_k kernel with i8mm (#13519 )	2025-05-14 21:53:52 +02:00
This PR improves q6_k_q8_k gemm kernel with arm64 i8mm instruction.
Tested on neoverse-n2 with llama3 8b q6_k quantization model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above
Perplexity doesn't change with this PR.
```
// tested on neoverse-n2
$ llama-batched-bench \
    -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
    --no-mmap -fa \
    -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
    -npl 1,2,4,8,16,32 \
    -t 64
---------------------------------------------------------------------
|  PP   |   TG   |  B   |      S_PP t/s       |      S_TG t/s       |
|       |        |      | original | this pr  | original | this pr  |
|-------|--------|------|----------|----------|----------|----------|
|  128  |  128   |  1   |  78.52   |  109.18  |  18.63   |  18.88   |
|  128  |  128   |  2   |  84.62   |  123.94  |  34.54   |  36.92   |
|  128  |  128   |  4   |  84.36   |  122.49  |  52.65   |  61.32   |
|  128  |  128   |  8   |  90.52   |  138.87  |  63.46   |  84.41   |
|  128  |  128   |  16  |  90.11   |  138.56  |  71.04   |  101.33  |
|  128  |  128   |  32  |  89.81   |  137.79  |  75.14   |  110.47  |
---------------------------------------------------------------------
```
Johannes Gäßler	4696d56749	CUDA: fix crash on large batch size for quant. MoE (#13537 )	2025-05-14 16:41:02 +02:00
Johannes Gäßler	6da34fa276	CUDA: faster Deepseek FA, add Turing support (#13435 )	2025-05-14 16:08:20 +02:00
bandoti	09d13d94fb	cmake: simplify vulkan shader test logic (#13263 )	2025-05-14 07:53:57 -03:00
Jeff Bolz	24e86cae72	vulkan: KHR_coopmat flash attention (#13506 ) This shader uses coopmat1 to do the QK^T multiply. The PV multiply is more difficult for various reasons so I haven't done it. Performance for this shader is around 2.5x better than for the scalar shader when doing prompt processing. Some of the benefit may be from other optimizations like staging through shared memory, or splitting by rows.	2025-05-14 11:55:26 +02:00
Jeff Bolz	ab3971f2a0	vulkan: workaround FA compile failures on macos (#13517 )	2025-05-14 06:15:50 +02:00
Georgi Gerganov	f0995d28ce	metal : use FA-vec kernel up to batch size 20 (#13496 ) * batched-bench : fix pp batch contents * metal : optimize multi-sequence FA vec kernel ggml-ci * metal : use FA-vec kernel up to batch size 20 ggml-ci	2025-05-13 18:04:39 +03:00
Georgi Gerganov	c252e0c409	metal : optimize multi-sequence FA vec kernel (#13493 ) * batched-bench : fix pp batch contents * metal : optimize multi-sequence FA vec kernel ggml-ci	2025-05-13 18:04:00 +03:00
Dan Johansson	4f711afed5	ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509 ) Signed-off-by: Dan Johansson <dan.johansson@arm.com>	2025-05-13 18:02:28 +03:00
lhez	f0d46ef157	opencl: remove unnecessary assert for `add` (#13257 )	2025-05-12 13:13:49 -07:00
Johannes Gäßler	10d2af0eaa	llama/ggml: add LLM training support (#10544 ) * llama/ggml: add LLM training support more compact progress bar llama_save_model_to_file llama_opt_param_filter ggml_graph_dup force_grads refactor ggml_opt, fix test-opt * remove logits_all * refactor CUDA implementation for ACC * reset graph at beginning of opt period	2025-05-12 14:44:49 +02:00
Dan Johansson	a71a4075cd	ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (#13053 ) * ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel Signed-off-by: Dan Johansson <dan.johansson@arm.com> * * code review fixes Signed-off-by: Dan Johansson <dan.johansson@arm.com> * * adds a comment that clarifies barrier usage Signed-off-by: Dan Johansson <dan.johansson@arm.com> --------- Signed-off-by: Dan Johansson <dan.johansson@arm.com> Co-authored-by: Charles Xu <charles.xu@arm.com>	2025-05-12 13:06:19 +02:00
Johannes Gäßler	95e18884fc	CUDA: fix misaligned synchronization in FA (#13469 )	2025-05-12 10:51:21 +02:00
Xuan-Son Nguyen	df8491922f	ggml : add mrope kernel for metal (#13457 )	2025-05-12 10:29:13 +02:00
Atharva Dubey	14492144c2	enable dpcpp nightly builds with libraries (#13406 )	2025-05-12 13:15:32 +08:00
Johannes Gäßler	7474e00b34	CUDA: fix crash with partial offloading of MoE (#13439 )	2025-05-11 16:09:33 +02:00
David Huang	7f323a589f	Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (#13386 )	2025-05-11 14:18:39 +02:00
Johannes Gäßler	0208355f42	CUDA: fix race conditions FlashAttention kernels (#13438 )	2025-05-10 22:22:48 +02:00
Johannes Gäßler	d8919424f1	CUDA: fix FlashAttention on Turing (#13415 )	2025-05-10 09:16:52 +02:00
Jeff Bolz	dc1d2adfc0	vulkan: scalar flash attention implementation (#13324 ) * vulkan: scalar flash attention implementation * vulkan: always use fp32 for scalar flash attention * vulkan: use vector loads in scalar flash attention shader * vulkan: remove PV matrix, helps with register usage * vulkan: reduce register usage in scalar FA, but perf may be slightly worse * vulkan: load each Q value once. optimize O reduction. more tuning * vulkan: support q4_0/q8_0 KV in scalar FA * CI: increase timeout to accommodate newly-supported tests * vulkan: for scalar FA, select between 1 and 8 rows * vulkan: avoid using Float16 capability in scalar FA	2025-05-10 08:07:07 +02:00
Alberto Cabrera Pérez	17512a94d6	sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (#12858 ) * sycl : Implemented reorder Q4_0 mmvq Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com> * sycl : Fixed mmvq being called when reorder is disabled * sycl : Improved comments in the quants header Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com> * Use static_assert * safe_div -> ceil_div * Clarify qi comment * change the reorder tensor from init to execute OP * dbg * Undo changes to test-backend-ops * Refactor changes on top of q4_0 reorder fix * Missing Reverts * Refactored opt_for_reorder logic to simplify code path * Explicit inlining and unroll * Renamed mul_mat_algo enum for consistency --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com> Co-authored-by: romain.biessy <romain.biessy@codeplay.com>	2025-05-09 16:34:08 +01:00
Georgi Gerganov	611aa914ef	metal : optimize MoE for large batches (#13388 ) ggml-ci	2025-05-09 15:14:56 +03:00
Johannes Gäßler	0cf6725e9f	CUDA: FA support for Deepseek (Ampere or newer) (#13306 ) * CUDA: FA support for Deepseek (Ampere or newer) * do loop unrolling via C++ template	2025-05-09 13:34:58 +02:00
Johannes Gäßler	5c86c9ed3e	CUDA: fix crash on large batch size for MoE models (#13384 )	2025-05-09 12:14:04 +02:00
Radoslav Gerganov	b486ba05bf	rpc : add rpc_msg_set_tensor_hash_req (#13353 ) * rpc : add rpc_msg_set_tensor_hash_req Use a dedicated struct for the request of RPC_CMD_SET_TENSOR_HASH which makes the code cleaner. * fix	2025-05-09 10:31:07 +03:00
Jeff Bolz	02115dcd9a	vulkan: Allow up to 4096 elements for mul_mat_id row_ids (#13326 ) This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf: GGML_ASSERT(nei0 * nei1 <= 3072); The tensor is 8 x 512. Increase this array size to accommodate.	2025-05-09 09:23:41 +02:00
Alberto Cabrera Pérez	8733e0cf6e	sycl: addressing non-contiguous src1 mul_mats (nc and batched) (#13343 ) * sycl: fixed non-contiguous src1 mul_mats (nc and batched) * Fixed wrong static_cast inside kernel	2025-05-08 10:08:01 +01:00
Daniel Bevenius	13b0a04597	whisper: remove MSVC warnings pragmas (whisper/3090) * ggml : remove MSVC warnings pragmas This commit removes the MSVC-specific pragmas as these are now handled in ggml/CMakeLists.txt. * whisper : remove MSVC warning pragmas This commit removes the MSVC-specific pragmas. These are now handled in the ggml/CMakeLists.txt file.	2025-05-07 17:28:36 +03:00
Jared Tweed	bba9d945c1	cmake : removed stdc++fs (whisper/3097) * removed stdc++fs * kept line, but removed stdc++fs	2025-05-07 17:28:36 +03:00
R0CKSTAR	1f73301b63	cuda : remove nrows_x in mul_mat_q_process_tile (#13325 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-05-07 09:48:23 +02:00
Johannes Gäßler	141a908a59	CUDA: mix virt/real CUDA archs for GGML_NATIVE=OFF (#13135 )	2025-05-06 23:35:51 +02:00
Akarshan Biswas	1e333d5bba	SYCL: Disable reorder optimize by default and stop setting tensor extras when optimize is disabled (#13254 ) * SYCL: Do not set tensor extras when reorder optimize is disabled * SYCL: Disable reorder optimize by default	2025-05-06 20:27:06 +05:30
Johannes Gäßler	2356fb1d53	CUDA: fix bad asserts for partial offload (#13337 )	2025-05-06 13:58:51 +02:00
Johannes Gäßler	15a28ec8c7	CUDA: fix --split-mode row for MMQ (#13323 )	2025-05-06 08:36:46 +02:00
Johannes Gäßler	9070365020	CUDA: fix logic for clearing padding with -ngl 0 (#13320 )	2025-05-05 22:32:13 +02:00
Akarshan Biswas	66645a5285	SYCL: Disable mul_mat kernels for noncontiguous tensor b (#13308 ) ggml-ci	2025-05-05 13:39:10 +05:30
Diego Devesa	9fdfcdaedd	rpc : use backend registry, support dl backends (#13304 )	2025-05-04 21:25:43 +02:00
Aaron Teo	6eb7d25c70	ggml : activate s390x simd for Q3_K (#13301 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-05-04 19:49:12 +02:00
Johannes Gäßler	93c4e23905	CUDA: fix race condition in MMQ stream-k fixup (#13299 )	2025-05-04 14:16:39 +02:00
Johannes Gäßler	8afbd96818	CUDA: fix race condition in MMQ ids_dst (#13294 )	2025-05-04 13:58:38 +02:00
Jeff Bolz	8ae5ebcf85	vulkan: Additional type support for unary, binary, and copy (#13266 ) Support f16->f32 copy. Support f16->f16 and f32->f32 unary ops. Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.	2025-05-04 07:17:16 +02:00
Georgi Gerganov	b34443923c	sync : ggml (#13268 ) * vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204) * vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW) * review: remove src_x/y < 0 checks; add performance tests * sync : ggml ggml-ci * vulkan : fix lint (#0) --------- Co-authored-by: Acly <aclysia@gmail.com>	2025-05-02 20:54:30 +03:00
shalinib-ibm	3f3769ba76	ggml : Enable MMA for BF16 in llamafile_sgemm (#13148 ) This patch upstreams llamafile's CPU matrix multiplication kernels for ppc64le using MMA builtins for the BF16 data type. This change results in 9x - 40x gains in total speed S t/s (i.e. all tokens/total time) across various batch sizes tested using the llama-batched-bench benchmark. The patch is tested with Meta-Llama-3-8B and Mistral-7B models (BF16 models generated by using llama-quantize from the corresponding FP32 models) on an IBM POWER10 machine. Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>	2025-05-02 19:53:12 +03:00
Justin Santa Barbara	8efbdadc61	rpc : avoid uninitialized memory in serialize_tensor (#13210 ) Zero out the name and padding buffers.	2025-05-01 23:32:11 +02:00
Jesse Gross	f057808ffa	ggml: Don't assert fail when tensor data changes (#13222 ) The following scenario will cause an assertion failure in the graph allocator: - Build and allocate a graph containing a tensor with a non-NULL data pointer - Build and allocate a new graph where that data is NULL Result: ggml-alloc.c:819: GGML_ASSERT(talloc->buffer_id >= 0) failed This happens during revalidation because we think that memory should have been previously allocated based on the current graph but in reality the previous graph was different. In this situation, we should do a full reallocation pass.	2025-05-01 22:46:10 +02:00
Diego Devesa	d7a14c42a1	build : fix build info on windows (#13239 ) * build : fix build info on windows * fix cuda host compiler msg	2025-05-01 21:48:08 +02:00
Jeff Bolz	79f26e9e12	vulkan: Add bfloat16 support (#12554 ) * vulkan: Add bfloat16 support This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16. The extension is required for coopmat multiply support, but matrix-vector multiply trivially promotes bf16 to fp32 and doesn't require the extension. The copy/get_rows shaders also don't require the extension. It's probably possible to fall back to non-coopmat and promote to fp32 when the extension isn't supported, but this change doesn't do that. The coopmat support also requires a glslc that supports the extension, which currently requires a custom build. * vulkan: Support bf16 tensors without the bf16 extension or coopmat support Compile a variant of the scalar mul_mm shader that will promote the bf16 values to float, and use that when either the bf16 extension or the coopmat extensions aren't available. * vulkan: bfloat16 fixes (really works without bfloat16 support now) * vulkan: fix spirv-val failure and reenable -O	2025-05-01 20:49:39 +02:00
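
The bfloat16 commit above (79f26e9e12) says that matrix-vector multiply "trivially promotes bf16 to fp32". A minimal sketch of why that promotion is cheap, written in Python rather than the GLSL shaders the commit touches: a bfloat16 value is the upper 16 bits of the corresponding IEEE-754 fp32 bit pattern, so promotion is a 16-bit left shift. The function names here are hypothetical and are not taken from llama.cpp or ggml, and the reverse direction shown truncates rather than rounds, which may differ from what the project does.

```python
import struct


def bf16_bits_to_fp32(bits: int) -> float:
    """Promote a raw 16-bit bfloat16 pattern to the fp32 value it represents.

    Hypothetical helper for illustration; not a llama.cpp/ggml function.
    """
    if not 0 <= bits <= 0xFFFF:
        raise ValueError("bf16 bit pattern must fit in 16 bits")
    # bf16 keeps the sign, the 8 exponent bits and the top 7 mantissa bits
    # of fp32, so placing it in the high half of a 32-bit word is enough.
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def fp32_to_bf16_bits_truncate(value: float) -> int:
    """Demote an fp32 value to bf16 by dropping the low 16 bits (no rounding).

    Assumption: plain truncation, chosen only to keep the sketch short.
    """
    return struct.unpack("<I", struct.pack("<f", value))[0] >> 16


if __name__ == "__main__":
    for pattern in (0x3F80, 0xC000, 0x4049):
        print(f"0x{pattern:04X} -> {bf16_bits_to_fp32(pattern)}")
```

Because the promotion is exact, a kernel can accumulate in fp32 without needing any bf16 arithmetic support, which is consistent with the commit's note that the matrix-vector path does not require the extension.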

935 commits