llama.cpp/ggml/src/ggml-cpu
Yibo Cai 5ab5d5fb25
arm64: optimize q6_k_q8_k kernel with i8mm (#13519)
This PR improves the q6_k_q8_k GEMM kernel with the arm64 i8mm (8-bit integer matrix multiply) instructions.

Tested on Neoverse-N2 with a Llama 3 8B Q6_K quantized model:
- 40% ~ 54% S_PP (prompt processing speed) uplift across all batch sizes
- 16% ~ 47% S_TG (text generation speed) uplift for batch sizes 4 and above

Perplexity doesn't change with this PR.
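
For context, the core i8mm building block is the SMMLA instruction, exposed through the ACLE intrinsic `vmmlaq_s32`. The sketch below is illustrative only, not the actual kernel in ggml-cpu-quants.c, and `mla_2x2_i8mm` is a hypothetical helper name; it shows how one SMMLA accumulates a 2x2 int32 tile from two 2x8 int8 operands, which is the primitive such a GEMM kernel is built around.

```c
// Illustrative sketch only: one SMMLA step via the ACLE intrinsic vmmlaq_s32.
// vmmlaq_s32 multiplies a 2x8 int8 matrix by the transpose of another 2x8
// int8 matrix and accumulates the 2x2 int32 result into the accumulator.
#include <arm_neon.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_MATMUL_INT8)
// Hypothetical helper: C(2x2) += A(2x8) * B(2x8)^T for one 8-wide slice of K.
static inline int32x4_t mla_2x2_i8mm(int32x4_t acc,
                                     const int8_t *a,   // 2 rows x 8 int8, row-major
                                     const int8_t *b) { // 2 rows x 8 int8, row-major
    const int8x16_t va = vld1q_s8(a); // rows a0, a1 in one 128-bit register
    const int8x16_t vb = vld1q_s8(b); // rows b0, b1 in one 128-bit register
    // resulting lanes of acc: { a0.b0, a0.b1, a1.b0, a1.b1 }
    return vmmlaq_s32(acc, va, vb);
}
#endif
```

Roughly speaking, each SMMLA performs twice the multiply-accumulates of an SDOT and amortizes operand loads across two rows at once, which is consistent with the gains being largest for batched prompt processing and batch sizes above 1.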

```
# tested on Neoverse-N2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
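
The kernel only benefits machines that actually expose FEAT_I8MM (ggml reports this as `MATMUL_INT8` in its system info). A minimal, hedged way to check for it at runtime on aarch64 Linux, not part of this PR, is to query the hwcap auxiliary vector:

```c
// Hedged sketch, not from the PR: detect FEAT_I8MM on aarch64 Linux via hwcaps.
#include <stdio.h>
#include <sys/auxv.h>
#ifdef __aarch64__
#include <asm/hwcap.h>
#endif

int main(void) {
#if defined(__aarch64__) && defined(HWCAP2_I8MM)
    const unsigned long hwcap2 = getauxval(AT_HWCAP2);
    printf("i8mm: %s\n", (hwcap2 & HWCAP2_I8MM) ? "supported" : "not supported");
#else
    printf("i8mm: detection not available on this platform\n");
#endif
    return 0;
}
```

On Neoverse-N2 (Armv9.0-A) this should report supported; the intrinsic path is only compiled when the build targets an architecture that includes i8mm (e.g. a native build on such a CPU, or an explicit `-march=...+i8mm`).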
2025-05-14 21:53:52 +02:00
| Name | Last commit | Date |
|------|-------------|------|
| amx | ggml : upgrade init_tensor API to return a ggml_status (#11854) | 2025-02-28 14:41:47 +01:00 |
| cmake | ggml : build backends as libraries (#10256) | 2024-11-14 18:04:35 +01:00 |
| kleidiai | ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509) | 2025-05-13 18:02:28 +03:00 |
| llamafile | ggml : Enable MMA for BF16 in llamafile_sgemm (#13148) | 2025-05-02 19:53:12 +03:00 |
| binary-ops.cpp | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| binary-ops.h | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| CMakeLists.txt | ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509) | 2025-05-13 18:02:28 +03:00 |
| common.h | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| cpu-feats-x86.cpp | ggml : add SSE 4.2 and x64 base variant for CPUs without AVX (#12871) | 2025-04-21 18:13:51 +02:00 |
| ggml-cpu-aarch64.cpp | whisper: remove MSVC warnings pragmas (whisper/3090) | 2025-05-07 17:28:36 +03:00 |
| ggml-cpu-aarch64.h | ggml : refactor online repacking (#10446) | 2024-12-07 14:37:50 +02:00 |
| ggml-cpu-hbm.cpp | ggml : refactor online repacking (#10446) | 2024-12-07 14:37:50 +02:00 |
| ggml-cpu-hbm.h | ggml : refactor online repacking (#10446) | 2024-12-07 14:37:50 +02:00 |
| ggml-cpu-impl.h | ggml-cpu-impl.h: do not redefine bool on POWER9 (#12856) | 2025-04-10 01:00:34 +02:00 |
| ggml-cpu-quants.c | arm64: optimize q6_k_q8_k kernel with i8mm (#13519) | 2025-05-14 21:53:52 +02:00 |
| ggml-cpu-quants.h | ggml : build backends as libraries (#10256) | 2024-11-14 18:04:35 +01:00 |
| ggml-cpu-traits.cpp | ggml : refactor online repacking (#10446) | 2024-12-07 14:37:50 +02:00 |
| ggml-cpu-traits.h | ggml : refactor online repacking (#10446) | 2024-12-07 14:37:50 +02:00 |
| ggml-cpu.c | arm64: optimize q6_k_q8_k kernel with i8mm (#13519) | 2025-05-14 21:53:52 +02:00 |
| ggml-cpu.cpp | rpc : use backend registry, support dl backends (#13304) | 2025-05-04 21:25:43 +02:00 |
| ops.cpp | whisper: remove MSVC warnings pragmas (whisper/3090) | 2025-05-07 17:28:36 +03:00 |
| ops.h | ggml : Depthwise 2D convolution (ggml/1152) | 2025-04-24 17:32:47 +03:00 |
| simd-mappings.h | ggml : fix ppc64le build (#13176) | 2025-04-30 13:17:08 +02:00 |
| unary-ops.cpp | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| unary-ops.h | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| vec.cpp | whisper: remove MSVC warnings pragmas (whisper/3090) | 2025-05-07 17:28:36 +03:00 |
| vec.h | cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167) | 2025-04-07 18:44:17 +03:00 |