llama.cpp/ggml/src/ggml-cpu
Yibo Cai 5ab5d5fb25
arm64: optimize q6_k_q8_k kernel with i8mm (#13519)
This PR improves the q6_k_q8_k GEMM kernel with the arm64 i8mm (8-bit integer matrix multiply) instructions.

Tested on Neoverse-N2 with a Llama 3 8B Q6_K quantized model:
- 40% ~ 54% S_PP (prompt processing speed) uplift across all batch sizes
- 16% ~ 47% S_TG (text generation speed) uplift for batch sizes 4 and above

Perplexity doesn't change with this PR.
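
For context, the core i8mm building block is the SMMLA instruction, exposed through the ACLE intrinsic `vmmlaq_s32`. The sketch below is illustrative only, not the actual kernel in ggml-cpu-quants.c, and `mla_2x2_i8mm` is a hypothetical helper name; it shows how one SMMLA accumulates a 2x2 int32 tile from two 2x8 int8 operands, which is the primitive such a GEMM kernel is built around.

```c
// Illustrative sketch only: one SMMLA step via the ACLE intrinsic vmmlaq_s32.
// vmmlaq_s32 multiplies a 2x8 int8 matrix by the transpose of another 2x8
// int8 matrix and accumulates the 2x2 int32 result into the accumulator.
#include <arm_neon.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_MATMUL_INT8)
// Hypothetical helper: C(2x2) += A(2x8) * B(2x8)^T for one 8-wide slice of K.
static inline int32x4_t mla_2x2_i8mm(int32x4_t acc,
                                     const int8_t *a,   // 2 rows x 8 int8, row-major
                                     const int8_t *b) { // 2 rows x 8 int8, row-major
    const int8x16_t va = vld1q_s8(a); // rows a0, a1 in one 128-bit register
    const int8x16_t vb = vld1q_s8(b); // rows b0, b1 in one 128-bit register
    // resulting lanes of acc: { a0.b0, a0.b1, a1.b0, a1.b1 }
    return vmmlaq_s32(acc, va, vb);
}
#endif
```

Roughly speaking, each SMMLA performs twice the multiply-accumulates of an SDOT and amortizes operand loads across two rows at once, which is consistent with the gains being largest for batched prompt processing and batch sizes above 1.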

```
# tested on Neoverse-N2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
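
The kernel only benefits machines that actually expose FEAT_I8MM (ggml reports this as `MATMUL_INT8` in its system info). A minimal, hedged way to check for it at runtime on aarch64 Linux, not part of this PR, is to query the hwcap auxiliary vector:

```c
// Hedged sketch, not from the PR: detect FEAT_I8MM on aarch64 Linux via hwcaps.
#include <stdio.h>
#include <sys/auxv.h>
#ifdef __aarch64__
#include <asm/hwcap.h>
#endif

int main(void) {
#if defined(__aarch64__) && defined(HWCAP2_I8MM)
    const unsigned long hwcap2 = getauxval(AT_HWCAP2);
    printf("i8mm: %s\n", (hwcap2 & HWCAP2_I8MM) ? "supported" : "not supported");
#else
    printf("i8mm: detection not available on this platform\n");
#endif
    return 0;
}
```

On Neoverse-N2 (Armv9.0-A) this should report supported; the intrinsic path is only compiled when the build targets an architecture that includes i8mm (e.g. a native build on such a CPU, or an explicit `-march=...+i8mm`).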
2025-05-14 21:53:52 +02:00
| Name | Last commit | Date |
|------|-------------|------|
| amx | ggml : upgrade init_tensor API to return a ggml_status (#11854) | 2025-02-28 14:41:47 +01:00 |
| cmake | ggml : build backends as libraries (#10256) | 2024-11-14 18:04:35 +01:00 |
| kleidiai | ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509) | 2025-05-13 18:02:28 +03:00 |
| llamafile | ggml : Enable MMA for BF16 in llamafile_sgemm (#13148) | 2025-05-02 19:53:12 +03:00 |
| binary-ops.cpp | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| binary-ops.h | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| CMakeLists.txt | ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509) | 2025-05-13 18:02:28 +03:00 |
| common.h | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| cpu-feats-x86.cpp | ggml : add SSE 4.2 and x64 base variant for CPUs without AVX (#12871) | 2025-04-21 18:13:51 +02:00 |
| ggml-cpu-aarch64.cpp | whisper: remove MSVC warnings pragmas (whisper/3090) | 2025-05-07 17:28:36 +03:00 |
| ggml-cpu-aarch64.h | ggml : refactor online repacking (#10446) | 2024-12-07 14:37:50 +02:00 |
| ggml-cpu-hbm.cpp | ggml : refactor online repacking (#10446) | 2024-12-07 14:37:50 +02:00 |
| ggml-cpu-hbm.h | ggml : refactor online repacking (#10446) | 2024-12-07 14:37:50 +02:00 |
| ggml-cpu-impl.h | ggml-cpu-impl.h: do not redefine bool on POWER9 (#12856) | 2025-04-10 01:00:34 +02:00 |
| ggml-cpu-quants.c | arm64: optimize q6_k_q8_k kernel with i8mm (#13519) | 2025-05-14 21:53:52 +02:00 |
| ggml-cpu-quants.h | ggml : build backends as libraries (#10256) | 2024-11-14 18:04:35 +01:00 |
| ggml-cpu-traits.cpp | ggml : refactor online repacking (#10446) | 2024-12-07 14:37:50 +02:00 |
| ggml-cpu-traits.h | ggml : refactor online repacking (#10446) | 2024-12-07 14:37:50 +02:00 |
| ggml-cpu.c | arm64: optimize q6_k_q8_k kernel with i8mm (#13519) | 2025-05-14 21:53:52 +02:00 |
| ggml-cpu.cpp | rpc : use backend registry, support dl backends (#13304) | 2025-05-04 21:25:43 +02:00 |
| ops.cpp | whisper: remove MSVC warnings pragmas (whisper/3090) | 2025-05-07 17:28:36 +03:00 |
| ops.h | ggml : Depthwise 2D convolution (ggml/1152) | 2025-04-24 17:32:47 +03:00 |
| simd-mappings.h | ggml : fix ppc64le build (#13176) | 2025-04-30 13:17:08 +02:00 |
| unary-ops.cpp | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| unary-ops.h | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| vec.cpp | whisper: remove MSVC warnings pragmas (whisper/3090) | 2025-05-07 17:28:36 +03:00 |
| vec.h | cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167) | 2025-04-07 18:44:17 +03:00 |