Sheldon Robinson
90e4dba461
Fix #11802 : Compile bug - RegQueryValueExA changed to RegQueryValueEx ( #11803 )
...
* Fix #11802 : Compile bug - RegQueryValueExA changed to RegQueryValueEx
* Fix #11802 : PR #11803 - keep RegQueryValueExA, remove TEXT macro, description needs to be ANSI string
2025-02-11 16:55:45 +01:00
Karol Kontny
4d3465c5ae
ggml: Fix data race in ggml threadpool ( #11736 )
...
After the barrier in last iteration is executed, still the loop termination
condition will be executed. However main thread can destroy the cgraph object
and its nodes already, then another thread will access it, but the thing is already gone.
Also trouble can happen when n_nodes == 0 or abort is called, but I'm not sure if the
prior situation is possible.
Last syncronization should be done after the loop to ensure the cgraph/cplan won't be
accessed after the main thread exits from the function.
2025-02-08 15:30:53 +01:00
Jinyang He
225bbbfa39
ggml : optimize and build warning fix for LoongArch ( #11709 )
...
* ggml : optimize convert f32<->f16 for loongarch_asx
* ggml : optimize loongarch_asx extend i16,i8,u8 to i32,i16
* ggml : Fix warnings when run cpu CI locally on LoongArch
2025-02-07 09:38:31 +02:00
junchao-zhao
8d4d2be143
ggml : fix LoongArch compile error with 128-bit SIMD ( #11701 )
2025-02-06 11:20:00 +02:00
issixx
d2e518e9b4
ggml-cpu : fix ggml_graph_compute_thread did not terminate on abort. (ggml/1065)
...
some threads kept looping and failed to terminate properly after an abort during CPU execution.
Co-authored-by: issi <issi@gmail.com>
2025-01-29 11:24:51 +02:00
Johannes Gäßler
8137b4bb2b
CPU/CUDA: fix (GQA) mul mat back, add CUDA support ( #11380 )
2025-01-24 12:38:31 +01:00
Jeff Bolz
bd38ddea01
vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl ( #11166 )
...
* vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl
Shaders are based on cpy.cu.
* vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32
* ggml: copy q->f32 assumes some contiguity in the destination
2025-01-16 22:47:10 +01:00
Johannes Gäßler
9c8dcefe17
CUDA: backwards pass for misc. ops, add tests ( #11257 )
...
* CUDA: backwards pass for misc. ops, add tests
* remove restrict from pointers
2025-01-16 16:43:38 +01:00
fj-y-saito
c67cc9837d
ggml: aarch64: implement SVE kernels for q4_K_q8_K vector dot ( #11227 )
...
* Add SVE support for q4_K_q8_K
* Update ggml/src/ggml-cpu/ggml-cpu-quants.c
change to use K_SCALE_SIZE
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-16 11:11:49 +02:00
Johannes Gäßler
432df2d5f9
RoPE: fix back, CUDA support for back + noncont. ( #11240 )
...
* RoPE: fix back, CUDA support for back + noncont.
* fix comments reg. non-cont. RoPE support [no-ci]
2025-01-15 12:51:37 +01:00
Molly Sophia
ee7136c6d1
llama: add support for QRWKV6 model architecture ( #11001 )
...
llama: add support for QRWKV6 model architecture (#11001 )
* WIP: Add support for RWKV6Qwen2
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* RWKV: Some graph simplification
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Add support for RWKV6Qwen2 with cpu and cuda GLA
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix some typos
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* code format changes
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix wkv test & add gla test
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix cuda warning
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Update README.md
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Update ggml/src/ggml-cuda/gla.cu
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Fix fused lerp weights loading with RWKV6
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* better sanity check skipping for QRWKV6 in llama-quant
thanks @compilade
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: compilade <git@compilade.net>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2025-01-10 09:58:08 +08:00
amritahs-ibm
8cef75c743
llamafile : ppc64le MMA INT8 implementation ( #10912 )
...
This change upstreams llamafile's cpu matrix
multiplication kernels for ppc64le using MMA
builtins for quantised int8 datatype.
This change results in 10% - 70% improvement
in total speed(ie all tokens/total time), across
various batch sizes.
The patch is tested with Meta-Lllama-3-8B,
Mistral-7B, Llama-2-7B-chat-hf models on a
IBM POWER10 machine.
Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
2025-01-08 12:54:19 +02:00
Diego Devesa
017cc5f446
ggml-backend : only offload from host buffers (fix) ( #11124 )
2025-01-07 16:11:57 +01:00
Srihari-mcw
0827b2c1da
ggml : fixes for AVXVNNI instruction set with MSVC and Clang ( #11027 )
...
* Fixes for clang AVX VNNI
* enable AVX VNNI and alder lake build for MSVC
* Apply suggestions from code review
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-12-31 15:23:33 +01:00
Djip007
2cd43f4900
ggml : more perfo with llamafile tinyblas on x86_64 ( #10714 )
...
* more perfo with llamafile tinyblas on x86_64.
- add bf16 suport
- change dispache strategie (thanks:
https://github.com/ikawrakow/ik_llama.cpp/pull/71 )
- reduce memory bandwidth
simple tinyblas dispache and more cache freindly
* tinyblas dynamic dispaching
* sgemm: add M blocs.
* - git 2.47 use short id of len 9.
- show-progress is not part of GNU Wget2
* remove not stable test
2024-12-24 18:54:49 +01:00
Diego Devesa
60cfa728e2
ggml : use wstring for backend search paths ( #10960 )
...
ggml-ci
2024-12-24 04:05:27 +01:00
Diego Devesa
3327bb0f8d
ggml : fix arm enabled features check ( #10961 )
2024-12-24 04:05:17 +01:00
Diego Devesa
32d6ee6385
ggml : fix const usage in SSE path ( #10962 )
2024-12-23 20:25:52 +01:00
Adrien Gallouët
e34c5af43f
ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0() ( #10874 )
...
* ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0()
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ggml-cpu: format code
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2024-12-21 00:33:37 +01:00
Diego Devesa
21ae3b9be8
ggml : add test for SVE and disable when it fails ( #10906 )
2024-12-20 13:31:28 +01:00
Adrien Gallouët
a3c33b1dce
ggml: fix arm build with gcc ( #10895 )
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2024-12-19 14:20:41 +01:00
Diego Devesa
9177484f58
ggml : fix arm build ( #10890 )
...
* ggml: GGML_NATIVE uses -mcpu=native on ARM
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ggml: Show detected features with GGML_NATIVE
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* remove msvc support, add GGML_CPU_ARM_ARCH option
* disable llamafile in android example
* march -> mcpu, skip adding feature macros
ggml-ci
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: Adrien Gallouët <angt@huggingface.co>
2024-12-18 23:21:42 +01:00
Georgi Gerganov
78f766768d
cmake : fix "amd64" processor string (whisper/2638)
2024-12-17 18:35:49 +02:00
Georgi Gerganov
0006f5a74a
ggml : update ggml_backend_cpu_device_supports_op ( #10867 )
...
* ggml : fix cpy op for IQ-quants to use reference impl
ggml-ci
* ggml : disable tests involving i-matrix quantization
* ggml : update ggml_backend_cpu_device_supports_op
ggml-ci
2024-12-17 18:35:42 +02:00
HimariO
ba1cb19cdd
llama : add Qwen2VL support + multimodal RoPE ( #10361 )
...
* Barebone Qwen2VL LLM convertor
* Add Qwen2VL cli entrypoint
* [WIP] add qwen2vl arch
* Verify m-rope output
* Add vl-rope/2d-rope support for qwen2vl ViT
* update qwen2vl cli tool
* update 5D tensor op workaround
* [WIP] qwen2vl vision model
* make batch and clip utils compatible with qwen2vl
* [WIP] create inference workflow, gguf convert script but fix
* correcting vision-rope behavior, add the missing last layer back to ViT
* add arg parser to qwen2vl_surgery
* replace variable size array with vector
* cuda-gdb cmake preset
* add fp32 mrope, vision rope kernel
* add fp16 support for qwen2vl and m-rope
* add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION`
* fix rope op mode switching, out dated func args
* update `llama_hparams`
* update to keep up stream changes
* resolve linter, test errors
* add makefile entry, update speical image padding token
* add mrope unit test, fix few compiler warnings
* rename `mrope` related function, params
* minor updates on debug util, bug fixs
* add `m-rope` testcase to `test-backend-ops`
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix traililng whitespce
* store `llama_hparams.rope_sections` with fixed size array
* update position id tensor size check in GGML_OP_ROPE
* minor updates
* update `ggml_backend_*_supports_op` of unsupported backends
* remote old `rope_section` compare operator
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-14 14:43:46 +02:00
Karol Kontny
d583cd03f6
ggml : Fix compilation issues on ARM platform when building without fp16 ( #10811 )
2024-12-13 01:04:19 +01:00
Diego Devesa
cb13ef85a4
remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS ( #10797 )
...
other windows build fixes
2024-12-12 19:02:49 +01:00
Georgi Gerganov
d9c3ba2b77
ggml : disable iq4_nl interleave size 8 ( #10709 )
...
ggml-ci
2024-12-07 18:38:15 +02:00
Djip007
19d8762ab6
ggml : refactor online repacking ( #10446 )
...
* rename ggml-cpu-aarch64.c to .cpp
* reformat extra cpu backend.
- clean Q4_0_N_M and IQ4_0_N_M
- remove from "file" tensor type
- allow only with dynamic repack
- extract cpu extra bufts and convert to C++
- hbm
- "aarch64"
- more generic use of extra buffer
- generalise extra_supports_op
- new API for "cpu-accel":
- amx
- aarch64
* clang-format
* Clean Q4_0_N_M ref
Enable restrict on C++
* add op GGML_OP_MUL_MAT_ID for Q4_0_N_M with runtime repack
* added/corrected control on tensor size for Q4 repacking.
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add debug logs on repacks.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-07 14:37:50 +02:00
PAB
a8cbab201d
ggml: add GGML_SET
Metal kernel + i32 CPU kernel (ggml/1037)
...
* implemented cpu kernel
* add i32 test cases in test-backend-ops
* typedef `ggml_metal_kargs_set`
* implemented `kernel_set`
* memcpy
2024-12-05 13:27:33 +02:00
PAB
c2082d93a8
ggml : add GGML_PAD_REFLECT_1D
operation (ggml/1034)
...
* ggml_pad_reflect_1d defined in header
* implemented on CPU
* called the forward pass
* impl Metal kernel
* added Metal kernel
* added OP_PAD_REFLECT_1D in test-backend-ops.cpp
* add test-pad-reflect-1d test case
* test case support multiple backend
2024-12-05 13:27:31 +02:00
Diego Devesa
59f4db1088
ggml : add predefined list of CPU backend variants to build ( #10626 )
...
* ggml : add predefined list of CPU backend variants to build
* update CPU dockerfiles
2024-12-04 14:45:40 +01:00
Diego Devesa
2803540814
ggml-cpu : fix HWCAP2_I8MM value ( #10646 )
2024-12-04 14:40:44 +01:00
Diego Devesa
3420909dff
ggml : automatic selection of best CPU backend ( #10606 )
...
* ggml : automatic selection of best CPU backend
* amx : minor opt
* add GGML_AVX_VNNI to enable avx-vnni, fix checks
2024-12-01 16:12:41 +01:00
Adrien Gallouët
0c39f44d70
ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_q4_0_4x4_q8_0() ( #10567 )
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2024-11-30 09:13:18 -08:00
Diego Devesa
7cc2d2c889
ggml : move AMX to the CPU backend ( #10570 )
...
* ggml : move AMX to the CPU backend
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-29 21:54:58 +01:00
Georgi Gerganov
f0678c5ff4
ggml : fix I8MM Q4_1 scaling factor conversion ( #10562 )
...
ggml-ci
2024-11-29 16:25:39 +02:00
Shupei Fan
4b3242bbea
ggml-cpu: fix typo in gemv/gemm iq4_nl_4_4 ( #10580 )
2024-11-29 14:49:02 +01:00
Georgi Gerganov
dc22344088
ggml : remove redundant copyright notice + update authors
2024-11-28 20:46:40 +02:00
Georgi Gerganov
76b27d29c2
ggml : fix row condition for i8mm kernels ( #10561 )
...
ggml-ci
2024-11-28 14:56:37 +02:00
Georgi Gerganov
eea986f215
cmake : fix ARM feature detection ( #10543 )
...
ggml-ci
2024-11-28 14:56:23 +02:00
Shupei Fan
c202cef168
ggml-cpu: support IQ4_NL_4_4 by runtime repack ( #10541 )
...
* ggml-cpu: support IQ4_NL_4_4 by runtime repack
* ggml-cpu: add __ARM_FEATURE_DOTPROD guard
2024-11-28 13:52:03 +01:00
Charles Xu
25669aa92c
ggml-cpu: cmake add arm64 cpu feature check for macos ( #10487 )
...
* ggml-cpu: cmake add arm64 cpu feature check for macos
* use vmmlaq_s32 for compile option i8mm check
2024-11-26 13:37:05 +02:00
Diego Devesa
5931c1f233
ggml : add support for dynamic loading of backends ( #10469 )
...
* ggml : add support for dynamic loading of backends
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-25 15:13:39 +01:00
Diego Devesa
55ed008b2d
ggml : do not use ARM features not included in the build ( #10457 )
2024-11-23 14:41:12 +01:00
haopeng
42ae10bbcd
add cmake rvv support ( #10411 )
2024-11-19 21:10:31 +01:00
FirstTimeEZ
a43178299c
ggml : fix undefined reference to 'getcpu' ( #10354 )
...
https://github.com/ggerganov/llama.cpp/issues/10352
2024-11-17 10:39:22 +02:00
Johannes Gäßler
8a43e940ab
ggml: new optimization interface (ggml/988)
2024-11-17 08:30:29 +02:00
Georgi Gerganov
db4cfd5dbc
llamafile : fix include path ( #0 )
...
ggml-ci
2024-11-16 20:36:26 +02:00
Dan Johansson
1e58ee1318
ggml : optimize Q4_0 into Q4_0_X_Y repack ( #10324 )
2024-11-16 01:53:37 +01:00