Commit graph

5576 commits

Author SHA1 Message Date
Olivier Chafik
c9bbc77931
server: update deepseek reasoning format (pass reasoning_content as diffs) (#13933)
* server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat
* update unit/test_tool_call.py::test_thoughts
2025-06-02 10:15:44 -07:00
Xuan-Son Nguyen
bfd322796c
mtmd : fix memory leak in mtmd_helper_eval_chunk_single (#13961)
* mtmd : fix memory in mtmd_helper_eval_chunk_single

* mtmd-cli : fix mem leak

* Update tools/mtmd/mtmd-cli.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-02 16:29:28 +02:00
shalinib-ibm
093e3f1feb
cmake : Handle mixed-case 'Power' strings in POWER CPU detection (#13966)
Some systems report the CPU implementation as "Power11" instead of "POWER11".
The existing CMake logic uses a case-sensitive regular expression to extract
the CPU generation, which fails when the casing doesn't exactly match "POWER".

This patch provides a fix by first converting the string to uppercase before applying the regex.

Signed-off-by: root <root@rheldb2v.pperf.tadn.ibm.com>
Co-authored-by: root <root@rheldb2v.pperf.tadn.ibm.com>
2025-06-02 15:18:36 +03:00
Atharva Dubey
663445b0de
sycl: quantize and reorder the input to q8_1 when reorder is enabled (#13826)
* [WIP]: fuse q8 quantization and reorder

* wip2: fuse q8 quantization and reorder

* working q8 reorder commit

* restored common.hpp

* remove debug prints

* remove unnecessary headers and remove trailing whitespace

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com>

---------

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com>
2025-06-02 10:12:20 +01:00
Johannes Gäßler
7675c555a1
gguf: fix failure on version == 0 (#13956) 2025-06-01 18:08:05 +02:00
Sigbjørn Skjæret
5e1c3aed40
convert : fix nomic-bert-moe mask token (#13757) 2025-06-01 18:07:21 +02:00
Sigbjørn Skjæret
c496fe0b1d
convert : fix vocab padding code for bert models (#13954) 2025-06-01 17:23:11 +02:00
Aaron Teo
e57bb87ced
ggml: check if non-native endian model is being loaded (#13943)
* gguf: prevent non-native endian models from being loaded

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* gguf: update error message

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* gguf: make the non-native endian check more verbose

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: move ggml_assert location

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: reword the endianness check error message

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-06-01 16:53:57 +02:00
Georgi Gerganov
f3a4b1659c sync : ggml
ggml-ci
2025-06-01 13:43:57 +03:00
Kai Pastor
108009f5c7 vulkan : Remove unexpected ; (ggml/1253) 2025-06-01 13:43:57 +03:00
Kai Pastor
d337252acf cmake : Fix broken CMake error messages (ggml/1252) 2025-06-01 13:43:57 +03:00
Radoslav Gerganov
af6f91db47 ggml : remove ggml_graph_import and ggml_graph_export declarations (ggml/1247)
The implementation is already deleted with commit 9d0762e.

closes: #1235
2025-06-01 13:43:57 +03:00
Georgi Gerganov
a7b8d35f78 sync : whisper.cpp (ggml/1250)
* ggml : Fix backtrace breaking Windows build (whisper/3203)

* sync : whisper.cpp

ggml-ci

---------

Co-authored-by: Daniel Tang <danielzgtg.opensource@gmail.com>
2025-06-01 13:43:57 +03:00
Radoslav Gerganov
6eba72b71c ggml : install dynamic backends (ggml/1240)
* ggml : install dynamic backends

Make sure dynamic backends are installed in $CMAKE_INSTALL_BINDIR
2025-06-01 13:43:57 +03:00
Daniel Tang
fedf034a98 ggml : Print backtrace on uncaught C++ exceptions (ggml/1232)
The goal is to have what users call "full logs" contain the backtrace.

This is registered upon ggml_init. Also fixes a minor fd leak on Linux.
2025-06-01 13:43:57 +03:00
ddh0
8726392d3d
readme : update bindings (#13950) 2025-06-01 11:44:30 +03:00
Georgi Gerganov
c04621711a
parallel : fix n_junk == 0 (#13952) 2025-06-01 11:42:16 +03:00
Georgi Gerganov
0fc16b42e8
kv-cache : split implementation in separate sources (#13920)
ggml-ci
2025-06-01 11:39:27 +03:00
Max Krasnyansky
053b1539c0
threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling (#12995)
* threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling

We talked about adding LOW priority for GGML threads in the original threadpool PR.
It might be useful for some cases to avoid contention.

Latest Windows ARM64 releases started parking (offlining) the CPU cores
more aggresively which results in suboptimal performance with n_threads > 4.
To deal with that we now disable Power Throttling for our threads for the NORMAL
and higher priorities.

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* threading: disable SetThreadInfo() calls for older Windows versions

* Update tools/llama-bench/llama-bench.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-05-31 15:39:19 -07:00
Jiří Podivín
b3a89c3d9e
docs : Note about necessity of having libcurl installed for standard build. (#13945)
Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
2025-05-31 18:58:35 +02:00
Olivier Chafik
e15898d1c7
server: allow unclosed thinking tags (#13931) 2025-05-31 08:26:10 -07:00
Georgi Gerganov
803f8baf4f
llama : deprecate explicit kv_self defrag/update calls (#13921)
ggml-ci
2025-05-31 15:58:33 +03:00
Georgi Gerganov
3600cc2886
llama : use n_swa + n_ubatch cells for SWA cache (#13833)
* llama : use n_swa + n_ubatch cells for SWA cache

ggml-ci

* llama : add warning about multi-sqeuence SWA contexts
2025-05-31 15:57:44 +03:00
igardev
c7e0a2054b
webui : Replace alert and confirm with custom modals. (#13711)
* Replace alert and confirm with custom modals. This is needed as Webview in VS Code doesn't permit alert and confirm for security reasons.

* use Modal Provider to simplify the use of confirm and alert modals.

* Increase the z index of the modal dialogs.

* Update index.html.gz

* also add showPrompt

* rebuild

---------

Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-31 11:56:08 +02:00
Georgi Gerganov
3f55f781f1
llama : auto-batch preparation (#13845)
* llama : auto-batch

ggml-ci

* context : simplify if branching
2025-05-31 12:55:57 +03:00
Xuan-Son Nguyen
51fa76f172
mtmd : drop _shared from libmtmd name, merge helpers into libmtmd (⚠️ breaking change) (#13917)
* mtmd : fix missing public header

* no object

* apply suggestion from Georgi

* rm mtmd-helper, merge it to mtmd

* missing vendor include dir
2025-05-31 10:14:29 +02:00
Georgi Gerganov
12d0188c0d
kv-cache : refactor + add llama_memory_state_i (#13746)
* kv-cache : simplify the "struct llama_kv_cache" interface

ggml-ci

* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)

ggml-ci

* kv-cache : some comments

ggml-ci

* context : fix graph reserve for multiple sequences

ggml-ci

* kv-cache : fix typo [no ci]

* kv-cache : fix find_slot() logic for free slots

ggml-ci

* llama : add TODO for deprecating the defrag API in the future

* kv-cache : improve find_slot() using min/max seq pos info

ggml-ci

* llama : handle aborts and compute errors

ggml-ci

* memory : extract state into llama_memory_state

ggml-ci

* kv-cache : add comments

ggml-ci

* server : update batching logic to reset n_batch on successful decode

* server : upon full re-processing, remove the sequence from the cache

* kv-cache : add TODO for doing split_equal when split_simple fails

ggml-ci
2025-05-31 10:24:04 +03:00
Shawn yang
eb3949938e
CUDA: add a prop in ggml_cuda_device_infor for distinguish iGPU or dGPU in cuda (#13856) (#13895)
* 1.  add "integrated" in ggml_cuda_device_info for distinguish whether it is Intergrate_gpu or discrete_gpu
2. Adjust the func:"ggml_backend_cuda_device_supports_buft" for this new feature

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted code indentation

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Fixed incorrect setting of variable types

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted the judgment logic

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* add a host_buft assert in case of integrated_cuda_device with func:'evaluate_and_capture_cuda_graph()'

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Add a defensive security assert

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted the support judgment logic.

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* revoke the suggest commit changes due to it's not applicable in jetson_device

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Add parentheses to enforce operator precedence​

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Fix ci bug: add a spaces

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: yangxiao <yang_xl@tju.edu.cn>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: yangxiao <yangxl_zz@qq.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-05-31 08:48:04 +02:00
Johannes Gäßler
e562eece7c
CUDA: fix typo in FlashAttention code (#13926) 2025-05-30 21:22:03 +02:00
Diego Devesa
b47ab7b8e9
sched : avoid changing cur_copy when a graph is already allocated (#13922) 2025-05-30 18:56:19 +02:00
Georgi Gerganov
dd665cc9d4
parallel : increase the variability of the prompt lengths (#13927)
ggml-ci
2025-05-30 19:38:07 +03:00
Diego Devesa
df0c0c7d02
cuda : prevent using split buffers with 3d/4d matrices (#13919) 2025-05-30 16:37:18 +02:00
Akarshan Biswas
b49a8ff96b
SYCL: Add mrope kernel (#13755)
* SYCL: Add mrope kernel

* feat: Optimize rope operations with vectorization

Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution.

* Use ceil_div
2025-05-30 19:40:57 +05:30
Georgi Gerganov
53f925074d
sync : vendor (#13901)
* sync : vendor

ggml-ci

* cont : fix httplib version

ggml-ci

* cont : fix lint

* cont : fix lint

* vendor : move to common folder /vendor

ggml-ci

* cont : fix lint

* cont : move httplib to /vendor + use json_fwd.hpp

ggml-ci

* cont : fix server build

ggml-ci

* cont : add missing headers

ggml-ci

* cont : header clean-up

ggml-ci
2025-05-30 16:25:45 +03:00
Sigbjørn Skjæret
db38704f01
convert : fix rwkv bos/eos token (#13844) 2025-05-30 14:50:43 +02:00
Xuan-Son Nguyen
07e4351ce6
convert : allow partial update to the chkhsh pre-tokenizer list (#13847)
* convert : allow partial update to the chkhsh pre-tokenizer list

* code style

* update tokenizer out

* rm inp/out files for models not having gguf

* fixed hash for glm

* skip nomic-bert-moe test

* Update convert_hf_to_gguf_update.py

* fix minerva-7b hash

* rm redundant import
2025-05-30 12:24:37 +02:00
Đinh Trọng Huy
291f2b6913
llama : add support for DistilBert (#13907)
* add distilbert

* small fixes

* add note for LLM_ARCH_DISTIL_BERT

* Use MODEL_ARCH.BERT for DistilBert

---------

Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-30 11:56:02 +02:00
zhangkaihuo
2c90da4c7e
llama : use llm_build_granite for minicpm (#13911) 2025-05-30 10:31:48 +02:00
Christian Kastner
ec9e0301fe
cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890) 2025-05-30 01:28:54 +02:00
Sigbjørn Skjæret
e83ba3e460
llama : add support for jina-reranker-v2 (#13900) 2025-05-29 21:42:31 +02:00
Sigbjørn Skjæret
2b131621e6
gguf-py : add support for sub_type (in arrays) in GGUFWriter add_key_value method (#13561) 2025-05-29 15:36:05 +02:00
Yibo Cai
54a2c7a8cd
arm64: optimize q4_k_q8_k kernel with i8mm (#13886)
This PR improves q4_k_q8_k gemm kernel with arm64 i8mm instruction.

Tested on neoverse-n2 with llama3 8b q4_k_m quantization model.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
2025-05-29 14:39:20 +03:00
Christian Kastner
21fcc21ad5
cmake: Factor out CPU architecture detection (#13883)
* cmake: Define function for querying architecture

The tests and results match exactly those of ggml/src/CMakeLists.txt

* Switch arch detection over to new function
2025-05-29 12:50:25 +02:00
Vineel Abhinav
dd8ba93416
ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882)
* F32-Mamba-Seq_Scan-SVE

* Fix formatting

* ggml : missing space

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-05-29 12:18:43 +03:00
Georgi Gerganov
66c92061f5
tests : remove json.hpp from a test (#13880)
ggml-ci
2025-05-29 12:17:16 +03:00
Sigbjørn Skjæret
5ca82fc1d7
convert : workaround for AutoConfig dummy labels (#13881) 2025-05-29 10:00:57 +02:00
Sigbjørn Skjæret
6385b843a8
llama : add RobertaForSequenceClassification reranker support (#13875) 2025-05-29 08:15:01 +02:00
Vineel Abhinav
1b8fb8152d
ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843)
* F32-Mamba-SVE

* F32-Mamba-SVE

* Resolve test errors-1

* Resolve test errors-2

* F32-vec-SVE

* F32-vec-SVE

* F32-vec-SVE
2025-05-29 09:01:33 +03:00
Beinsezii
53ae30640e
gguf-py : fix SafetensorRemote return on undefined size (< 0) (#13841) 2025-05-28 23:50:20 +02:00
Xuan-Son Nguyen
763d06edb7
llama : fix KV shift for qwen2vl (#13870)
* llama : fix KV shift for qwen2vl

* add ref to the PR
2025-05-28 22:35:31 +02:00