Georgi Gerganov
201b31dc2e
graph : fix geglu ( #14077 )
ggml-ci
2025-06-09 17:17:31 +03:00
Đinh Trọng Huy
91a8ee6a6f
add geglu activation function ( #14074 )
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-06-09 05:15:31 +01:00
Sigbjørn Skjæret
0974ad7a7c
llama : fix llama_model_chat_template with template name (LLM_KV with suffix) ( #14050 )
2025-06-07 14:13:12 +02:00
Georgi Gerganov
745aa5319b
llama : deprecate llama_kv_self_ API ( #14030 )
* llama : deprecate llama_kv_self_ API
ggml-ci
* llama : allow llama_memory_(nullptr)
ggml-ci
* memory : add flag for optional data clear in llama_memory_clear
ggml-ci
2025-06-06 14:11:15 +03:00
Georgi Gerganov
487a5e0401
context : fix SWA-related warning for multiple sequences ( #14045 )
2025-06-06 13:29:18 +03:00
Sigbjørn Skjæret
d17a809ef0
llama : support multiple classifier outputs and labels ( #13940 )
2025-06-06 09:03:25 +02:00
Georgi Gerganov
7f37b6cf1e
memory : migrate from llama_kv_cache to more generic llama_memory ( #14006 )
* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API
ggml-ci
* context : fix casts
ggml-ci
2025-06-05 15:29:22 +03:00
Diego Devesa
3a077146a4
llama : allow using mmap without PrefetchVirtualMemory, apply GGML_WIN_VER to llama.cpp sources ( #14013 )
2025-06-05 11:57:42 +02:00
Sigbjørn Skjæret
9f47fa5792
vocab : warn about missing mask token ( #14022 )
2025-06-05 09:29:18 +02:00
Georgi Gerganov
9e31bec4fd
context : fix pos_min initialization upon error decode ( #14008 )
ggml-ci
2025-06-05 09:06:29 +03:00
Georgi Gerganov
3e63a58ef7
kv-cache : refactor the update/defrag mechanism ( #13988 )
* kv-cache : refactor update mechanism
ggml-ci
* memory : improve status handling
* defrag : reset head + add comments
ggml-ci
* cont : minor fixes
ggml-ci
2025-06-04 18:58:20 +03:00
Xuan-Son Nguyen
3ac67535c8
llama-graph : use ggml_repeat_4d ( #13998 )
2025-06-04 10:11:26 +02:00
Georgi Gerganov
e0e806f52e
kv-cache : fix unified::seq_rm to work with seq_id < 0 ( #13985 )
ggml-ci
2025-06-04 09:50:32 +03:00
Georgi Gerganov
5582c49c39
gemma : more consistent attention scaling for v2 and v3 ( #13951 )
* gemma : fix attn scale for 27B
* cont : apply scale before attn
* cont : consistent attention scaling
2025-06-02 20:54:26 +03:00
Sigbjørn Skjæret
5e1c3aed40
convert : fix nomic-bert-moe mask token ( #13757 )
2025-06-01 18:07:21 +02:00
Georgi Gerganov
0fc16b42e8
kv-cache : split implementation in separate sources ( #13920 )
ggml-ci
2025-06-01 11:39:27 +03:00
Georgi Gerganov
803f8baf4f
llama : deprecate explicit kv_self defrag/update calls ( #13921 )
ggml-ci
2025-05-31 15:58:33 +03:00
Georgi Gerganov
3600cc2886
llama : use n_swa + n_ubatch cells for SWA cache ( #13833 )
* llama : use n_swa + n_ubatch cells for SWA cache
ggml-ci
* llama : add warning about multi-sequence SWA contexts
2025-05-31 15:57:44 +03:00
Georgi Gerganov
3f55f781f1
llama : auto-batch preparation ( #13845 )
* llama : auto-batch
ggml-ci
* context : simplify if branching
2025-05-31 12:55:57 +03:00
Georgi Gerganov
12d0188c0d
kv-cache : refactor + add llama_memory_state_i ( #13746 )
* kv-cache : simplify the "struct llama_kv_cache" interface
ggml-ci
* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)
ggml-ci
* kv-cache : some comments
ggml-ci
* context : fix graph reserve for multiple sequences
ggml-ci
* kv-cache : fix typo [no ci]
* kv-cache : fix find_slot() logic for free slots
ggml-ci
* llama : add TODO for deprecating the defrag API in the future
* kv-cache : improve find_slot() using min/max seq pos info
ggml-ci
* llama : handle aborts and compute errors
ggml-ci
* memory : extract state into llama_memory_state
ggml-ci
* kv-cache : add comments
ggml-ci
* server : update batching logic to reset n_batch on successful decode
* server : upon full re-processing, remove the sequence from the cache
* kv-cache : add TODO for doing split_equal when split_simple fails
ggml-ci
2025-05-31 10:24:04 +03:00
Đinh Trọng Huy
291f2b6913
llama : add support for DistilBert ( #13907 )
* add distilbert
* small fixes
* add note for LLM_ARCH_DISTIL_BERT
* Use MODEL_ARCH.BERT for DistilBert
---------
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-30 11:56:02 +02:00
zhangkaihuo
2c90da4c7e
llama : use llm_build_granite for minicpm ( #13911 )
2025-05-30 10:31:48 +02:00
Sigbjørn Skjæret
e83ba3e460
llama : add support for jina-reranker-v2 ( #13900 )
2025-05-29 21:42:31 +02:00
Sigbjørn Skjæret
6385b843a8
llama : add RobertaForSequenceClassification reranker support ( #13875 )
2025-05-29 08:15:01 +02:00
Xuan-Son Nguyen
763d06edb7
llama : fix KV shift for qwen2vl ( #13870 )
* llama : fix KV shift for qwen2vl
* add ref to the PR
2025-05-28 22:35:31 +02:00
Đinh Trọng Huy
e0e3aa231d
llama : add support for BertForSequenceClassification reranker ( #13858 )
* convert: add support for BertForSequenceClassification
* add support for reranking using BertForSequenceClassification
* merge checks of eos and sep
* fix lint
---------
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-28 19:01:58 +02:00
Georgi Gerganov
34b7c0439e
cmake : add llama-cparams.cpp to build ( #13832 )
2025-05-27 19:08:44 +03:00
Georgi Gerganov
81713121ee
kv-cells : track min/max used cells and per-sequence positions ( #13808 )
* kv-cells : track min/max used cells and per-sequence positions
ggml-ci
* kv-cells : fix pos-modification updates for seq_pos
ggml-ci
* kv-cells : add comments
ggml-ci
2025-05-27 13:49:41 +03:00
Georgi Gerganov
f9cd68398b
sampling : make sure samplers return at least 1 token ( #13822 )
* sampling : min-p should always return at least one token
ggml-ci
* sampling : same for typical sampling
* tests : sampling tests use min_keep == 0
ggml-ci
2025-05-27 12:07:52 +03:00
Georgi Gerganov
4f81b33e32
llama : validate seq id batch input ( #13809 )
* llama : validate seq id batch input
ggml-ci
* cont : fix the fix
ggml-ci
2025-05-27 09:40:59 +03:00
Georgi Gerganov
79c137f776
examples : allow extracting embeddings from decoder contexts ( #13797 )
ggml-ci
2025-05-26 14:03:54 +03:00
Georgi Gerganov
de2ef53a4b
kv-cache : rework kv_cell ( #13706 )
* kv-cache : rework kv_cell
ggml-ci
* kv-cells : use "shift" instead of "delta" consistently
ggml-ci
* llama : add llama_max_parallel_sequences()
ggml-ci
* kv-cells : update comments [no ci]
* context : fail upon construction if sequences exceed max value
ggml-ci
* kv-cells : get_pos() -> pos_get() + comments
ggml-ci
* kv-cells : fix tracking of "used" cells
ggml-ci
2025-05-25 16:34:36 +03:00
Piotr Jasiukajtis
4032ca4066
llama : add support for Qwen3 MoE tied word embeddings ( #13768 )
2025-05-25 10:29:43 +02:00
Olivier Chafik
f5cd27b71d
server : streaming of tool calls and thoughts when --jinja is on ( #12379 )
* add common_json w/ support for truncated json healing
* add common_chat_msg_diff
* partial common_chat_parse
* refactor parser w/ optionals
* server: wire chat diffs in stream mode
* fix trigger of thinking models (must happen after thoughts are closed)
* fix functionary v3.2 raw python!
* rename: common_chat_syntax (now contains format)
* rm common_regex.at_start
* don't return empty <think></think>
* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)
* fix QwQ 32B tool call parsing after thoughts (hermes2)
* better logs for grammar triggers
* consume spaces after parse_json_tool_calls
* fix required tool calls w/ thinking models that have pre-opened thinking tags
* fix thinking model's initial trigger + test qwq's template
* run most test_tool_call tests in stream + non-stream modes
* make functionary v3.2 parsing more strict (differentiate first match from others)
* send final diff from server, to close off raw python arguments
* support partial content streaming in Generic mode
* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)
* Update function-calling.md
* Update tool_bench.py
* chat-parser: remove input from exception (llm output may contain PII)
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>
2025-05-25 01:48:08 +01:00
0cc4m
259469c4b5
Move GLM4 f32 attention fix to the correct function ( #13750 )
2025-05-24 16:49:12 +02:00
Sigbjørn Skjæret
c3a2624339
vocab : fix ugm tokenizer precision ( #13743 )
2025-05-24 12:29:09 +02:00
Georgi Gerganov
d13d0f6135
hparams : initialize arrays ( #13728 )
ggml-ci
2025-05-23 20:16:13 +03:00
Xuan-Son Nguyen
8a2afb7520
llama : allow custom list of swa_layers ( #13726 )
2025-05-23 17:07:04 +02:00
Georgi Gerganov
8a1d206f1d
tts : fix n_ubatch + make WavTokenizer cache-less ( #13713 )
ggml-ci
2025-05-22 22:21:07 +03:00
Georgi Gerganov
8e186ef0e7
hparams : support models for which all layers use SWA ( #13682 )
ggml-ci
2025-05-21 20:00:49 +03:00
Georgi Gerganov
797f2ac062
kv-cache : simplify the interface ( #13660 )
* kv-cache : simplify the interface
ggml-ci
* context : revert llama_batch_allocr position change
ggml-ci
2025-05-21 15:11:13 +03:00
Georgi Gerganov
b44890df2e
model : disable SWA for Phi models ( #13676 )
* model : disable SWA for Phi models
ggml-ci
* model : update warning message
* model : print warning only if n_swa > 0
* model : fix typo
2025-05-21 13:09:21 +03:00
Georgi Gerganov
be0239693c
model : fix llama4 graph ( #13663 )
ggml-ci
2025-05-20 19:21:04 +03:00
Georgi Gerganov
a4090d1174
llama : remove llama_kv_cache_view API + remove deprecated ( #13653 )
ggml-ci
2025-05-20 16:13:16 +03:00
0cc4m
c9c64dee57
Set GLM4 blk.*.attn_output.weight, kqv_out-* matmul to GGML_PREC_F32 to fix infinity values in output ( #13639 )
2025-05-20 10:11:56 +02:00
Georgi Gerganov
e298d2fbd0
kv-cache : add SWA support ( #13194 )
* kv-cache : prepare for SWA
ggml-ci
* kv-cache : initial iSWA implementation
ggml-ci
* kv-cache : rework error recovery logic
ggml-ci
* models : fix Phi-3 SWA parameters
ggml-ci
* model : adjust Granite to rope factor changes
ggml-ci
* server : check if context can do shifts
ggml-ci
* iswa : for now, always enable shifts (experiment)
ggml-ci
* kv-cache : simplify SWA logic
ggml-ci
* kv-cache : apply defrag when we fail to find slots for the batch
ggml-ci
* llama : update docs about llama_decode
ggml-ci
* kv-cache : update warning logs when no space for the batch is available
ggml-ci
* llama : add llama_kv_self_seq_pos_min()
* kv-cache : keep track of partial SWA computes and print warnings
* server : disallow use cases involving partial SWA context
ggml-ci
* llama : add param to control SWA cache size
ggml-ci
* minor : clean-up
ggml-ci
2025-05-20 08:05:46 +03:00
Diego Devesa
5364ae4ba5
llama : print hint when loading a model when no backends are loaded ( #13589 )
2025-05-16 16:38:07 +02:00
Diego Devesa
c6a2c9e741
gguf : use ggml log system ( #13571 )
* gguf : use ggml log system
* llama : remove unnecessary new lines in exception messages
2025-05-15 19:13:11 +02:00
Georgi Gerganov
e3a9421b78
kv-cache : fix out-of-bounds view during reserve graph ( #13547 )
* kv-cache : fix reserve graph out-of-bounds access
ggml-ci
* cont : add comment
* cont : fix comments [no ci]
* cont : more correct comment [no ci]
2025-05-14 23:15:15 +03:00
Sigbjørn Skjæret
f5170c1d7a
editorconfig : fix trailing whitespace from #13542 ( #13546 )
2025-05-14 21:22:49 +03:00