Georgi Gerganov
81713121ee
kv-cells : track min/max used cells and per-sequence positions (#13808)
...
* kv-cells : track min/max used cells and per-sequence positions
ggml-ci
* kv-cells : fix pos-modification updates for seq_pos
ggml-ci
* kv-cells : add comments
ggml-ci
2025-05-27 13:49:41 +03:00
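
A minimal sketch of the bookkeeping the commit above describes, assuming a simplified cell layout (the real llama.cpp kv-cells code differs in detail): each cell stores its position plus the sequences referencing it, and a per-sequence multiset of positions makes the seq-pos min/max queries cheap. A similar min/max over used cell indices can bound scan/defrag ranges.

    // Hedged sketch, not the actual llama.cpp implementation.
    #include <cassert>
    #include <cstdint>
    #include <map>
    #include <set>
    #include <vector>

    using llama_pos    = int32_t;
    using llama_seq_id = int32_t;

    struct kv_cells_sketch {
        struct cell {
            llama_pos pos = -1;          // -1 -> cell unused
            std::set<llama_seq_id> seq;  // sequences referencing this cell
        };

        std::vector<cell> cells;

        // per-sequence positions; min/max become cheap ordered lookups
        std::map<llama_seq_id, std::multiset<llama_pos>> seq_pos;

        explicit kv_cells_sketch(uint32_t n) : cells(n) {}

        // place a new entry in an unused cell
        void pos_set(uint32_t i, llama_pos pos, llama_seq_id s) {
            assert(cells[i].pos == -1);
            cells[i].pos = pos;
            cells[i].seq = {s};
            seq_pos[s].insert(pos);
        }

        // shift a cell's position (context shift); the per-sequence sets must
        // be updated in the same step - the kind of bookkeeping the
        // "fix pos-modification updates for seq_pos" change is about
        void pos_add(uint32_t i, llama_pos d) {
            cell & c = cells[i];
            for (llama_seq_id s : c.seq) {
                auto & ps = seq_pos[s];
                ps.erase(ps.find(c.pos));
                ps.insert(c.pos + d);
            }
            c.pos += d;
        }

        llama_pos seq_pos_min(llama_seq_id s) const {
            auto it = seq_pos.find(s);
            return (it == seq_pos.end() || it->second.empty()) ? -1 : *it->second.begin();
        }

        llama_pos seq_pos_max(llama_seq_id s) const {
            auto it = seq_pos.find(s);
            return (it == seq_pos.end() || it->second.empty()) ? -1 : *it->second.rbegin();
        }
    };
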
Georgi Gerganov
de2ef53a4b
kv-cache : rework kv_cell (#13706)
...
* kv-cache : rework kv_cell
ggml-ci
* kv-cells : use "shift" instead of "delta" consistently
ggml-ci
* llama : add llama_max_parallel_sequences()
ggml-ci
* kv-cells : update comments [no ci]
* context : fail upon construction if sequences exceed max value
ggml-ci
* kv-cells : get_pos() -> pos_get() + comments
ggml-ci
* kv-cells : fix tracking of "used" cells
ggml-ci
2025-05-25 16:34:36 +03:00
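
Two of the bullets above are easy to illustrate with a hedged sketch: the accessor-naming convention (pos_get() rather than get_pos()), the "shift" terminology, and the hard cap on parallel sequences validated at context construction. The cap value and all names below are assumptions for illustration, not the llama.cpp definitions.

    #include <cstdint>
    #include <stdexcept>

    using llama_pos = int32_t;

    // hypothetical cap; the real constant lives in the llama.cpp internals
    static constexpr uint32_t LLAMA_MAX_SEQ_SKETCH = 64;

    uint32_t llama_max_parallel_sequences_sketch() { return LLAMA_MAX_SEQ_SKETCH; }

    struct kv_cell_sketch {
        llama_pos pos   = -1;
        llama_pos shift =  0; // accumulated position shift ("shift", not "delta")

        llama_pos pos_get() const { return pos; } // noun_verb naming: pos_get()
    };

    struct context_sketch {
        // "context : fail upon construction if sequences exceed max value"
        explicit context_sketch(uint32_t n_seq_max) {
            if (n_seq_max > llama_max_parallel_sequences_sketch()) {
                throw std::runtime_error("n_seq_max exceeds the supported maximum");
            }
        }
    };
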
Georgi Gerganov
797f2ac062
kv-cache : simplify the interface (#13660)
...
* kv-cache : simplify the interface
ggml-ci
* context : revert llama_batch_allocr position change
ggml-ci
2025-05-21 15:11:13 +03:00
Georgi Gerganov
a4090d1174
llama : remove llama_kv_cache_view API + remove deprecated (#13653)
...
ggml-ci
2025-05-20 16:13:16 +03:00
Georgi Gerganov
e298d2fbd0
kv-cache : add SWA support (#13194)
...
* kv-cache : prepare for SWA
ggml-ci
* kv-cache : initial iSWA implementation
ggml-ci
* kv-cache : rework error recovery logic
ggml-ci
* models : fix Phi-3 SWA parameters
ggml-ci
* model : adjust Granite to rope factor changes
ggml-ci
* server : check if context can do shifts
ggml-ci
* iswa : for now, always enable shifts (experiment)
ggml-ci
* kv-cache : simplify SWA logic
ggml-ci
* kv-cache : apply defrag when we fail to find slots for the batch
ggml-ci
* llama : update docs about llama_decode
ggml-ci
* kv-cache : update warning logs when no space for the batch is available
ggml-ci
* llama : add llama_kv_self_seq_pos_min()
* kv-cache : keep track of partial SWA computes and print warnings
* server : disallow use cases involving partial SWA context
ggml-ci
* llama : add param to control SWA cache size
ggml-ci
* minor : clean-up
ggml-ci
2025-05-20 08:05:46 +03:00
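
The core visibility rule behind an SWA (sliding-window attention) cache is small enough to sketch. This assumes the plain "standard" window variant; the actual iSWA cache in the PR above additionally keeps separate buffers for SWA and non-SWA layers, which is what the per-layer SWA cache size parameter controls.

    #include <cstdint>

    using llama_pos = int32_t;

    // Is the cached token at position p masked when attending from position
    // p_cur, given window size n_swa? (n_swa == 0 -> layer is not windowed)
    bool is_masked_swa_sketch(llama_pos p, llama_pos p_cur, uint32_t n_swa) {
        if (n_swa == 0) {
            return false;
        }
        // only the last n_swa positions (p_cur - n_swa, p_cur] are visible
        return p <= p_cur - (llama_pos) n_swa;
    }

The consequence for the cache: once every sequence has advanced past pos + n_swa, the cell holding pos can never be attended again and may be reused - which is also why a query like llama_kv_self_seq_pos_min() above becomes useful, and why partially-evicted SWA contexts need the warnings and server-side restrictions listed in the bullets.
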
Georgi Gerganov
e3a9421b78
kv-cache : fix out-of-bounds view during reserve graph (#13547)
...
* kv-cache : fix reserve graph out-of-bounds access
ggml-ci
* cont : add comment
* cont : fix comments [no ci]
* cont : more correct comment [no ci]
2025-05-14 23:15:15 +03:00
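
The class of bug is easy to state with a sketch: a worst-case "reserve" graph creates views into the KV buffer, and the view length must be clamped to what is actually allocated. Names here are illustrative, not the actual fix.

    #include <algorithm>
    #include <cstdint>

    // number of cells a graph view may cover: never more than the allocated
    // kv_size, even when sizing a worst-case reserve graph
    uint32_t kv_view_n_sketch(uint32_t n_requested, uint32_t kv_size) {
        return std::min(n_requested, kv_size);
    }
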
Georgi Gerganov
c642bc014c
kv-cache : separate recurrent vs non-recurrent impl (#12799)
...
* kv-cache : separate recurrent vs non-recurrent impl (wip)
ggml-ci
* kv-cache : init -> constructor + add llama_memory_params
ggml-ci
* kv-cache : fix callback reference
ggml-ci
* context : llama_kv_cache -> llama_memory_i
ggml-ci
* context : move memory creation logic to model
ggml-ci
* llama : remove reference of memory during encode
ggml-ci
* kv-cache : hide padding details in the implementation
ggml-ci
* kv-cache : add ubatch_next()
ggml-ci
* context : simplify sbatch logic
ggml-ci
* kv-cache : hide defrag logic in the implementation
ggml-ci
* context : hide kv cache details in implementation
ggml-ci
* build : fix
ggml-ci
* cont : another fix
ggml-ci
* kv-cache : simplify interface (wip)
ggml-ci
* kv-cache : use separate KV cell structs for unified/recurrent
ggml-ci
* kv-cache : clean-up
ggml-ci
* model : better llama_model::create_model() signature
ggml-ci
* kv-cache : fix recurrent seq_rm()
ggml-ci
* kv-cache : replace `struct callbacks` with `llama_model &`
ggml-ci
* kv-cache : replace `struct graph_params` with `llama_context &`
ggml-ci
* kv-cache : fix offload check
ggml-ci
* context : avoid passing unique_ptr
ggml-ci
* kv-cache : avoid using the backends from the llama_context
ref #13113
ggml-ci
* kv-cache : more consistent debug logs [no ci]
* kv-cache : do not pass the full llama_context for kv graphs
ggml-ci
* kv-cache : remove comment
* kv-cache : ggml_rope_ext_inplace -> ggml_rope_ext
ggml-ci
* kv-cache : fix recurrent multi-user case
ggml-ci
* memory : remove comments [no ci]
2025-05-02 17:48:36 +03:00
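
The end state of this refactor is easy to caricature in a hedged sketch: a common memory interface with distinct unified (attention) and recurrent (state) implementations behind it. The interface below is heavily simplified; the real llama_memory_i carries a much richer API.

    #include <cstdint>

    using llama_pos    = int32_t;
    using llama_seq_id = int32_t;

    struct llama_memory_sketch_i {
        virtual ~llama_memory_sketch_i() = default;

        virtual void clear() = 0;
        // remove positions [p0, p1) of seq_id; p1 < 0 means "to the end"
        virtual bool seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) = 0;
    };

    // unified cache: a buffer of cells shared by all sequences (attention)
    struct kv_cache_unified_sketch : llama_memory_sketch_i {
        void clear() override { /* mark all cells unused */ }
        bool seq_rm(llama_seq_id, llama_pos, llama_pos) override {
            return true; // any sub-range of a sequence can be dropped
        }
    };

    // recurrent cache: one rolling state per sequence (e.g. Mamba), so a
    // partial removal inside a sequence cannot be represented
    struct kv_cache_recurrent_sketch : llama_memory_sketch_i {
        void clear() override { /* reset per-sequence states */ }
        bool seq_rm(llama_seq_id, llama_pos p0, llama_pos p1) override {
            return p0 <= 0 && p1 < 0; // only whole-sequence removal succeeds
        }
    };
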
Juk Armstrong
daa422881a
llama : DeepSeek V2/V3 MLA implementation (#12801)
...
* Merged using squash to remove all noise commit messages
* Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large
* Removed 3 conts (2x RoPE and 1x RMS-norm)
* Changed to use `<cmath>` instead of `<math.h>`
* Reverted removal of the 3 conts
* Used `reshape` in `llm_graph_context::build_attn_mha()`
* Use `k_pe = ggml_reshape`
* Removed the 3 conts again
* Removed the 3D views of `wk_b` and `wv_b`, and just save them as 3D in GGUF
* Removed MQA optimisation from `build_attn_mha()` as no gains now
* Simplified `is_mla` branch in `llm_build_deepseek2()`
* Removed `build_attn_mla` and added `nullptr` to all `build_attn` calls
* Fixed call to `build_attn` in `llm_build_t5_enc`
2025-04-15 09:49:57 +03:00
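
For context, the MLA (multi-head latent attention) trick referenced above caches a small shared latent per token instead of full per-head K/V, and expands it through the `wk_b`/`wv_b` weights at attention time. The sketch below only fixes the shapes of the idea; the dimensions are DeepSeek V2-style examples, not values read from this PR.

    #include <vector>

    // per-token cache entry under MLA (illustrative sizes in comments)
    struct mla_cache_entry_sketch {
        std::vector<float> kv_latent; // compressed KV, size kv_lora_rank (e.g. 512)
        std::vector<float> k_pe;      // shared RoPE'd key part, size n_rot (e.g. 64)
    };

    // at attention time, per head h:
    //   k[h] = concat(wk_b[h] * kv_latent, k_pe)
    //   v[h] = wv_b[h] * kv_latent
    //
    // so the cache holds ~(kv_lora_rank + n_rot) floats per token instead of
    // 2 * n_head * head_dim - the memory saving that makes MLA attractive
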
Georgi Gerganov
3e1d29348b
kv-cache : simplify + fix warning for recurrent models (#12756)
...
ggml-ci
2025-04-04 21:48:10 +03:00
Georgi Gerganov
a10b36c91a
llama : refactor kv cache guard (#12695)
...
* llama : refactor kv cache guard
ggml-ci
* cont : fix comment [no ci]
* llama : fix kv_cache restore logic
ggml-ci
* context : simplify kv cache updates
ggml-ci
* cont : better name [no ci]
* llama : fix llama_decode return code when a KV slot could not be found
ggml-ci
* context : change log err -> warn [no ci]
* kv-cache : add comment + warning
2025-04-02 14:32:59 +03:00
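
A hedged sketch of the guard pattern this refactor is about: snapshot the cache before a batch, commit on success, roll back in the destructor on failure, so a failed llama_decode leaves the cache as it found it. Types and members here are invented for illustration.

    struct kv_cache_state_sketch { /* snapshot of head/used counters, etc. */ };

    struct kv_cache_sketch {
        kv_cache_state_sketch snapshot() const { return {}; }
        void restore(const kv_cache_state_sketch &) { /* undo pending cell writes */ }
        void commit() { /* make pending cells permanent */ }
    };

    struct kv_guard_sketch {
        kv_cache_sketch & kv;
        kv_cache_state_sketch saved;
        bool committed = false;

        explicit kv_guard_sketch(kv_cache_sketch & kv) : kv(kv), saved(kv.snapshot()) {}

        void commit() { kv.commit(); committed = true; }

        ~kv_guard_sketch() {
            if (!committed) {
                kv.restore(saved); // decode failed mid-batch: roll back
            }
        }
    };

Call sites then read "construct guard, find slot, compute, commit()", and every error path (including "could not find KV slot", which the last bullet maps to a distinct llama_decode return code) needs no manual cleanup.
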
Georgi Gerganov
e0dbec0bc6
llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181)
...
* llama : refactor llama_context, llama_kv_cache, llm_build_context
ggml-ci
* graph : don't mutate the KV cache during defrag
ggml-ci
* context : reduce virtuals + remove test function
ggml-ci
* context : move interface implementation to source file + factory
ggml-ci
* graph : move KV cache build functions to llama_context impl
ggml-ci
* graph : remove model reference from build_pooling
ggml-ci
* graph : remove llama_model reference
ggml-ci
* kv_cache : provide rope factors
ggml-ci
* graph : rework inputs to use only unique_ptr, remove attn input abstraction
ggml-ci
* context : remove llama_context_i abstraction
ggml-ci
* context : clean-up
ggml-ci
* graph : clean-up
ggml-ci
* llama : remove redundant keywords (struct, enum)
ggml-ci
* model : adapt gemma3
ggml-ci
* graph : restore same attention ops as on master
ggml-ci
* llama : remove TODO + fix indent
ggml-ci
2025-03-13 12:35:44 +02:00
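
The "inputs as unique_ptr" bullet is the most code-shaped change in this series; below is a condensed sketch of the ownership pattern, with the interface simplified relative to the real llm_graph_input_i.

    #include <memory>
    #include <utility>
    #include <vector>

    struct graph_input_sketch_i {
        virtual ~graph_input_sketch_i() = default;
        virtual void set_input(/* const llama_ubatch & */) = 0;
    };

    struct graph_result_sketch {
        std::vector<std::unique_ptr<graph_input_sketch_i>> inputs;

        // the build step registers inputs and keeps non-owning handles
        template <typename T, typename... Args>
        T * add_input(Args &&... args) {
            inputs.push_back(std::make_unique<T>(std::forward<Args>(args)...));
            return static_cast<T *>(inputs.back().get());
        }

        // before evaluation, every input uploads its data in one pass
        void set_inputs() {
            for (auto & inp : inputs) {
                inp->set_input();
            }
        }
    };
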
Georgi Gerganov
afa8a9ec9b
llama : add llama_vocab, functions -> methods, naming (#11110)
...
* llama : functions -> methods (#11110)
* llama : add struct llama_vocab to the API (#11156)
ggml-ci
* hparams : move vocab params to llama_vocab (#11159)
ggml-ci
* vocab : more pimpl (#11165)
ggml-ci
* vocab : minor tokenization optimizations (#11160)
ggml-ci
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* lora : update API names (#11167)
ggml-ci
* llama : update API names to use correct prefix (#11174)
* llama : update API names to use correct prefix
ggml-ci
* cont
ggml-ci
* cont
ggml-ci
* minor [no ci]
* vocab : llama_vocab_add_[be]os -> llama_vocab_get_add_[be]os (#11174)
ggml-ci
* vocab : llama_vocab_n_vocab -> llama_vocab_n_tokens (#11174)
ggml-ci
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-01-12 11:32:42 +02:00
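
A short usage sketch of the API after the renames above; the calls match the public llama.h naming introduced in this series, though treat the snippet as illustrative rather than authoritative.

    #include "llama.h"

    #include <cstdio>

    static void print_vocab_info(const llama_model * model) {
        const llama_vocab * vocab = llama_model_get_vocab(model);

        // llama_vocab_n_vocab -> llama_vocab_n_tokens
        printf("n_tokens : %d\n", llama_vocab_n_tokens(vocab));

        // llama_vocab_add_bos -> llama_vocab_get_add_bos
        printf("add_bos  : %d\n", (int) llama_vocab_get_add_bos(vocab));
        printf("bos      : %d\n", llama_vocab_bos(vocab));
        printf("eos      : %d\n", llama_vocab_eos(vocab));
    }
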
Daniel Bevenius
6369f867a4
llama : rename missed batch params/vars to ubatch (#10059)
...
This commit renames the `batch` parameter to `ubatch` in the
`llama_kv_cache_find_slot`, `llm_build_inp_embd`, and
`llm_build_mamba` functions.
The motivation for this is that this should have been done as part of
Commit 19d900a756 ("llama : rename batch to ubatch (#9950)") but for
some reason I missed these functions in that commit and only noticed
them now (sorry).
2025-01-06 11:28:17 +02:00
Georgi Gerganov
f66f582927
llama : refactor src/llama.cpp (#10902)
...
* llama : scatter llama.cpp into multiple modules (wip)
* llama : control-vector -> adapter
* llama : arch
* llama : mmap
ggml-ci
* ci : remove BUILD_SHARED_LIBS=OFF
ggml-ci
* llama : arch (cont)
ggml-ci
* llama : chat
ggml-ci
* llama : model
ggml-ci
* llama : hparams
ggml-ci
* llama : adapter
ggml-ci
* examples : fix
ggml-ci
* rebase
ggml-ci
* minor
* llama : kv cache
ggml-ci
* llama : impl
ggml-ci
* llama : batch
ggml-ci
* cont
ggml-ci
* llama : context
ggml-ci
* minor
* llama : context (cont)
ggml-ci
* llama : model loader
ggml-ci
* common : update lora
ggml-ci
* llama : quant
ggml-ci
* llama : quant (cont)
ggml-ci
* minor [no ci]
2025-01-03 10:18:53 +02:00
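
For orientation, the split sketched in the bullets above landed as roughly the following modules under src/; the list is abbreviated and from memory, so consult the tree at this commit for the authoritative set.

    src/
      llama-adapter.cpp       // control-vector/lora -> adapter
      llama-arch.cpp
      llama-batch.cpp
      llama-chat.cpp
      llama-context.cpp
      llama-hparams.cpp
      llama-impl.cpp
      llama-kv-cache.cpp
      llama-mmap.cpp
      llama-model-loader.cpp
      llama-model.cpp
      llama-quant.cpp
      llama-vocab.cpp
      llama.cpp               // top-level API entry points
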