* mtmd : allow multiple modalities at the same time
* refactor mtmd tokenizer
* fix compile
* ok, missing SinusoidsPositionEmbedding (see the sketch after this list)
* first working version
* fix style
* stricter validation of n_embd
* refactor if..else to switch
* fix regression
* add test for 3B
* update docs
* fix tokenizing with add_special
* add more tests
* fix test case "huge"
* rm redundant code
* set_position_mrope_1d rm n_tokens
* convert ok, load ok
* warmup ok
* test
* still does not work?
* fix padding
* temporarily give up
* fix merge conflict
* build_ultravox()
* rm test
* fix merge conflict
* add necessary mtmd APIs
* first working version (only 4s of audio)
* will this monster compile?
* fix compile
* please compile
* fPIC
* fix windows
* various fixes
* clean up audio_helpers
* fix conversion
* add some debug stuff
* long audio input ok
* adapt the api
* add --audio arg
* final UX touches
* add miniaudio to readme
* fix typo
* refactor kv metadata
* mtmd_default_marker()
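Referenced from the SinusoidsPositionEmbedding note above: a minimal NumPy sketch of the conventional Whisper-style sinusoidal position embedding (first half sine, second half cosine over log-spaced timescales). It illustrates the formula only and is not the encoder code from this change.

```python
import numpy as np

def sinusoids_position_embedding(length: int, channels: int,
                                 max_timescale: float = 10000.0) -> np.ndarray:
    # Whisper-style layout: [sin | cos] over channels // 2 log-spaced timescales.
    assert channels % 2 == 0
    log_timescale_increment = np.log(max_timescale) / (channels // 2 - 1)
    inv_timescales = np.exp(-log_timescale_increment * np.arange(channels // 2))
    scaled_time = np.arange(length)[:, None] * inv_timescales[None, :]
    return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)
```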
The bug caused a crash upon load with venvs created with
--system-site-packages in order to use
python3-pyside6.qtwidgets=6.6.2-4
from Kubuntu 24.10.
* feat: Add GGUF conversion for granitemoeshared
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: hparam and arch plumbing for granitemoeshared
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Split MoE fused tensors for shared experts in conversion
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: First WIP cut at model arch in cpp
The hparam and architecture plumbing should be correct, but the
implementation of the shared experts seems to still be broken.
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Cleaner (maybe more correct?) splitting for gate/up
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix the input to the shared experts
I had misread the architecture: the shared experts take their input from _before_
the standard MoE layer, but I was feeding them the output of the MoE instead
(see the sketch below).
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
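To make the fix concrete, a small sketch of the intended data flow (my own illustration, not the actual ggml graph): the shared experts receive the same normalized hidden state as the routed MoE experts, and their output is summed with the routed output.

```python
# Hedged sketch of a MoE block with shared experts; moe(), shared_expert(),
# and norm() are illustrative callables, not llama.cpp functions.
def moe_block_with_shared_experts(h, norm, moe, shared_expert):
    x = norm(h)                   # the same input feeds both paths
    routed = moe(x)               # top-k routed experts
    shared = shared_expert(x)     # shared experts see x, NOT moe(x)
    return h + routed + shared    # residual add of both contributions
```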
* fix: Avoid architecture-specific checks for Granite MoE Shared
This is a cleaner way that will allow more flexibility in architecture
strings going forward.
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Split granite architectures out of llm_build_llama
This helps de-clutter the llama-family graph construction and allows
granite to diverge further (in preparation for Granite 4).
NOTE: I removed the granite scale factors from llm_build_deci because they
appear to only be there as copy-paste from llm_build_llama. The HF config
does not seem to set those values:
https://huggingface.co/Deci/DeciLM-7B/blob/main/config.json
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
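For orientation, a hedged sketch of where the Granite-specific scale factors sit in the forward pass (the very factors removed from `llm_build_deci`). The `*_scale` names are illustrative stand-ins for Granite's scaling hparams, not exact field names.

```python
# Illustrative only: a Granite-style decoder forward with its extra scales.
def granite_forward(tokens, embed, blocks, final_norm, lm_head,
                    embedding_scale=1.0, residual_scale=1.0, logit_scale=1.0):
    h = embedding_scale * embed(tokens)            # scale token embeddings
    for attn, ffn, norm1, norm2 in blocks:
        h = h + residual_scale * attn(norm1(h))    # scaled attention residual
        h = h + residual_scale * ffn(norm2(h))     # scaled FFN residual
    return lm_head(final_norm(h)) / logit_scale    # scale down the logits
```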
* fix: Fix compiler warning about uninitialized inp_pos
This should not have been reachable, but it triggers a warning on some compilers
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Consolidate GraniteMoEShared into GraniteMoE for conversion
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Consolidate GraniteMoEShared into GraniteMoE on the C++ side
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* convert : internvl support
* InternVL3-1B working
* fix regression
* rm mobilevlm from test
* fix conversion
* add test for internvl
* add to list of pre-quant
* restore boi/eoi check
* add clarifying comment for norm eps
- gguf-py : remove gguf-py/gguf/scripts/__init__.py because it's not needed
Implicit namespaces are supported since Python 3.3 (https://peps.python.org/pep-0420/),
and the entrypoints in pyproject.toml can directly refer to the main functions.
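A quick illustration of why the `__init__.py` is unnecessary (the module and entry-point names below follow gguf-py's layout but should be treated as examples):

```python
# PEP 420 implicit namespace packages: gguf/scripts is importable without
# an __init__.py as long as it sits on the package path.
from gguf.scripts import gguf_dump  # example module name

# pyproject.toml can then point a console script straight at the function:
#   [project.scripts]
#   gguf-dump = "gguf.scripts.gguf_dump:main"
```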
* Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture
- Adds MoE-based embedding model supporting multilingual embeddings.
- Selects architecture variant based on hyperparameter detection (MoE layers).
- Removes unnecessary subclass initialization checks for clarity.
https://www.nomic.ai/blog/posts/nomic-embed-text-v2
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
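As a rough sketch of the "architecture variant based on hyperparameter detection" idea (the hparam key and arch names are assumptions for illustration, not necessarily the exact ones used by the converter):

```python
def pick_nomic_arch(hparams: dict) -> str:
    # If the config declares MoE layers, emit the MoE variant of the arch.
    if hparams.get("moe_every_n_layers", 0) > 0:
        return "nomic-bert-moe"
    return "nomic-bert"
```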
* fix tokenizer
* don't rename this tensor
---------
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
* add pixtral text model (vision is wip)
* cgraph ok, just missing 2D RoPE
* fix bad rebase
* first working version
* fix problem with img_break token
* support dynamic image size
* update docs
* update test script
* Merged using squash to remove all noise commit messages
* Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large
* Removed 3 conts (2x RoPE and 1x RMS-norm)
* Changed to use `<cmath>` instead of `<math.h>`
* Reverted removal of the 3 conts
* Used `reshape` in `llm_graph_context::build_attn_mha()`
* Use `k_pe = ggml_reshape`
* Removed the 3 conts again
* Removed the 3D views of `wk_b` and `wv_b`, and just save them as 3D in the GGUF
* Removed MQA optimisation from `build_attn_mha()` as no gains now
* Simplified `is_mla` branch in `llm_build_deepseek2()`
* Removed `build_attn_mla` and added `nullptr` to all `build_attn` calls
* Fixed call to `build_attn` in `llm_build_t5_enc`
* gguf-py : support lazy tensor splitting
Splitting usually involves returning tuples of tensors,
which need to be handled properly to avoid early eager evaluation.
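A minimal sketch of the idea, not gguf-py's actual lazy-tensor machinery: split must hand back lazy wrappers so that nothing is materialized until a consumer explicitly asks for the data.

```python
from typing import Callable, Tuple
import numpy as np

class LazyTensor:
    """Defers computing an array until materialize() is called."""
    def __init__(self, thunk: Callable[[], np.ndarray]):
        self._thunk = thunk

    def materialize(self) -> np.ndarray:
        return self._thunk()

    def split(self, parts: int, axis: int = 0) -> Tuple["LazyTensor", ...]:
        # Return a tuple of lazy pieces; no computation happens here.
        # (Each piece re-splits its parent on materialize and assumes an
        #  evenly divisible axis; fine for a sketch.)
        return tuple(
            LazyTensor(lambda i=i: np.split(self.materialize(), parts, axis=axis)[i])
            for i in range(parts)
        )
```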
* gguf-py : fix flake8 lint
* add edgellm model arch [conversation feature doesn't work]
* remove output.weight layer for edgellm arch
* [Model] update the name of the model
* update the name of model arch in convert gguf
* [Model] Refactor the model arch into llama-model
* [Bug] Fix the bug in create attn kv
* [Code] Fix editorconfig errors
* [Code] Remove Trailing whitespace
* [Code] Remove Trailing whitespace
* [Code] Change the order of model arch in list
* [Code] Fix flake8 Lint errors
* Remove trailing white space
* [Code] Remove call in model arch
* Refactor gguf scripts to improve metadata handling
Added contents method to ReaderField class
Added endianess property to GGUFReader class
* update scripts
* fix import
* remove unused import
* attempt to work around flake and pyright errors
* second attempt
* give up, ignore type
* bump version
* apply newbyteorder fixes
Currently self.byte_order is never used.
Actually use it to byteswap read data, to
allow reading big-endian files on little-endian systems
and vice versa.
Now it's possible to convert a little-endian model
into a big-endian model and back
on a little-endian system.
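A small NumPy sketch of the byteswap-on-read idea (illustrative; the gguf-py code paths differ in detail):

```python
import sys
import numpy as np

def to_host_byte_order(arr: np.ndarray, file_is_big_endian: bool) -> np.ndarray:
    # Swap bytes only when the file's endianness differs from the host's,
    # then relabel the dtype so downstream code sees native-order values.
    if (sys.byteorder == "big") != file_is_big_endian:
        return arr.byteswap().view(arr.dtype.newbyteorder())
    return arr
```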
* add glm edge chat model
* use config partial_rotary_factor as rope ratio
* support for glm edge model
* vision model support
* remove debug info
* fix format
* llava.cpp trailing whitespace
* remove unused AutoTokenizer
* Update src/llama.cpp for the case where <|end|> or </s> is not present
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* add edge template
* fix chat template
* fix conflict
* fix conflict
* fix ci err
* fix format err
* fix template err
* 9b hf chat support
* format
* format clip.cpp
* fix format
* Apply suggestions from code review
* Apply suggestions from code review
* Update examples/llava/clip.cpp
* fix format
* minor : style
---------
Co-authored-by: liyuhang <yuhang.li@zhipuai.cn>
Co-authored-by: piDack <pcdack@hotmail.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: liyuhang <yuhang.li@aminer.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Moved scripts dir and fixed pyproject.toml
* updated readme
* fixed README urls
* bump pypi gguf to v0.14.0
* retrigger ci
* empty commit - trigger ci
* convert : extend DEEPSEEK2 model architecture to support DeepseekV3ForCausalLM by adding EXPERT_WEIGHTS_NORM and EXPERT_GATING_FUNC model parameters and FFN_EXP_PROBS_B tensor type
* vocab : add DeepSeek V3 pre-tokenizer regexes
* unicode : handle ACCENT_MARK and SYMBOL categories in regex
* llama : add DeepSeek V3 chat template, handle new model parameters and tensor types
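For context on the new parameters: DeepSeek V3 switches the expert gating function from softmax to sigmoid, adds a per-expert selection bias (the FFN_EXP_PROBS_B tensor), and normalizes the selected expert weights. A hedged per-token sketch written from the model description, not from the llama.cpp graph code:

```python
import numpy as np

def deepseek_v3_routing(router_logits: np.ndarray, expert_bias: np.ndarray,
                        top_k: int, scale: float = 1.0):
    scores = 1.0 / (1.0 + np.exp(-router_logits))      # sigmoid gating
    # The bias only influences which experts get selected...
    selected = np.argsort(scores + expert_bias)[::-1][:top_k]
    # ...while the gating weights come from the unbiased scores,
    # normalized over the selected experts (expert-weights norm).
    weights = scores[selected]
    weights = scale * weights / weights.sum()
    return selected, weights
```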
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>