* clip : refactor set input for cgraph
* stricter assert
* minicpmv : use clip_n_mmproj_embd instead of copying the same code everywhere
* split qwen2 and qwen2.5 code blocks
* minor style fix
* SYCL: Add all missing unary kernels
ggml-ci
* decouple kernel launch range from data size using strided loop
* use ceil_div helper for num_blocks
ggml-ci
* clean auto imported header files
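
The strided-loop change above can be pictured with a small host-side sketch. This is not the SYCL kernel from the commit: `neg_strided` and the local `ceil_div` are hypothetical stand-ins, and the loop over `tid` only emulates work-items on the CPU. It shows how a fixed launch size can cover any number of elements when each work-item advances by the total launch size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: smallest number of blocks of size `block`
// needed to cover `n` elements (rounds the division up).
static constexpr std::size_t ceil_div(std::size_t n, std::size_t block) {
    return (n + block - 1) / block;
}

// Host-side emulation of a strided unary kernel (negation used as the example op).
// `n_threads` plays the role of the launch range and is independent of src.size().
static void neg_strided(const std::vector<float> & src,
                        std::vector<float> & dst,
                        std::size_t n_threads) {
    assert(src.size() == dst.size());
    const std::size_t n = src.size();
    for (std::size_t tid = 0; tid < n_threads; ++tid) {     // one iteration per emulated work-item
        for (std::size_t i = tid; i < n; i += n_threads) {  // stride by the total launch size
            dst[i] = -src[i];
        }
    }
}
```

With this shape the launch size can be capped (for example at `ceil_div(n, block_size)` blocks or some device-dependent maximum) without leaving elements unprocessed.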
* Add --override-tensors option to llama-bench
* Correct llama-bench --override-tensors to --override-tensor
* llama-bench: Update --override-tensors parsing to match --tensor-split and make it appear in the test matrix.
* Make new llama-bench util functions static to fix Ubuntu CI
* llama-bench: Correct -ot corner cases (no -ot calls, leading and trailing empty -ot spans, etc.)
* fix wrong template in GLM4-0414
* fix spaces
* no bos token since it is already in the template
* moved the chatgml4 check to higher priority
* restored template for old GLM models
* moved the GLM4 template check in the correct place with correct check
* implement vision model architecture, gguf converter
* handle window attention inputs
* add debug utils
* fix a few incorrect tensor memory layouts
* move position id remap out of ggml to avoid int32 cuda operations
* cleaning up
* ignore transformers Qwen2_5_xxx type check
* remove rarely used `qwen2vl-cli` debug functions
* remove commented-out code blocks
* fix attn weight scaling after rebase
* add `PROJECTOR_TYPE_QWEN2_5_VL`
* remove `KEY_USE_GLU_MLP`, `KEY_USE_RMS_NORM`
* replace `KEY_FULLATTN_BLK_IDX` with `KEY_WIN_ATTN_PATTERN`
* remove `attn_window_size` from gguf
* fix model conversion
* clean up
* fix merging problem
* add test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Force FP32 compute in cuBLAS GEMM
* Revert "Force FP32 compute in cuBLAS GEMM"
This reverts commit 6efd872732159ab88ee7b3c1d77ba5ebc83079bd.
* Force F32 compute in GLM4 ffn down
* Edit comment to clarify issue
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
RPC_CMD_SET_TENSOR always returns an empty response, and we send this
command 4 times per token. We can improve token generation (TG) speed by
not waiting for this empty response.
The performance impact of this change depends on the network latency.
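
As a rough illustration of why the gain depends on latency, the sketch below assumes that each awaited empty response costs one full network round trip. That is a simplifying assumption, not a measurement from this change. The function names and the numbers in the usage note are hypothetical.

```cpp
#include <cassert>

// Hypothetical model: time saved per token if the client no longer waits
// for `set_tensor_calls` empty responses, each costing one round trip.
static constexpr double saved_ms_per_token(double rtt_ms, int set_tensor_calls = 4) {
    return rtt_ms * set_tensor_calls;
}

// Convert a per-token latency in milliseconds to tokens per second.
static constexpr double tokens_per_second(double ms_per_token) {
    return 1000.0 / ms_per_token;
}
```

Under this model, a 0.5 ms round trip saves 2 ms per token. A hypothetical 20 ms token would drop to 18 ms, moving from 50 to roughly 55.6 tokens per second, while on a near-zero-latency link the difference would be negligible.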
* cmake : do not include ./src as public for libllama
ggml-ci
* cmake : rework tests
ggml-ci
* llguidance : remove unicode include
ggml-ci
* cmake : make c++17 private
ggml-ci
* arg : clean up handling --mmproj with -hf
* rm change about no_mmproj
* Revert "rm change about no_mmproj"
This reverts commit 2cac8e0efb629d66c612f137e75d562f94bb9e6c.
* handle no_mmproj explicitly
* skip downloading mmproj in examples that do not use it
* tune matmul for gcn
* this one is more power efficient
* Update ggml/src/ggml-vulkan/ggml-vulkan.cpp
Co-authored-by: 0cc4m <picard12@live.de>
* disable this tune for the proprietary driver
---------
Co-authored-by: 0cc4m <picard12@live.de>
* add pixtral text model (vision is wip)
* cgraph ok, just missing 2D RoPE
* fix bad rebase
* first working version
* fix problem with img_break token
* support dynamic image size
* update docs
* update test script
* mtmd : merge `llava-cli` and `gemma3-cli` into single `mtmd-cli`
* support for minicpmv
* remove cpp files of llava and minicpmv
* update hot topics
* mtmd : add "not supported" message for qwen2vl
* Update examples/llava/mtmd.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>