convert : converting mmproj for Qwen2/2.5VL from convert_hf_to_gguf (#13209)

* wip

* qwen2.5vl ok

* vision: fix models missing "text_config"

* add test

* fix test repo name

* fix 32B model

* Revert "fix 32B model"

This reverts commit 651752f1ae25fe8a01c1e57c18cf2eca80b2774e.

* clarify about 32B

* rm qwen surgery script

* update llava/readme

* move V_ENC_EMBD_PATCH handling to Qwen2VLVisionModel
This commit is contained in:
Xuan-Son Nguyen 2025-05-02 17:17:15 +02:00 committed by GitHub
parent c642bc014c
commit 074e42ab31
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
7 changed files with 132 additions and 233 deletions

View file

@ -35,6 +35,16 @@ llama-mtmd-cli -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF
# Pixtral 12B
llama-mtmd-cli -hf ggml-org/pixtral-12b-GGUF
# Qwen 2 VL
llama-mtmd-cli -hf ggml-org/Qwen2-VL-2B-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/Qwen2-VL-7B-Instruct-GGUF
# Qwen 2.5 VL
llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-32B-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-72B-Instruct-GGUF
# Mistral Small 3.1 24B (IQ2_M quantization)
llama-mtmd-cli -hf ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF --chat-template mistral-v7
```
@ -60,7 +70,17 @@ Built upon `clip.cpp` (similar to `llava.cpp`), `libmtmd` offers several advanta
## How to obtain `mmproj`
Multimodal projector (`mmproj`) files are specific to each model architecture. Please refer to the relevant guide for instructions on how to obtain or create them:
Multimodal projector (`mmproj`) files are specific to each model architecture.
For the following models, you can use `convert_hf_to_gguf.py`with `--mmproj` flag to get the `mmproj` file:
- [Gemma 3](https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d) - Note: 1B variant does not have vision support
- SmolVLM (from [HuggingFaceTB](https://huggingface.co/HuggingFaceTB))
- SmolVLM2 (from [HuggingFaceTB](https://huggingface.co/HuggingFaceTB))
- [Pixtral 12B](https://huggingface.co/mistral-community/pixtral-12b) - only works with `transformers`-compatible checkpoint
- Qwen 2 VL and Qwen 2.5 VL (from [Qwen](https://huggingface.co/Qwen))
- [Mistral Small 3.1 24B](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)
For older models, please refer to the relevant guide for instructions on how to obtain or create them:
- [LLaVA](../../docs/multimodal/llava.md)
- [MobileVLM](../../docs/multimodal/MobileVLM.md)
@ -70,10 +90,3 @@ Multimodal projector (`mmproj`) files are specific to each model architecture. P
- [MiniCPM-o 2.6](../../docs/multimodal/minicpmo2.6.md)
- [IBM Granite Vision](../../docs/multimodal/granitevision.md)
- [Google Gemma 3](../../docs/multimodal/gemma3.md)
For the following models, you can use `convert_hf_to_gguf.py`with `--mmproj` flag to get the `mmproj` file:
- [Gemma 3](https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d) - Note: 1B variant does not have vision support
- SmolVLM (from [HuggingFaceTB](https://huggingface.co/HuggingFaceTB))
- SmolVLM2 (from [HuggingFaceTB](https://huggingface.co/HuggingFaceTB))
- [Pixtral 12B](https://huggingface.co/mistral-community/pixtral-12b) - only works with `transformers`-compatible checkpoint
- [Mistral Small 3.1 24B](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)