Alberto Cabrera Pérez
725f23f1f3
sycl : backend documentation review ( #13544 )
...
* sycl: reviewing and updating docs
* Updates Runtime error codes
* Improves OOM troubleshooting entry
* Added a llama 3 sample
* Updated supported models
* Updated releases table
2025-05-19 14:38:20 +01:00
Xuan-Son Nguyen
92ecdcc06a
mtmd : add vision support for llama 4 ( #13282 )
...
* wip llama 4 conversion
* rm redundant __init__
* fix conversion
* fix conversion
* test impl
* try this
* reshape patch_embeddings_0
* fix view
* rm ffn_post_norm
* cgraph ok
* f32 for pos embd
* add image marker tokens
* Llama4UnfoldConvolution
* correct pixel shuffle
* fix merge conflicts
* correct
* add debug_graph
* logits matched, but it still perceives the image incorrectly
* fix style
* add image_grid_pinpoints
* handle llama 4 preprocessing
* rm load_image_size
* rm unused line
* fix
* small fix 2
* add test & docs
* fix llava-1.6 test
* test: add notion of huge models
* add comment
* add warn about degraded quality
2025-05-19 13:04:14 +02:00
Alberto Cabrera Pérez
f71f40a284
ci : upgraded oneAPI version in SYCL workflows and dockerfile ( #13532 )
2025-05-19 11:46:09 +01:00
Georgi Gerganov
d30cb5a7fa
sync : ggml
...
ggml-ci
2025-05-19 13:29:56 +03:00
Johannes Gäßler
6c35981a64
mnist: fix segmentation fault (ggml/1227)
2025-05-19 13:29:56 +03:00
Diego Devesa
8b5e19aea6
ggml : fix apple OS check in ggml_print_backtrace (ggml/1229)
2025-05-19 13:29:56 +03:00
Daniel Tang
60aea028b5
ggml : Fix missing backtrace on Linux (ggml/1228)
...
* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols
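The first bullet refers to the Yama setting that restricts which processes may attach to another with ptrace. A minimal sketch for inspecting that value, written in Python rather than the C used by ggml; the helper name `read_ptrace_scope` is hypothetical and not part of the commit:

```python
from pathlib import Path
from typing import Optional

# Path named in the commit message; it exists only on Linux kernels with Yama enabled.
PTRACE_SCOPE = Path("/proc/sys/kernel/yama/ptrace_scope")


def read_ptrace_scope(path: Path = PTRACE_SCOPE) -> Optional[int]:
    """Return the ptrace_scope value, or None if it cannot be read or parsed."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    scope = read_ptrace_scope()
    if scope is None:
        print("ptrace_scope is not available on this system")
    else:
        # 0 = classic ptrace permissions; 1 = attach is limited to descendants
        # (or to a tracer the target has explicitly allowed).
        print(f"ptrace_scope = {scope}")
```

A value of 1 means a debugger that is not an ancestor of the target cannot attach by default, which is the situation the bullets above describe.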
2025-05-19 13:29:56 +03:00
Nick
9c55e5c5c2
fix: check model pointer validity before use ( #13631 )
2025-05-19 13:25:41 +03:00
Chenguang Li
33d7aed4a8
CANN: Support MOE Model MUL_MAT_ID ( #13042 )
...
Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-19 14:21:17 +08:00
Isaac McFadyen
6a2bc8bfb7
server : added --no-prefill-assistant flag ( #13608 )
...
* added no-prefill-assistant flag
* reworded documentation comment
* updated server README.md
2025-05-17 23:59:48 +02:00
Gilad S.
e3a7cf6c5b
cmake: use the current build config for vulkan-shaders-gen ( #13595 )
...
* fix: use the current build config for `vulkan-shaders-gen`
* fix: only pass a valid build type to `--config`
2025-05-17 15:26:43 -03:00
Georgi Gerganov
518329b2d4
parallel : add option for non-shared and larger prompts ( #13598 )
...
* parallel : add option for non-shared and larger prompts
* parallel : update readme [no ci]
* cont : add note about base models [no ci]
* parallel : better var name
ggml-ci
2025-05-17 12:58:55 +03:00
Jeff Bolz
2f5a4e1e09
vulkan: move common FA code to flash_attn_base.comp ( #13556 )
...
* vulkan: move common FA code to flash_attn_base.comp
* vulkan: move common FA index/stride setup code to flash_attn_base.comp
* build fix
2025-05-17 09:14:55 +02:00
Jeff Bolz
4f41ee11d6
vulkan: use scalar FA rather than coopmat2 when N==1 ( #13554 )
2025-05-17 08:35:47 +02:00
Z
3e0be1cace
llguidance : official v0.7.20 release (no actual changes) [noci] ( #13594 )
2025-05-16 22:56:28 +02:00
Xuan-Son Nguyen
6aa892ec2a
server : do not return error out of context (with ctx shift disabled) ( #13577 )
2025-05-16 21:50:00 +02:00
Xuan-Son Nguyen
aea9f8b4e7
webui : improve accessibility for visually impaired people ( #13551 )
...
* webui : improve accessibility for visually impaired people
* add a11y for extra contents
* fix some labels being read twice
* add skip to main content
2025-05-16 21:49:01 +02:00
Xuan-Son Nguyen
06c1e4abc1
readme : add list of dependencies and their license ( #13591 )
2025-05-16 20:04:18 +02:00
Diego Devesa
415e40a357
releases : use arm version of curl for arm releases ( #13592 )
2025-05-16 19:36:51 +02:00
Georgi Gerganov
654a67794f
metal : add FA-vec kernel for head size 64 ( #13583 )
...
ggml-ci
2025-05-16 20:32:58 +03:00
Diego Devesa
5364ae4ba5
llama : print hint when loading a model when no backends are loaded ( #13589 )
2025-05-16 16:38:07 +02:00
Sigbjørn Skjæret
7c07ac244d
ci : add ppc64el to build-linux-cross ( #13575 )
2025-05-16 14:54:23 +02:00
Łukasz Ślusarczyk
0a338ed013
sycl : fixed compilation warnings ( #13582 )
2025-05-16 18:15:29 +08:00
Olivier Chafik
bc098c3cf0
minja: sync (qwen3) ( #13573 )
...
* minja: sync f06140fa52
- https://github.com/google/minja/pull/67 (@grf53)
- https://github.com/google/minja/pull/66 (@taha-yassine)
- https://github.com/google/minja/pull/63 (@grf53)
- https://github.com/google/minja/pull/58
---------
Co-authored-by: ochafik <ochafik@google.com>
2025-05-15 23:29:10 +01:00
Diego Devesa
c6a2c9e741
gguf : use ggml log system ( #13571 )
...
* gguf : use ggml log system
* llama : remove unnecessary new lines in exception messages
2025-05-15 19:13:11 +02:00
Daniel Tang
07ad2b6db3
gguf-py : fix disconnect-before-connect in editor-gui ( #13569 )
...
The bug caused a crash upon load with venvs created with
--system-site-packages to use
python3-pyside6.qtwidgets=python3-pyside6.qtwidgets=6.6.2-4
from Kubuntu 24.10.
2025-05-15 18:47:10 +02:00
Xuan-Son Nguyen
c531edfa34
convert : fix conversion for llama 4 ( #13567 )
2025-05-15 17:40:07 +02:00
Atharva Dubey
02cdd2d8b0
sycl: simplify bin_bcast_kernel ( #13383 )
2025-05-15 17:39:52 +02:00
Svetlozar Georgiev
64bb51cf90
sycl: reordered Q4_K MMVQ ( #13109 )
2025-05-15 17:35:44 +02:00
Łukasz Ślusarczyk
9c404ed54c
sycl: use oneDNN for matrices multiplication ( #12972 )
2025-05-15 16:53:41 +02:00
Diego Devesa
6c8b91500e
llama-bench : fix -ot with dl backends ( #13563 )
2025-05-15 15:46:55 +02:00
Xuan-Son Nguyen
3cc1f1f1d2
webui : handle PDF input (as text or image) + convert pasted long content to file ( #13562 )
...
* webui : handle PDF input (as text or image)
* handle the case where pdf image + server without mtmd
* fix bug missing pages
2025-05-15 14:24:50 +02:00
Piotr Wilkin (ilintar)
c753d7bed0
server : proper error handling for missing elements in messages array (OpenAI compatible backend) ( #13540 )
2025-05-15 08:40:58 +02:00
Georgi Gerganov
b2838049cc
bench : handle decode errors ( #13548 )
...
ggml-ci
2025-05-15 05:57:02 +03:00
Olivier Chafik
aa48e373f2
server : inject date_string in llama 3.x template + fix date for firefunction v2 ( #12802 )
...
* Inject date_string in llama 3.x + fix for functionary v2
https://github.com/ggml-org/llama.cpp/issues/12729
* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-05-15 02:39:51 +01:00
Georgi Gerganov
e3a9421b78
kv-cache : fix out-of-bounds view during reserve graph ( #13547 )
...
* kv-cache : fix reserve graph out-of-bounds access
ggml-ci
* cont : add comment
* cont : fix comments [no ci]
* cont : more correct comment [no ci]
2025-05-14 23:15:15 +03:00
Yibo Cai
5ab5d5fb25
arm64: optimize q6_k_q8_k kernel with i8mm ( #13519 )
...
This PR improves q6_k_q8_k gemm kernel with arm64 i8mm instruction.
Tested on neoverse-n2 with llama3 8b q6_k quantization model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above
Perplexity doesn't change with this PR.
```
// tested on neoverse-n2
$ llama-batched-bench \
-m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
--no-mmap -fa \
-c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
-npl 1,2,4,8,16,32 \
-t 64
---------------------------------------------------------------------
| PP | TG | B | S_PP t/s | S_TG t/s |
| | | | original | this pr | original | this pr |
|-------|--------|------|----------|----------|----------|----------|
| 128 | 128 | 1 | 78.52 | 109.18 | 18.63 | 18.88 |
| 128 | 128 | 2 | 84.62 | 123.94 | 34.54 | 36.92 |
| 128 | 128 | 4 | 84.36 | 122.49 | 52.65 | 61.32 |
| 128 | 128 | 8 | 90.52 | 138.87 | 63.46 | 84.41 |
| 128 | 128 | 16 | 90.11 | 138.56 | 71.04 | 101.33 |
| 128 | 128 | 32 | 89.81 | 137.79 | 75.14 | 110.47 |
---------------------------------------------------------------------
```
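The uplift percentages quoted above can be recomputed from the table. A small sketch, with the figures copied from the table; the function name `uplift_pct` is only illustrative:

```python
# (batch size, S_PP original, S_PP this PR, S_TG original, S_TG this PR),
# copied from the benchmark table in the commit message.
ROWS = [
    (1, 78.52, 109.18, 18.63, 18.88),
    (2, 84.62, 123.94, 34.54, 36.92),
    (4, 84.36, 122.49, 52.65, 61.32),
    (8, 90.52, 138.87, 63.46, 84.41),
    (16, 90.11, 138.56, 71.04, 101.33),
    (32, 89.81, 137.79, 75.14, 110.47),
]


def uplift_pct(before: float, after: float) -> float:
    """Relative throughput gain of `after` over `before`, in percent."""
    return (after / before - 1.0) * 100.0


for batch, pp0, pp1, tg0, tg1 in ROWS:
    print(
        f"B={batch:2d}  "
        f"S_PP +{uplift_pct(pp0, pp1):5.1f}%  "
        f"S_TG +{uplift_pct(tg0, tg1):5.1f}%"
    )
```

Computed this way, S_PP gains fall between roughly 39% and 54% across all batch sizes, and S_TG gains between roughly 16% and 47% for batch sizes 4 and above, in line with the summary.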
2025-05-14 21:53:52 +02:00
Olivier Chafik
3198405e98
common : add partial regex support ( #12808 )
...
* move string_find_partial_stop & string_ends_with to common
* add common_regex (supports partial matches)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/regex-partial.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* partial regex: add missing iterator end checks
* string utils: use string_views
* direct throw to avoid ggml.h include
* regex-partial: replace missed ggml_asserts
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
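To illustrate the idea behind a partial stop match, here is a minimal Python sketch. It is not the code in common/: the function name `find_partial_stop` and its return convention are assumptions made for this example. It reports where the tail of the generated text begins to look like the start of a stop string.

```python
def find_partial_stop(text: str, stop: str) -> int:
    """Return the index in `text` where a proper prefix of `stop` begins
    and runs to the end of `text`, or -1 if there is no such prefix."""
    # Try the longest candidate prefix first so the earliest start wins.
    for n in range(min(len(stop) - 1, len(text)), 0, -1):
        if text.endswith(stop[:n]):
            return len(text) - n
    return -1


if __name__ == "__main__":
    print(find_partial_stop("Hello <|e", "<|eot|>"))
    print(find_partial_stop("Hello", "<|eot|>"))
```

A streaming caller could hold back the text from the returned index onward until later tokens either complete the stop string or rule it out.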
2025-05-14 19:50:57 +01:00
Sigbjørn Skjæret
f5170c1d7a
editorconfig : fix trailing whitespace from #13542 ( #13546 )
2025-05-14 21:22:49 +03:00
Gilad S.
017f10b5fa
fix: crash when calling llama_state_get_size on a context without a KV cache ( #13542 )
2025-05-14 19:18:18 +03:00
Johannes Gäßler
4696d56749
CUDA: fix crash on large batch size for quant. MoE ( #13537 )
2025-05-14 16:41:02 +02:00
Diego Devesa
b7d2672082
llama : fix quantize with dl backends ( #13539 )
2025-05-14 16:12:36 +02:00
Johannes Gäßler
6da34fa276
CUDA: faster Deepseek FA, add Turing support ( #13435 )
2025-05-14 16:08:20 +02:00
Gabe Goodhart
5e7d95e22e
fix: Move build_inp_pos to the top of the graph section for build_granite ( #13538 )
...
This matches how others do it, but will still avoid the extra
initialization when rope is disabled.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-05-14 15:53:59 +03:00
Georgi Gerganov
053174436f
server : passthrough the /models endpoint during loading ( #13535 )
...
* server : passthrough the /models endpoint during loading
* server : update readme + return json for "meta" field
2025-05-14 15:42:10 +03:00
Xuan-Son Nguyen
360a9c98e1
server : fix cache_tokens bug with no cache_prompt ( #13533 )
2025-05-14 13:35:07 +02:00
bandoti
09d13d94fb
cmake: simplify vulkan shader test logic ( #13263 )
2025-05-14 07:53:57 -03:00
Jeff Bolz
24e86cae72
vulkan: KHR_coopmat flash attention ( #13506 )
...
This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.
2025-05-14 11:55:26 +02:00
Xuan-Son Nguyen
bb1681fbd5
webui : use fflate for more deterministic gzip compress ( #13525 )
...
* webui : use pako for more deterministic gzip compress
* simpler code
* use fflate instead of pako
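One common reason gzip output differs between otherwise identical builds is the modification-time field in the gzip header. The sketch below shows this with Python's standard `gzip` module. It does not use fflate or pako, and whether this field was the cause in the webui build is an assumption; the payload is a placeholder.

```python
import gzip

payload = b"<html>llama.cpp webui</html>"  # placeholder content

# With a fixed mtime the output is byte-for-byte reproducible.
first = gzip.compress(payload, mtime=0)
second = gzip.compress(payload, mtime=0)
print(first == second)

# Bytes 4..7 of the gzip header hold MTIME (little-endian), so a different
# timestamp changes the file even though the payload is identical.
other = gzip.compress(payload, mtime=1)
print(first[4:8], other[4:8])
```

Pinning the timestamp, or using a compressor that does not embed one, keeps the generated archive stable across builds.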
2025-05-14 10:26:12 +02:00
Luca Stefani
d486dd3e8e
webui: Allow pasting file from clipboard ( #13526 )
...
* server: Allow pasting file from clipboard
* server: Prevent default action on file paste
* update build
* format then build combined
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-14 10:07:31 +02:00