Nicolò Scipione
f7c9429c85
sycl : Overcoming workaround for mmap() allocation on Windows ( #13482 )
...
* Remove mmap workaround on windows
After some testing I found that mmap is supported on windows and for
many GPUs on Linux. Therefore I remove the workaround for windows since
it is not necessary.
* Update llama-bench README
SYCL backend introduced a workaround that allows execution of
llama-bench also without specifying `--mmp 0` flag
2025-05-20 08:54:43 +08:00
psocolovsky
1dfbf2cf3a
common : add load_progress_callback ( #13617 )
2025-05-19 21:17:36 +02:00
0cc4m
8960efd0a6
Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence ( #13607 )
2025-05-19 17:54:08 +02:00
Alberto Cabrera Pérez
725f23f1f3
sycl : backend documentation review ( #13544 )
...
* sycl: reviewing and updating docs
* Updates Runtime error codes
* Improves OOM troubleshooting entry
* Added a llama 3 sample
* Updated supported models
* Updated releases table
2025-05-19 14:38:20 +01:00
Xuan-Son Nguyen
92ecdcc06a
mtmd : add vision support for llama 4 ( #13282 )
...
* wip llama 4 conversion
* rm redundant __init__
* fix conversion
* fix conversion
* test impl
* try this
* reshape patch_embeddings_0
* fix view
* rm ffn_post_norm
* cgraph ok
* f32 for pos embd
* add image marker tokens
* Llama4UnfoldConvolution
* correct pixel shuffle
* fix merge conflicts
* correct
* add debug_graph
* logits matched, but it still preceives the image incorrectly
* fix style
* add image_grid_pinpoints
* handle llama 4 preprocessing
* rm load_image_size
* rm unused line
* fix
* small fix 2
* add test & docs
* fix llava-1.6 test
* test: add notion of huge models
* add comment
* add warn about degraded quality
2025-05-19 13:04:14 +02:00
Alberto Cabrera Pérez
f71f40a284
ci : upgraded oneAPI version in SYCL workflows and dockerfile ( #13532 )
2025-05-19 11:46:09 +01:00
Georgi Gerganov
d30cb5a7fa
sync : ggml
...
ggml-ci
2025-05-19 13:29:56 +03:00
Johannes Gäßler
6c35981a64
mnist: fix segmentation fault (ggml/1227)
2025-05-19 13:29:56 +03:00
Diego Devesa
8b5e19aea6
ggml : fix apple OS check in ggml_print_backtrace (ggml/1229)
2025-05-19 13:29:56 +03:00
Daniel Tang
60aea028b5
ggml : Fix missing backtrace on Linux (ggml/1228)
...
* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols
2025-05-19 13:29:56 +03:00
Nick
9c55e5c5c2
fix: check model pointer validity before use ( #13631 )
2025-05-19 13:25:41 +03:00
Chenguang Li
33d7aed4a8
CANN: Support MOE Model MUL_MAT_ID ( #13042 )
...
Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-19 14:21:17 +08:00
Isaac McFadyen
6a2bc8bfb7
server : added --no-prefill-assistant flag ( #13608 )
...
* added no-prefill-assistant flag
* reworded documentation comment
* updated server README.md
2025-05-17 23:59:48 +02:00
Gilad S.
e3a7cf6c5b
cmake: use the current build config for vulkan-shaders-gen ( #13595 )
...
* fix: use the current build config for `vulkan-shaders-gen`
* fix: only pass a valid build type to `--config`
2025-05-17 15:26:43 -03:00
Georgi Gerganov
518329b2d4
parallel : add option for non-shared and larger prompts ( #13598 )
...
* parallel : add option for non-shared and larger prompts
* parallel : update readme [no ci]
* cont : add note about base models [no ci]
* parallel : better var name
ggml-ci
2025-05-17 12:58:55 +03:00
Jeff Bolz
2f5a4e1e09
vulkan: move common FA code to flash_attn_base.comp ( #13556 )
...
* vulkan: move common FA code to flash_attn_base.comp
* vulkan: move common FA index/stride setup code to flash_attn_base.comp
* build fix
2025-05-17 09:14:55 +02:00
Jeff Bolz
4f41ee11d6
vulkan: use scalar FA rather than coopmat2 when N==1 ( #13554 )
2025-05-17 08:35:47 +02:00
Z
3e0be1cace
llguidance : official v0.7.20 release (no actual changes) [noci] ( #13594 )
2025-05-16 22:56:28 +02:00
Xuan-Son Nguyen
6aa892ec2a
server : do not return error out of context (with ctx shift disabled) ( #13577 )
2025-05-16 21:50:00 +02:00
Xuan-Son Nguyen
aea9f8b4e7
webui : improve accessibility for visually impaired people ( #13551 )
...
* webui : improve accessibility for visually impaired people
* add a11y for extra contents
* fix some labels being read twice
* add skip to main content
2025-05-16 21:49:01 +02:00
Xuan-Son Nguyen
06c1e4abc1
readme : add list of dependencies and their license ( #13591 )
2025-05-16 20:04:18 +02:00
Diego Devesa
415e40a357
releases : use arm version of curl for arm releases ( #13592 )
2025-05-16 19:36:51 +02:00
Georgi Gerganov
654a67794f
metal : add FA-vec kernel for head size 64 ( #13583 )
...
ggml-ci
2025-05-16 20:32:58 +03:00
Diego Devesa
5364ae4ba5
llama : print hint when loading a model when no backends are loaded ( #13589 )
2025-05-16 16:38:07 +02:00
Sigbjørn Skjæret
7c07ac244d
ci : add ppc64el to build-linux-cross ( #13575 )
2025-05-16 14:54:23 +02:00
Łukasz Ślusarczyk
0a338ed013
sycl : fixed compilation warnings ( #13582 )
2025-05-16 18:15:29 +08:00
Olivier Chafik
bc098c3cf0
minja: sync (qwen3) ( #13573 )
...
* minja: sync f06140fa52
- https://github.com/google/minja/pull/67 (@grf53)
- https://github.com/google/minja/pull/66 (@taha-yassine)
- https://github.com/google/minja/pull/63 (@grf53)
- https://github.com/google/minja/pull/58
---------
Co-authored-by: ochafik <ochafik@google.com>
2025-05-15 23:29:10 +01:00
Diego Devesa
c6a2c9e741
gguf : use ggml log system ( #13571 )
...
* gguf : use ggml log system
* llama : remove unnecessary new lines in exception messages
2025-05-15 19:13:11 +02:00
Daniel Tang
07ad2b6db3
gguf-py : fix disconnect-before-connect in editor-gui ( #13569 )
...
The bug caused a crash upon load with venvs created with
--system-site-packages to use
python3-pyside6.qtwidgets=python3-pyside6.qtwidgets=6.6.2-4
from Kubuntu 24.10.
2025-05-15 18:47:10 +02:00
Xuan-Son Nguyen
c531edfa34
convert : fix conversion for llama 4 ( #13567 )
2025-05-15 17:40:07 +02:00
Atharva Dubey
02cdd2d8b0
sycl: simplify bin_bcast_kernel ( #13383 )
2025-05-15 17:39:52 +02:00
Svetlozar Georgiev
64bb51cf90
sycl: reordered Q4_K MMVQ ( #13109 )
2025-05-15 17:35:44 +02:00
Łukasz Ślusarczyk
9c404ed54c
sycl: use oneDNN for matrices multiplication ( #12972 )
2025-05-15 16:53:41 +02:00
Diego Devesa
6c8b91500e
llama-bench : fix -ot with dl backends ( #13563 )
2025-05-15 15:46:55 +02:00
Xuan-Son Nguyen
3cc1f1f1d2
webui : handle PDF input (as text or image) + convert pasted long content to file ( #13562 )
...
* webui : handle PDF input (as text or image)
* handle the case where pdf image + server without mtmd
* fix bug missing pages
2025-05-15 14:24:50 +02:00
Piotr Wilkin (ilintar)
c753d7bed0
server : proper error handling for missing elements in messages array (OpenAI compatible backend) ( #13540 )
2025-05-15 08:40:58 +02:00
Georgi Gerganov
b2838049cc
bench : handle decode errors ( #13548 )
...
ggml-ci
2025-05-15 05:57:02 +03:00
Olivier Chafik
aa48e373f2
server
: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802 )
...
* Inject date_string in llama 3.x + fix for functionary v2
https://github.com/ggml-org/llama.cpp/issues/12729
* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-05-15 02:39:51 +01:00
Georgi Gerganov
e3a9421b78
kv-cache : fix out-of-bounds view during reserve graph ( #13547 )
...
* kv-cache : fix reserve graph out-of-bounds access
ggml-ci
* cont : add comment
* cont : fix comments [no ci]
* cont : more correct comment [no ci]
2025-05-14 23:15:15 +03:00
Yibo Cai
5ab5d5fb25
arm64: optimize q6_k_q8_k kernel with i8mm ( #13519 )
...
This PR improves q6_k_q8_k gemm kernel with arm64 i8mm instruction.
Tested on neoverse-n2 with llama3 8b q6_k quantization model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above
Perplexity doesn't change with this PR.
```
// tested on neoverse-n2
$ llama-batched-bench \
-m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
--no-mmap -fa \
-c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
-npl 1,2,4,8,16,32 \
-t 64
---------------------------------------------------------------------
| PP | TG | B | S_PP t/s | S_TG t/s |
| | | | original | this pr | original | this pr |
|-------|--------|------|----------|----------|----------|----------|
| 128 | 128 | 1 | 78.52 | 109.18 | 18.63 | 18.88 |
| 128 | 128 | 2 | 84.62 | 123.94 | 34.54 | 36.92 |
| 128 | 128 | 4 | 84.36 | 122.49 | 52.65 | 61.32 |
| 128 | 128 | 8 | 90.52 | 138.87 | 63.46 | 84.41 |
| 128 | 128 | 16 | 90.11 | 138.56 | 71.04 | 101.33 |
| 128 | 128 | 32 | 89.81 | 137.79 | 75.14 | 110.47 |
---------------------------------------------------------------------
```
2025-05-14 21:53:52 +02:00
Olivier Chafik
3198405e98
common
: add partial regex support (#12808 )
...
* move string_find_partial_stop & string_ends_with to common
* add common_regex (supports partial matches)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/regex-partial.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* partial regex: add missing iterator end checks
* string utils: use string_views
* direct throw to avoid ggml.h include
* regex-partial: replace missed ggml_asserts
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-05-14 19:50:57 +01:00
Sigbjørn Skjæret
f5170c1d7a
editorconfig : fix trailing whitespace from #13542 ( #13546 )
2025-05-14 21:22:49 +03:00
Gilad S.
017f10b5fa
fix: crash when calling llama_state_get_size
on a context without a KV cache ( #13542 )
2025-05-14 19:18:18 +03:00
Johannes Gäßler
4696d56749
CUDA: fix crash on large batch size for quant. MoE ( #13537 )
2025-05-14 16:41:02 +02:00
Diego Devesa
b7d2672082
llama : fix quantize with dl backends ( #13539 )
2025-05-14 16:12:36 +02:00
Johannes Gäßler
6da34fa276
CUDA: faster Deepseek FA, add Turing support ( #13435 )
2025-05-14 16:08:20 +02:00
Gabe Goodhart
5e7d95e22e
fix: Move build_inp_pos to the top of the graph section for build_granite ( #13538 )
...
This matches how others do it, but will still avoid the extra
initialization when rope is disabled.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-05-14 15:53:59 +03:00
Georgi Gerganov
053174436f
server : passthrough the /models endpoint during loading ( #13535 )
...
* server : passthrough the /models endpoint during loading
* server : update readme + return json for "meta" field
2025-05-14 15:42:10 +03:00
Xuan-Son Nguyen
360a9c98e1
server : fix cache_tokens bug with no cache_prompt ( #13533 )
2025-05-14 13:35:07 +02:00
bandoti
09d13d94fb
cmake: simplify vulkan shader test logic ( #13263 )
2025-05-14 07:53:57 -03:00