Commit graph

  • 42158ae2e8
    server : fix first message identification (#13634) Dorin-Andrei Geman 2025-05-21 16:07:57 +03:00
  • 797f2ac062
    kv-cache : simplify the interface (#13660) Georgi Gerganov 2025-05-21 15:11:13 +03:00
  • b44890df2e
    model : disable SWA for Phi models (#13676) Georgi Gerganov 2025-05-21 13:09:21 +03:00
  • 33983057d0
    musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (#13647) R0CKSTAR 2025-05-21 09:58:49 +08:00
  • fb1cab201c
    vulkan: fix warnings (#13626) Eve 2025-05-20 21:35:16 +00:00
  • b7a17463ec
    mtmd-helper : bug fix to token batching in mtmd (#13650) l3utterfly 2025-05-21 00:55:30 +08:00
  • be0239693c
    model : fix llama4 graph (#13663) Georgi Gerganov 2025-05-20 19:21:04 +03:00
  • a4090d1174
    llama : remove llama_kv_cache_view API + remove deprecated (#13653) Georgi Gerganov 2025-05-20 16:13:16 +03:00
  • b69f1647f9
    CUDA: skip fully masked-out KV in FA vec kernel (#13584) Johannes Gäßler 2025-05-20 14:45:07 +02:00
  • 759e37b0d8
    tests : avoid github urls due to throttling (#13654) Sigbjørn Skjæret 2025-05-20 12:03:17 +02:00
  • 4245e622e0
    sycl: disable reorder for sycl mulmat (#13536) Svetlozar Georgiev 2025-05-20 10:34:15 +01:00
  • c9c64dee57
    Set GLM4 blk.*.attn_output.weight, kqv_out-* matmul to GGML_PREC_F32 to fix infinity values in output (#13639) 0cc4m 2025-05-20 10:11:56 +02:00
  • c00a2634be
    metal : fix typo in FA kernel comments (#13651) Georgi Gerganov 2025-05-20 10:41:40 +03:00
  • e298d2fbd0
    kv-cache : add SWA support (#13194) Georgi Gerganov 2025-05-20 08:05:46 +03:00
  • f0adb80bf7
    CANN: Update CANN model support (#13162) Xinpeng Dou 2025-05-20 11:43:43 +08:00
  • f7c9429c85
    sycl : Overcoming workaround for mmap() allocation on Windows (#13482) Nicolò Scipione 2025-05-20 02:54:43 +02:00
  • 1dfbf2cf3a
    common : add load_progress_callback (#13617) psocolovsky 2025-05-19 21:17:36 +02:00
  • 8960efd0a6
    Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (#13607) 0cc4m 2025-05-19 17:54:08 +02:00
  • 725f23f1f3
    sycl : backend documentation review (#13544) Alberto Cabrera Pérez 2025-05-19 14:38:20 +01:00
  • 92ecdcc06a
    mtmd : add vision support for llama 4 (#13282) Xuan-Son Nguyen 2025-05-19 13:04:14 +02:00
  • f71f40a284
    ci : upgraded oneAPI version in SYCL workflows and dockerfile (#13532) Alberto Cabrera Pérez 2025-05-19 11:46:09 +01:00
  • d30cb5a7fa
    sync : ggml Georgi Gerganov 2025-05-19 12:50:29 +03:00
  • 6c35981a64
    mnist: fix segmentation fault (ggml/1227) Johannes Gäßler 2025-05-19 09:33:35 +02:00
  • 8b5e19aea6
    ggml : fix apple OS check in ggml_print_backtrace (ggml/1229) Diego Devesa 2025-05-18 18:30:13 -07:00
  • 60aea028b5
    ggml : Fix missing backtrace on Linux (ggml/1228) Daniel Tang 2025-05-17 19:06:26 -04:00
  • 9c55e5c5c2
    fix: check model pointer validity before use (#13631) Nick 2025-05-19 18:25:41 +08:00
  • 33d7aed4a8
    CANN: Support MOE Model MUL_MAT_ID (#13042) Chenguang Li 2025-05-19 14:21:17 +08:00
  • 6a2bc8bfb7
    server : added --no-prefill-assistant flag (#13608) Isaac McFadyen 2025-05-17 17:59:48 -04:00
  • e3a7cf6c5b
    cmake: use the current build config for vulkan-shaders-gen (#13595) Gilad S. 2025-05-17 21:26:43 +03:00
  • 518329b2d4
    parallel : add option for non-shared and larger prompts (#13598) Georgi Gerganov 2025-05-17 12:58:55 +03:00
  • 2f5a4e1e09
    vulkan: move common FA code to flash_attn_base.comp (#13556) Jeff Bolz 2025-05-17 16:14:55 +09:00
  • 4f41ee11d6
    vulkan: use scalar FA rather than coopmat2 when N==1 (#13554) Jeff Bolz 2025-05-17 15:35:47 +09:00
  • 3e0be1cace
    llguidance : official v0.7.20 release (no actual changes) [noci] (#13594) Z 2025-05-16 14:56:28 -06:00
  • 6aa892ec2a
    server : do not return error out of context (with ctx shift disabled) (#13577) Xuan-Son Nguyen 2025-05-16 21:50:00 +02:00
  • aea9f8b4e7
    webui : improve accessibility for visually impaired people (#13551) Xuan-Son Nguyen 2025-05-16 21:49:01 +02:00
  • 06c1e4abc1
    readme : add list of dependencies and their license (#13591) Xuan-Son Nguyen 2025-05-16 20:04:18 +02:00
  • 415e40a357
    releases : use arm version of curl for arm releases (#13592) Diego Devesa 2025-05-16 10:36:51 -07:00
  • 654a67794f
    metal : add FA-vec kernel for head size 64 (#13583) Georgi Gerganov 2025-05-16 20:32:58 +03:00
  • 5364ae4ba5
    llama : print hint when loading a model when no backends are loaded (#13589) Diego Devesa 2025-05-16 07:38:07 -07:00
  • 7c07ac244d
    ci : add ppc64el to build-linux-cross (#13575) Sigbjørn Skjæret 2025-05-16 14:54:23 +02:00
  • 0a338ed013
    sycl : fixed compilation warnings (#13582) Łukasz Ślusarczyk 2025-05-16 12:15:29 +02:00
  • bc098c3cf0
    minja: sync (qwen3) (#13573) Olivier Chafik 2025-05-15 23:29:10 +01:00
  • c6a2c9e741
    gguf : use ggml log system (#13571) Diego Devesa 2025-05-15 10:13:11 -07:00
  • 07ad2b6db3
    gguf-py : fix disconnect-before-connect in editor-gui (#13569) Daniel Tang 2025-05-15 12:47:10 -04:00
  • c531edfa34
    convert : fix conversion for llama 4 (#13567) Xuan-Son Nguyen 2025-05-15 17:40:07 +02:00
  • 02cdd2d8b0
    sycl: simplify bin_bcast_kernel (#13383) Atharva Dubey 2025-05-15 16:39:52 +01:00
  • 64bb51cf90
    sycl: reordered Q4_K MMVQ (#13109) Svetlozar Georgiev 2025-05-15 16:35:44 +01:00
  • 9c404ed54c
    sycl: use oneDNN for matrices multiplication (#12972) Łukasz Ślusarczyk 2025-05-15 16:53:41 +02:00
  • 6c8b91500e
    llama-bench : fix -ot with dl backends (#13563) Diego Devesa 2025-05-15 06:46:55 -07:00
  • 3cc1f1f1d2
    webui : handle PDF input (as text or image) + convert pasted long content to file (#13562) Xuan-Son Nguyen 2025-05-15 14:24:50 +02:00
  • c753d7bed0
    server : proper error handling for missing elements in messages array (OpenAI compatible backend) (#13540) Piotr Wilkin (ilintar) 2025-05-15 08:40:58 +02:00
  • b2838049cc
    bench : handle decode errors (#13548) Georgi Gerganov 2025-05-15 05:57:02 +03:00
  • aa48e373f2
    server: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802) Olivier Chafik 2025-05-15 02:39:51 +01:00
  • e3a9421b78
    kv-cache : fix out-of-bounds view during reserve graph (#13547) Georgi Gerganov 2025-05-14 23:15:15 +03:00
  • 5ab5d5fb25
    arm64: optimize q6_k_q8_k kernel with i8mm (#13519) Yibo Cai 2025-05-15 03:53:52 +08:00
  • 3198405e98
    common: add partial regex support (#12808) Olivier Chafik 2025-05-14 19:50:57 +01:00
  • f5170c1d7a
    editorconfig : fix trailing whitespace from #13542 (#13546) Sigbjørn Skjæret 2025-05-14 20:22:49 +02:00
  • 017f10b5fa
    fix: crash when calling llama_state_get_size on a context without a KV cache (#13542) Gilad S. 2025-05-14 19:18:18 +03:00
  • 4696d56749
    CUDA: fix crash on large batch size for quant. MoE (#13537) Johannes Gäßler 2025-05-14 16:41:02 +02:00
  • b7d2672082
    llama : fix quantize with dl backends (#13539) Diego Devesa 2025-05-14 07:12:36 -07:00
  • 6da34fa276
    CUDA: faster Deepseek FA, add Turing support (#13435) Johannes Gäßler 2025-05-14 16:08:20 +02:00
  • 5e7d95e22e
    fix: Move build_inp_pos to the top of the graph section for build_granite (#13538) Gabe Goodhart 2025-05-14 06:53:59 -06:00
  • 053174436f
    server : passthrough the /models endpoint during loading (#13535) Georgi Gerganov 2025-05-14 15:42:10 +03:00
  • 360a9c98e1
    server : fix cache_tokens bug with no cache_prompt (#13533) Xuan-Son Nguyen 2025-05-14 13:35:07 +02:00
  • 09d13d94fb
    cmake: simplify vulkan shader test logic (#13263) bandoti 2025-05-14 07:53:57 -03:00
  • 24e86cae72
    vulkan: KHR_coopmat flash attention (#13506) Jeff Bolz 2025-05-14 18:55:26 +09:00
  • bb1681fbd5
    webui : use fflate for more deterministic gzip compress (#13525) Xuan-Son Nguyen 2025-05-14 10:26:12 +02:00
  • d486dd3e8e
    webui: Allow pasting file from clipboard (#13526) Luca Stefani 2025-05-14 10:07:31 +02:00
  • 21ca987fba
    docs: Update link to ggml-org in multimodal.md (#13513) ddpasa 2025-05-14 09:59:12 +02:00
  • be1d4a13db
    scripts : fix compare-llama-bench.py show parameter (#13514) Sigbjørn Skjæret 2025-05-14 08:41:01 +02:00
  • ab3971f2a0
    vulkan: workaround FA compile failures on macos (#13517) Jeff Bolz 2025-05-14 13:15:50 +09:00
  • e5c834f718
    quantize : improve tensor-type pattern matching (#13033) Ed Addario 2025-05-13 18:12:31 +01:00
  • 71bdbdb587
    clip : clip.h become private API (⚠️ breaking change) (#13510) Xuan-Son Nguyen 2025-05-13 17:07:21 +02:00
  • f0995d28ce
    metal : use FA-vec kernel up to batch size 20 (#13496) Georgi Gerganov 2025-05-13 18:04:39 +03:00
  • c252e0c409
    metal : optimize multi-sequence FA vec kernel (#13493) Georgi Gerganov 2025-05-13 18:04:00 +03:00
  • 4f711afed5
    ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509) Dan Johansson 2025-05-13 17:02:28 +02:00
  • b89d605a91
    batched-bench : fix pp batch contents (#13492) Georgi Gerganov 2025-05-13 18:01:53 +03:00
  • b4726345ac
    mtmd : remove libllava, remove clip-quantize-cli (⚠️ breaking change) (#13460) Xuan-Son Nguyen 2025-05-13 15:33:58 +02:00
  • bf79371120
    scripts : support arbitrary input file formats in compare-llama-bench.py (#13455) Sigbjørn Skjæret 2025-05-13 15:31:12 +02:00
  • d590cd4c24
    model : Granite MoE shared (#13269) Gabe Goodhart 2025-05-13 07:12:01 -06:00
  • 1e2809bc4b
    sync : ggml Georgi Gerganov 2025-05-13 14:01:45 +03:00
  • cf0a43bb64
    llama-bench : add defrag-thold, check for invalid ranges (#13487) Diego Devesa 2025-05-12 15:31:37 -07:00
  • f0d46ef157
    opencl: remove unnecessary assert for add (#13257) lhez 2025-05-12 13:13:49 -07:00
  • de4c07f937
    clip : cap max image size 1024 for qwen vl model (#13478) Xuan-Son Nguyen 2025-05-12 15:06:51 +02:00
  • 10d2af0eaa
    llama/ggml: add LLM training support (#10544) Johannes Gäßler 2025-05-12 14:44:49 +02:00
  • 064cc596ac
    context : fix state io for memory-less contexts (#13470) Georgi Gerganov 2025-05-12 15:12:27 +03:00
  • 91159ee9df
    server : allow content to be null in oaicompat_completion_params_parse (#13477) Anudit Nagar 2025-05-12 18:56:42 +07:00
  • 22cdab343b
    llama-bench : accept ranges for integer parameters (#13410) Diego Devesa 2025-05-12 13:08:22 +02:00
  • a71a4075cd
    ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (#13053) Dan Johansson 2025-05-12 13:06:19 +02:00
  • 95e18884fc
    CUDA: fix misaligned synchronization in FA (#13469) Johannes Gäßler 2025-05-12 10:51:21 +02:00
  • df8491922f
    ggml : add mrope kernel for metal (#13457) Xuan-Son Nguyen 2025-05-12 10:29:13 +02:00
  • 14492144c2
    enable dpcpp nightly builds with libraries (#13406) Atharva Dubey 2025-05-12 06:15:32 +01:00
  • c104023994
    mtmd : Use RMS norm for InternVL 3 38B and 78B mmproj (#13459) City 2025-05-12 00:39:06 +02:00
  • 9a390c4829
    tools : fix uninitialized llama_batch in server (#13436) Anthony Umfer 2025-05-11 11:08:26 -04:00
  • 09232370fc
    scripts : exit compare-llama-bench.py gracefully when there's nothing to compare (#13451) Sigbjørn Skjæret 2025-05-11 16:20:39 +02:00
  • 7474e00b34
    CUDA: fix crash with partial offloading of MoE (#13439) Johannes Gäßler 2025-05-11 16:09:33 +02:00
  • 7f323a589f
    Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B (#13386) David Huang 2025-05-11 20:18:39 +08:00
  • 3eac209319
    mtmd : support InternVL 3 38B and 78B mmproj (#13443) City 2025-05-11 11:35:52 +02:00
  • a634d75d1b
    mtmd : move helpers to dedicated file (#13442) Xuan-Son Nguyen 2025-05-11 11:34:23 +02:00
  • 62d4250e52
    docs : Fix typo in InternVL3 model name (#13440) Thomas Germer 2025-05-10 22:26:46 +02:00