Commit graph

  • fbdfefe74e
    llama : gemma3 : use output tensor if it exists in model weight (#12506) Xuan-Son Nguyen 2025-03-22 23:28:19 +01:00
  • ba932dfb50
    ggml : fix quantized cpy op (#12310) Georgi Gerganov 2025-03-22 16:23:26 +02:00
  • fac63a3d78
    musa: refine compute capability (#12493) R0CKSTAR 2025-03-22 17:11:37 +08:00
  • eddfb43850
    vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505) Jeff Bolz 2025-03-22 03:40:11 -05:00
  • 4375415b4a
    Vulkan: RTE rounding for cpy to quant (#12480) stduhpf 2025-03-21 20:34:50 +01:00
  • 30c42ef5cb
    vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (#12472) Eve 2025-03-21 19:27:47 +00:00
  • af04481e6b
    model : do not repack if a GPU device is present (#12498) Georgi Gerganov 2025-03-21 16:14:29 +02:00
  • 960e726077
    chore : cleanup llama_model_loader::TENSOR_ usage (#12492) Sigbjørn Skjæret 2025-03-21 10:21:36 +01:00
  • ea1518e839
    llama-tts : avoid crashes related to bad model file paths (#12482) marcoStocchi 2025-03-21 10:12:45 +01:00
  • 1aa87ee53d
    [SYCL] Fix build on Windows when ccache enabled (#9954) (#9976) 蕭澧邦 2025-03-21 14:58:47 +08:00
  • 9ffcc9e374
    sycl: cleanup oneDNN related code (#12097) Svetlozar Georgiev 2025-03-21 02:15:56 +00:00
  • e04643063b
    webui : Prevent rerendering on textarea input (#12299) Woof Dog 2025-03-20 14:57:43 +00:00
  • dbb3a4739e
    llama : make Qwen2MoE QKV bias optional (#12477) Sigbjørn Skjæret 2025-03-20 12:49:59 +01:00
  • 3d82dbcbce
    ggml : block interleaving support for Q4_K quantization for x86 AVX2 architecture (#12332) Srihari-mcw 2025-03-20 17:05:34 +05:30
  • 732b5fbf5e
    convert : avoid calls to tokenizer.added_tokens_decoder (#12473) Bartowski 2025-03-20 02:36:37 -04:00
  • 568013d0cd
    context : clear sets containing encoder output sequence ids before storing new values (#12470) fairydreaming 2025-03-19 21:01:57 +01:00
  • 517b5ddbf0
    CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183) Gaurav Garg 2025-03-20 01:22:06 +05:30
  • a9b59288e2
    vulkan: optimize iq1 coopmat2 dequant functions (#12427) Jeff Bolz 2025-03-19 13:56:23 -05:00
  • 0fd8487b14
    Fix visionOS build and add CI (#12415) Guus Waals 2025-03-19 10:15:23 +00:00
  • 108e53c2f1
    llama : add support for GPT2, Bloom and CodeShell tied word embeddings (#12456) Sigbjørn Skjæret 2025-03-19 09:08:49 +01:00
  • a686171ea7
    convert : Support chat_template.json (#12460) Sigbjørn Skjæret 2025-03-19 08:58:13 +01:00
  • c446b2edd2
    vulkan: Submit once enough matmul work has been recorded (#12406) Jeff Bolz 2025-03-19 02:26:26 -05:00
  • d84635b1b0
    opencl: improve profiling (#12442) lhez 2025-03-18 12:54:55 -07:00
  • 75422e8bc4
    graph : normalize Q, K, V shapes + sync cross attention (#12449) Georgi Gerganov 2025-03-18 21:35:19 +02:00
  • bb115d2bf7
    musa: override warp_size of musa device to 32 (#12445) R0CKSTAR 2025-03-19 02:28:26 +08:00
  • 29fff308c7
    llama : support converting Mistral Small text-only (#12450) Xuan-Son Nguyen 2025-03-18 19:16:19 +01:00
  • c6af2161b2
    speculative : fix seg fault in certain cases (#12454) Georgi Gerganov 2025-03-18 19:35:11 +02:00
  • 99aa304fb9
    llama : add support for EXAONE tied word embeddings (#12451) Xuan-Son Nguyen 2025-03-18 17:24:33 +01:00
  • 8551c44d84
    context : always use non-causal attention for encoder graphs (#12447) Georgi Gerganov 2025-03-18 13:05:49 +02:00
  • 35cae5ba05
    SYCL: using graphs is configurable by environment variable and compile option (#12371) Łukasz Ślusarczyk 2025-03-18 11:16:31 +01:00
  • 810e0af3f5
    server : fix warmup draft cache type (#12446) Georgi Gerganov 2025-03-18 12:05:42 +02:00
  • eba92d64c3
    cmake : fix PowerPC build (#12241) Prajwal B Mehendarkar 2025-03-18 15:07:33 +05:30
  • d9a14523bb
    ggml : add SVE support for q6_K_q8_K (#12361) fj-y-saito 2025-03-18 17:14:39 +09:00
  • fd123cfead
    Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (#12434) 0cc4m 2025-03-18 07:21:40 +01:00
  • a53f7f7b88
    fixed compilation warnings in ggml-sycl (#12424) Łukasz Ślusarczyk 2025-03-18 01:51:25 +01:00
  • 7dfad387e3
    llama: Add support for RWKV v7 architecture (#12412) Molly Sophia 2025-03-18 07:27:50 +08:00
  • 60c902926c
    docs : bring llama-cli conversation/template docs up-to-date (#12426) Sigbjørn Skjæret 2025-03-17 21:14:32 +01:00
  • b1b132efcb
    cuda : enable CUDA Graph on CUDA Toolkit < 12.x (#12394) Gaurav Garg 2025-03-17 23:55:13 +05:30
  • 01e8f2138b
    ggml-vulkan: remove unused find_program(glslc) (#12416) Guus Waals 2025-03-18 00:35:43 +08:00
  • 484a8ab513
    vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (#12312) Jeff Bolz 2025-03-17 09:26:18 -05:00
  • cf2270e4d3
    vulkan: subgroup size tuning (#12087) Daniele 2025-03-17 12:42:33 +01:00
  • f07690c930
    vulkan: use fp32 in coopmat2 q4_k dequant function (#12309) Jeff Bolz 2025-03-17 04:43:35 -05:00
  • 891c63956d
    vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (#12273) Jeff Bolz 2025-03-17 04:41:59 -05:00
  • 2f21123c1d
    vulkan: Adjust coopmat2 tile sizes and selection heuristic (#12258) Jeff Bolz 2025-03-17 04:35:00 -05:00
  • 374101fd74
    cmake : enable building llama.cpp using system libggml (#12321) Christian Kastner 2025-03-17 10:05:23 +01:00
  • b3c9a65673
    SYCL: set extras only on GGML_TYPE_Q4_0 (#12366) Akarshan Biswas 2025-03-17 07:15:12 +05:30
  • 8ba95dca20
    llama : fix OLMo-2-0325-32B-Instruct K-norm size (#12400) Sigbjørn Skjæret 2025-03-16 18:46:36 +01:00
  • dc079cfdff
    context : fix init of n_outputs (#12397) Georgi Gerganov 2025-03-16 19:29:36 +02:00
  • 7b61bcc87c
    ci : add --symlinks to xcframework zip command (#12409) Daniel Bevenius 2025-03-16 18:22:05 +01:00
  • f4c3dd5daa
    llama-tts : add '-o' option (#12398) marcoStocchi 2025-03-15 17:23:11 +01:00
  • 3d35d87b41
    SYCL: Delete redundant plus sign and space (#12391) aubreyli 2025-03-15 22:49:03 +08:00
  • b19bd064c0
    SYCL : support non-contiguous tensors in binary ops (add, sub, etc) (#12399) fairydreaming 2025-03-15 15:19:30 +01:00
  • 92a391327e
    [CANN]MUL_MAT optimization (#12382) Chenguang Li 2025-03-15 09:31:08 +08:00
  • 9f2250ba72
    Add CLI arg to llama-run to adjust the number of threads used (#12370) Eric Curtin 2025-03-14 16:41:20 +00:00
  • 774973b8f3
    main : add -sysf / --system-prompt-file (#12249) (#12250) Sigbjørn Skjæret 2025-03-14 16:57:05 +01:00
  • 8fcb563613
    Load all MoE experts during warmup (#11571) fairydreaming 2025-03-14 13:47:05 +01:00
  • add2a3aa5a
    server: fix "--grammar-file" parameter (#12285) Victor 2025-03-14 11:21:17 +01:00
  • c522ce4143
    graph : simplify attn input build for unified KV cache (#12381) Georgi Gerganov 2025-03-14 10:47:44 +02:00
  • 081bee8c64
    hparams : add SWA rope parameters (#12374) Georgi Gerganov 2025-03-14 09:03:24 +02:00
  • 84d5475541
    llama : fix Gemma3 SWA KV cache shift (#12373) Georgi Gerganov 2025-03-13 19:08:07 +02:00
  • be7c303410
    arg : no n_predict = -2 for examples except for main and infill (#12364) Xuan-Son Nguyen 2025-03-13 12:34:54 +01:00
  • e0dbec0bc6
    llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181) Georgi Gerganov 2025-03-13 12:35:44 +02:00
  • 2048b5913d
    server : fix crash when using verbose output with input tokens that are not in printable range (#12178) (#12338) Ishaan Gandhi 2025-03-13 06:10:05 -04:00
  • f08f4b3187
    Update build.yml for Windows Vulkan builder to use Vulkan 1.4.304 SDK for VK_NV_cooperative_matrix2 support (#12301) Oscar Barenys 2025-03-12 20:06:58 +01:00
  • 80a02aa858
    llama.swiftui : fix xcframework dir in README [no ci] (#12353) Daniel Bevenius 2025-03-12 13:45:32 +01:00
  • 363f8c5d67
    sycl : variable sg_size support for mmvq kernels (#12336) Alberto Cabrera Pérez 2025-03-12 09:57:32 +00:00
  • 34c961b181
    CUDA/HIP: Fix fattn-vec-* when device warp size is not 32 (#12315) uvos 2025-03-12 10:14:11 +01:00
  • 7841fc723e
    llama : Add Gemma 3 support (+ experimental vision capability) (#12343) Xuan-Son Nguyen 2025-03-12 09:30:24 +01:00
  • bf69cfe62f
    vulkan: fix bug in coopmat1 mul_mat_id (#12316) Jeff Bolz 2025-03-12 00:59:19 -05:00
  • 10f2e81809
    CUDA/HIP: refractor mmqv to unify the calculation of nwarps and rows per block between host and device code. (#12177) uvos 2025-03-11 20:16:03 +01:00
  • ba7654380a
    ggml-backend : fix backend search path (#12330) jklincn 2025-03-11 21:25:17 +08:00
  • 6ab2e4765a
    metal : Cache the Metal library at the device context level (#12265) BB-fat 2025-03-11 19:45:02 +08:00
  • 96e1280839
    clip : bring back GPU support (#12322) Xuan-Son Nguyen 2025-03-11 09:20:16 +01:00
  • 2c9f833d17
    mat vec double buffer (#12188) Eve 2025-03-10 19:28:11 +00:00
  • 251364549f
    musa: support new arch mp_31 and update doc (#12296) R0CKSTAR 2025-03-11 01:18:25 +08:00
  • 8acdacb3ea
    opencl: use OpenCL C standard supported by the device (#12221) Henry Linjamäki 2025-03-10 18:57:00 +02:00
  • 89b2b56e86
    readme: added Sidekick to available UIs (#12311) John Bean 2025-03-10 22:13:09 +08:00
  • e128a1bf5b
    tests : fix test-quantize-fns to init the CPU backend (#12306) Georgi Gerganov 2025-03-10 14:07:15 +02:00
  • 6ef79a67ca
    common : refactor '-o' option (#12278) marcoStocchi 2025-03-10 12:34:13 +01:00
  • 4e39a3c332
    server: extract <think> tags from qwq outputs (#12297) Olivier Chafik 2025-03-10 10:59:03 +00:00
  • be421fc429
    tool-call: ensure there's always a non-empty tool call id (#12292) Olivier Chafik 2025-03-10 09:45:29 +00:00
  • 87c2630546
    allow missing content in message if tool_calls provided (#12293) Olivier Chafik 2025-03-10 09:45:07 +00:00
  • 2b3a25c212
    sampler: fixes trigger tokens + lazy grammars (fix typo cast from token to string) (#12291) Olivier Chafik 2025-03-10 09:44:42 +00:00
  • 8352cdc87b
    llava : fix bug in minicpm-v code (#11513) tc-mb 2025-03-10 16:33:24 +08:00
  • 1e2f78a004
    server : add speculative decoding presets for FIM (#12287) Georgi Gerganov 2025-03-09 19:08:20 +02:00
  • 0fd7ca7a21
    authors : update (#12271) Georgi Gerganov 2025-03-08 18:26:00 +02:00
  • 6fefc05a7a
    ggml-backend : make path_str compatible with C++20 (#12269) Jason C.H 2025-03-09 00:02:39 +08:00
  • 7ab364390f
    server : infill gen ends on new line (#12254) Georgi Gerganov 2025-03-07 20:54:30 +02:00
  • 7c7f3b7f43
    ggml : skip intermediate .air file when compiling .metallib (#12247) Daniel Bevenius 2025-03-07 14:15:27 +01:00
  • 102ac1891d
    sync : ggml Georgi Gerganov 2025-03-07 14:00:27 +02:00
  • d6ae2fa061
    ggml : ggml_compute_forward_concat() for arbitrary tensor type (ggml/1118) vmobilis 2025-03-07 11:11:40 +03:00
  • 68d0027f3d
    ggml-cpu: faster AVX2 variant for IQ1_M (#12216) Rémy O 2025-03-07 12:54:22 +01:00
  • ea002810a2
    ci : fix save-load test invocations (#12245) Georgi Gerganov 2025-03-07 12:19:31 +02:00
  • 8fad3c7a7c
    server : Log original chat template parsing error (#12233) Sigbjørn Skjæret 2025-03-07 11:15:33 +01:00
  • 7cf64f6bee
    sync: minja - support QwQ-32B (#12235) Olivier Chafik 2025-03-07 09:33:37 +00:00
  • 5e2d57b2b2
    metal : simplify kernel arguments using a struct (#3229) (#12194) BB-fat 2025-03-07 15:35:57 +08:00
  • f1648e91cf
    HIP: fix rocWMMA build flags under Windows (#12230) David Huang 2025-03-07 15:06:08 +08:00
  • d6c95b0740
    metal : fix default.metallib build (#12224) Daniel Bevenius 2025-03-07 06:23:16 +01:00
  • d76a86d967
    opencl: Noncontiguous norm, rms_norm, disable fp16 for some ops (#12217) lhez 2025-03-06 16:20:35 -08:00
  • 776f9e59cc
    cmake : fix undefined reference errors for std::filesystem in ggml (#12092) (#12094) xiaofei 2025-03-07 06:58:25 +08:00