llama.cpp

Author	SHA1	Message	Date
Diego Devesa	1d36b3670b	llama : move end-user examples to tools directory (#13249 ) * llama : move end-user examples to tools directory --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-05-02 20:27:13 +02:00
Georgi Gerganov	b34443923c	sync : ggml (#13268 ) * vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204) * vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW) * review: remove src_x/y < 0 checks; add performance tests * sync : ggml ggml-ci * vulkan : fix lint (#0) --------- Co-authored-by: Acly <aclysia@gmail.com>	2025-05-02 20:54:30 +03:00
Georgi Gerganov	a75cb30dc9	context : fix reorder logic (#13267 ) ggml-ci	2025-05-02 20:54:13 +03:00
shalinib-ibm	3f3769ba76	ggml : Enable MMA for BF16 in llamafile_sgemm (#13148 ) This patch upstreams llamafile's cpu matrix multiplication kernels for ppc64le using MMA builtins for the BF16 data type. This change results in 9x - 40x gains in total speed S t/s (i.e. all tokens/total time) across various batch sizes tested using the llama-batched-bench benchmark. The patch is tested with Meta-Llama-3-8B and Mistral-7B models (BF16 models generated by using llama-quantize from corresponding FP32 models) on an IBM POWER10 machine. Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>	2025-05-02 19:53:12 +03:00
Jared Van Bortel	2f567611c0	llama-model : support Qwen2 embedding models and pooling_mode_lasttoken (#13245 )	2025-05-02 11:42:30 -04:00
Jared Van Bortel	7d2123484e	convert : use correct context length for nomic-embed-text-v2 (#13216 )	2025-05-02 11:41:54 -04:00
Xuan-Son Nguyen	074e42ab31	convert : converting mmproj for Qwen2/2.5VL from convert_hf_to_gguf (#13209 ) * wip * qwen2.5vl ok * vision: fix models missing "text_config" * add test * fix test repo name * fix 32B model * Revert "fix 32B model" This reverts commit 651752f1ae25fe8a01c1e57c18cf2eca80b2774e. * clarify about 32B * rm qwen surgery script * update llava/readme * move V_ENC_EMBD_PATCH handling to Qwen2VLVisionModel	2025-05-02 17:17:15 +02:00
Georgi Gerganov	c642bc014c	kv-cache : separate recurrent vs non-recurrent impl (#12799 ) * kv-cache : separate recurrent vs non-recurrent impl (wip) ggml-ci * kv-cache : init -> constructor + add llama_memory_params ggml-ci * kv-cache : fix callback reference ggml-ci * context : llama_kv_cache -> llama_memory_i ggml-ci * context : move memory creation logic to model ggml-ci * llama : remove reference of memory during encode ggml-ci * kv-cache : hide padding details in the implementation ggml-ci * kv-cache : add ubatch_next() ggml-ci * context : simplify sbatch logic ggml-ci * kv-cache : hide defrag logic in the implementation ggml-ci * context : hide kv cache details in implementation ggml-ci * build : fix ggml-ci * cont : another fix ggml-ci * kv-cache : simplify interface (wip) ggml-ci * kv-cache : use separate KV cell structs for unified/recurrent ggml-ci * kv-cache : clean-up ggml-ci * model : better llama_model::create_model() signature ggml-ci * kv-cache : fix recurrent seq_rm() ggml-ci * kv-cache : replace `struct callbacks` with `llama_model &` ggml-ci * kv-cache : replace `struct graph_params` with `llama_context &` ggml-ci * kv-cache : fix offload check ggml-ci * context : avoid passing unique_ptr ggml-ci * kv-cache : avoid using the backends from the llama_context ref #13113 ggml-ci * kv-cache : more consistent debug logs [no ci] * kv-cache : do not pass the full llama_context for kv graphs ggml-ci * kv-cache : remove comment * kv-cache : ggml_rope_ext_inplace -> ggml_rope_ext ggml-ci * kv-cache : fix recurrent multi-user case ggml-ci * memory : remove comments [no ci]	2025-05-02 17:48:36 +03:00
Sigbjørn Skjæret	cb06a3c363	llama : orion rope type is neox (#13261 )	2025-05-02 12:44:24 +02:00
Sigbjørn Skjæret	626083faf7	llama : plamo rope type is neox (#13260 )	2025-05-02 12:40:56 +02:00
piDack	2af6880178	llama-chat : reset glmedge chat template (#13253 ) * reset glmedge chat template * fix glmedge chat template	2025-05-02 11:06:09 +02:00
Shakil Ahmed	e84773ab60	mtmd-cli : fix out_of_range when input image path is empty (#13244 ) * fix out_of_range error to keep the chat loop running * Update examples/llava/mtmd-cli.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * mtmd-cli : load image right away * add a new line for readability * rm printf * Update examples/llava/mtmd-cli.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update examples/llava/mtmd-cli.cpp --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-05-02 10:20:27 +02:00
Georgi Gerganov	fab647e884	server : add cache reuse card link to help (#13230 ) * server : add cache reuse card link to help * args : use short url	2025-05-02 09:48:31 +03:00
Xuan-Son Nguyen	dcf886007d	convert : explicitly disable trust_remote_code for AutoConfig (#13246 )	2025-05-02 08:45:10 +02:00
bandoti	d24d592808	ci: fix cross-compile sync issues (#12804 )	2025-05-01 19:06:39 -03:00
Justin Santa Barbara	8efbdadc61	rpc : avoid uninitialized memory in serialize_tensor (#13210 ) Zero out the name and padding buffers.	2025-05-01 23:32:11 +02:00
Jesse Gross	f057808ffa	ggml: Don't assert fail when tensor data changes (#13222 ) The following scenario will cause an assertion failure in the graph allocator: - Build and allocate a graph containing a tensor with a non-NULL data pointer - Build and allocate a new graph where that data is NULL Result: ggml-alloc.c:819: GGML_ASSERT(talloc->buffer_id >= 0) failed This happens during revalidation because we think that memory should have been previously allocated based on the current graph but in reality the previous graph was different. In this situation, we should do a full reallocation pass.	2025-05-01 22:46:10 +02:00
Diego Devesa	d7a14c42a1	build : fix build info on windows (#13239 ) * build : fix build info on windows * fix cuda host compiler msg	2025-05-01 21:48:08 +02:00
Loïc Carrère	b6e4ff69b8	clip : (minicpmv) Re-enable upscaling of images smaller than the CLIP image size (#13237 )	2025-05-01 21:32:21 +02:00
matteo	e0f572c846	llama-chat : update GLM4 chat template (#13238 ) * update GLM4 chat template * Update chat template Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-05-01 21:16:38 +02:00
Jeff Bolz	79f26e9e12	vulkan: Add bfloat16 support (#12554 ) * vulkan: Add bfloat16 support This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16. The extension is required for coopmat multiply support, but matrix-vector multiply trivially promotes bf16 to fp32 and doesn't require the extension. The copy/get_rows shaders also don't require the extension. It's probably possible to fall back to non-coopmat and promote to fp32 when the extension isn't supported, but this change doesn't do that. The coopmat support also requires a glslc that supports the extension, which currently requires a custom build. * vulkan: Support bf16 tensors without the bf16 extension or coopmat support Compile a variant of the scalar mul_mm shader that will promote the bf16 values to float, and use that when either the bf16 extension or the coopmat extensions aren't available. * vulkan: bfloat16 fixes (really works without bfloat16 support now) * vulkan: fix spirv-val failure and reenable -O	2025-05-01 20:49:39 +02:00
Jeff Bolz	fc727bcdd5	vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader (#13191 ) * vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader	2025-05-01 20:19:31 +02:00
Johannes Gäßler	b0ecbd434b	test: non-cont. b in test-backend-ops -o MUL_MAT (#13187 )	2025-05-01 20:18:56 +02:00
Georgi Gerganov	b1dd4d08e8	sync : ggml ggml-ci	2025-05-01 20:15:34 +03:00
Daniel Bevenius	99881f77d8	whisper : add check that target name exists (whisper/3103) This commit adds a check to make sure that the target exists before trying to add compile options to ignore warnings when using MSVC. The motivation for this is that the build is currently broken depending on the cmake options provided. With this fix it should be possible to build even if the targets are not actually available. Refs: https://github.com/ggml-org/whisper.cpp/pull/3090#issuecomment-2842760104	2025-05-01 20:15:34 +03:00
Daniel Bevenius	b5769d92b4	ggml : suppress Windows compiler warnings (whisper/3075) * whisper: suppress Windows compiler warnings This commit disables compiler warnings on Windows using MSVC. The motivation for these changes is that some compilers, for example Windows MSVC, generate warnings for these conversions, and there are quite a few of them. This makes it a little difficult to spot new warnings that may be introduced, and it can also be difficult for users/embedders of ggml, where these warnings are hard to separate from their own warnings. * squash! whisper: suppress Windows compiler warnings Move ggml related warnings into ggml. This commit also fixes the indentation and adds a missing whitespace to the if statement.	2025-05-01 20:15:34 +03:00
Xuan-Son Nguyen	8936784f7a	mtmd : add vision support for Mistral Small 3.1 (#13231 ) * convert ok * load ok, missing patch merger * ah sheet it works * update llava/readme * add test * fix test	2025-05-01 17:05:42 +02:00
Xuan-Son Nguyen	13c9a3319b	arg : remove CURLINFO_EFFECTIVE_METHOD (#13228 )	2025-05-01 10:23:25 +02:00
Jared Van Bortel	a70183eb00	llama-model : fix the reported size class for nomic-embed-text-v2-moe (#13223 )	2025-05-01 10:09:41 +03:00
Georgi Gerganov	8d33d740c3	sync : ggml	2025-05-01 10:00:39 +03:00
Diego Devesa	4254bb4951	ggml : fix ggml_gallocr_ptr type (ggml/1205)	2025-05-01 09:58:44 +03:00
Georgi Gerganov	9998540149	cuda : fix unused variable compile warning (whisper/0) ggml-ci	2025-05-01 09:58:44 +03:00
Johannes Gäßler	e1e8e0991f	CUDA: batched+noncont MMQ, refactor bs>1 MoE code (#13199 )	2025-04-30 23:12:59 +02:00
Xuan-Son Nguyen	6f67cf1f48	arg : -hf do not fail if url mismatch (#13219 ) * arg : -hf do not fail if url mismatch * do not return if cannot parse metadata json	2025-04-30 21:29:15 +01:00
ddh0	16a457facd	fix typo: `n_ctx_pre_seq` -> `n_ctx_per_seq` (#13221 )	2025-04-30 21:28:43 +01:00
Xuan-Son Nguyen	3e168bede4	convert : improve model arch handling (#13122 ) * convert : improve model arch handling * use AutoConfig * rm trust_remote_code * Update convert_hf_to_gguf.py * fix self.block_count for vision * fix NomicBertModel	2025-04-30 16:56:24 +02:00
Tatsuya Tanaka	ceda28ef8e	llava : remove duplicate include (#13207 )	2025-04-30 15:25:20 +02:00
Olivier Chafik	3b127c7385	common : add -jf / --json-schema-file flag (#12011 )	2025-04-30 14:52:35 +02:00
Jeff Bolz	e5007a5edf	vulkan: use uint array index to avoid glslang bug (#13193 )	2025-04-30 14:38:37 +02:00
shalinib-ibm	416313773b	ggml : fix ppc64le build (#13176 ) Build fails with compilation error on power pc. This patch fixes the same. Tested with unit tests run via --build <build_dir> && cd <build_dir> && make test Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>	2025-04-30 13:17:08 +02:00
Xuan-Son Nguyen	07c2e2f76c	convert : correct typo image_mean --> image_std (#13208 )	2025-04-30 13:06:15 +02:00
Aaron Teo	44cd8d91ff	feat(ggml-cpu): enable z17 compile (#13182 ) z17 compilation requires GCC 15.1.0 and onwards Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-04-30 10:47:35 +01:00
Xuan-Son Nguyen	5933e6fdc9	arg : allow using -hf offline (#13202 ) * arg : allow using -hf offline * add more comments in code [no ci]	2025-04-30 10:46:32 +02:00
Xuan-Son Nguyen	da84c04d8f	docker : do not build tests (#13204 ) * docker : do not build tests * include "ggml-cpu.h"	2025-04-30 10:44:07 +02:00
xiaofei	a0f7016d17	rpc : fix cache directory initialization (#13188 ) Signed-off-by: xiaofei <hbuxiaofei@gmail.com>	2025-04-30 09:29:22 +03:00
Johannes Gäßler	19e899ce21	scripts: n_depth for compare-llama-bench [no ci] (#13201 )	2025-04-29 23:32:04 +02:00
matteo	e2e1ddb93a	server : Prefilling assistant message in openai compatible API (#13174 ) * Prefilling assistant message in openai compatible API * fixed indentation * fixed code convention * simplify method usage * no more than one assistant message at end of messages * merge checks into prefill code * Update examples/server/utils.hpp --------- Co-authored-by: matteo <matteo@naspc.lan> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-04-29 20:33:10 +02:00
Georgi Gerganov	d9d398f84f	sampling : when top-k <= 0 -> noop (#13173 ) ggml-ci	2025-04-29 20:22:57 +03:00
Alberto Cabrera Pérez	5a63980117	llama-bench: fixed size of fields to correctly map to values (#13183 )	2025-04-29 17:24:36 +02:00
Johannes Gäßler	cdf76586b2	CUDA: fix non-cont. inputs for batched mat mul (#13155 )	2025-04-29 16:00:27 +02:00

5269 commits