Commit graph

  • 037259be68
    llama : make load error reporting more granular (#5477) Aarni Koskela 2024-02-13 15:24:50 +02:00
  • 263978904c
    finetune : rename feed-forward tensors (w1/w2/w3) (#4839) Daniel Bevenius 2024-02-13 14:15:42 +01:00
  • cf45252a7c
    tests : multi-thread the tokenizer tests (#5474) Georgi Gerganov 2024-02-13 15:14:22 +02:00
  • 03bf161eb6
    llama : support batched embeddings (#5466) Douglas Hanley 2024-02-13 06:06:58 -06:00
  • ad014bba97
    make: add error message for bad CUDA version (#5444) Johannes Gäßler 2024-02-13 12:38:37 +01:00
  • 49cc1f7d67
    bert : add tests + fix quantization (#5475) Georgi Gerganov 2024-02-13 13:01:29 +02:00
  • 99b8b43d7b
    tests : disable moe test (#5473) Georgi Gerganov 2024-02-13 11:20:24 +02:00
  • 895407f31b
    ggml-quants : fix compiler warnings (shadow variable) (#5472) Kawrakow 2024-02-13 09:07:57 +02:00
  • 099afc6274
    llama : fix quantization when tensors are missing (#5423) Georgi Gerganov 2024-02-12 20:14:39 +02:00
  • df334a1125
    swift : package no longer uses ggml dependency (#5465) Georgi Gerganov 2024-02-12 19:54:29 +02:00
  • dbd8828eb0
    py : fix persimmon n_rot conversion (#5460) Lee 2024-02-13 01:29:57 +08:00
  • 43fe07c1a4
    ggml-sycl: Replace 3d ops with macro (#5458) Abhilash Majumder 2024-02-12 20:22:05 +05:30
  • 4a46d2b792
    llava : remove prog parameter from ArgumentParser (#5457) Daniel Bevenius 2024-02-12 09:38:44 +01:00
  • 3b169441df
    sync : ggml (#5452) Georgi Gerganov 2024-02-12 09:16:06 +02:00
  • 3bdc4cd0f5
    CUDA: mul_mat_vec_q tiling, refactor mul mat logic (#5434) Johannes Gäßler 2024-02-11 19:08:39 +01:00
  • 2891c8aa9a
    Add support for BERT embedding models (#5423) Douglas Hanley 2024-02-11 10:21:38 -06:00
  • 97a336507e
    flake.lock: Update github-actions[bot] 2024-02-11 00:17:31 +00:00
  • c88c74f967
    vulkan: only use M-sized matmul on Apple GPUs (#5412) Sergio López 2024-02-11 15:12:00 +01:00
  • a803333a4e
    common : use enums for sampler types (#5418) Alexey Parfenov 2024-02-11 13:43:31 +00:00
  • 684780141a
    server : allow specifying tokens as strings in logit_bias (#5003) Alexey Parfenov 2024-02-11 13:38:14 +00:00
  • 85910c5b30
    main : print timing on ctrl+C in non-interactive mode (#3873) Georgi Gerganov 2024-02-11 15:35:50 +02:00
  • 139b62a839
    common : fix compile warning Georgi Gerganov 2024-02-11 15:33:43 +02:00
  • 0f2411f154
    ggml : fix compile warnings (unused vars) (#4966) Georgi Gerganov 2024-02-11 15:33:01 +02:00
  • a07d0fee1f
    ggml : add mmla kernels for quantized GEMM (#4966) snadampal 2024-02-11 07:22:33 -06:00
  • e4640d8fdf
    lookup: add print for drafting performance (#5450) Johannes Gäßler 2024-02-11 12:44:51 +01:00
  • 907e08c110
    server : add llama2 chat template (#5425) Xuan Son Nguyen 2024-02-11 11:16:22 +01:00
  • f026f8120f
    metal : use autoreleasepool to avoid memory leaks (#5437) Ian Bull 2024-02-10 02:53:28 -08:00
  • cd9aea63b5
    scripts : update sync scripts with new backends Georgi Gerganov 2024-02-10 09:53:05 +02:00
  • 43b65f5eb8
    sync : ggml Georgi Gerganov 2024-02-10 09:30:36 +02:00
  • 4633d93af0
    ggml : add abort_callback for cpu backend (ggml/725) Michael Podvitskiy 2024-02-09 10:42:27 +01:00
  • 4b7b38bef5
    vulkan: Set limit for task concurrency (#5427) Neuman Vong 2024-02-10 05:30:19 +11:00
  • e00d2a62dd
    llava : add requirements.txt and update README.md (#5428) Daniel Bevenius 2024-02-09 14:00:59 +01:00
  • 7c777fcd5d
    server : fix prompt caching for repeated prompts (#5420) Riley Stewart 2024-02-09 02:49:49 -08:00
  • e5ca3937c6
    llama : do not cap thread count when MoE on CPU (#5419) Paul Tsochantaris 2024-02-09 10:48:06 +00:00
  • e4124c2477
    readme : add JavaScript/Wasm repo (#5415) Marko Tasic 2024-02-09 11:17:00 +01:00
  • b2f87cb64d
    ggml : fix error C2078: too many initializers for MSVC ARM64 (#5404) Michael Podvitskiy 2024-02-09 10:56:43 +01:00
  • 44fbe34360
    Fix Vulkan crash on APUs with very little device memory (#5424) 0cc4m 2024-02-09 06:52:33 +01:00
  • 8e6a9d2de0
    CUDA: more warps for mmvq on NVIDIA (#5394) Johannes Gäßler 2024-02-08 21:56:40 +01:00
  • 41f308f58e
    llama : do not print "offloading layers" message in CPU-only builds (#5416) slaren 2024-02-08 21:33:03 +01:00
  • 6e99f2a04f
    Fix f16_sycl cpy call from Arc (#5411) Abhilash Majumder 2024-02-08 22:39:10 +05:30
  • ff4ff05c5f
    llava : add missing .py, and fix paths in README.md (#5414) Daniel Bevenius 2024-02-08 15:20:03 +01:00
  • b7b74cef36
    fix trailing whitespace (#5407) Johannes Gäßler 2024-02-08 11:36:54 +01:00
  • 4aa43fab56
    llama : fix MiniCPM (#5392) runfuture 2024-02-08 18:36:19 +08:00
  • a6e514a85f
    llava: fix typo/formatting in README.md (#5405) Daniel Bevenius 2024-02-08 09:58:19 +01:00
  • 26d4efd11e
    sampling: fix top_k <= 0 (#5388) Johannes Gäßler 2024-02-08 09:46:30 +01:00
  • 8504d2d0da
    tests : .gitignore obj files Georgi Gerganov 2024-02-08 09:46:47 +02:00
  • c4fbb6717c
    CMAKE_OSX_ARCHITECTURES for macOS cross compilation (#5393) Michael Podvitskiy 2024-02-07 22:39:23 +01:00
  • 8c933b70c2
    fix typo in readme (#5399) Ebey Abraham 2024-02-07 21:11:30 +00:00
  • b906596bb7
    Add Ava to the list of llama.cpp UIs (#4362) Kamil Tomšík 2024-02-07 19:44:52 +01:00
  • aa7ab99be2
    CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (#5386) Johannes Gäßler 2024-02-07 12:40:26 +01:00
  • 10afa6f1d1
    [SYCL] update install guide for building with make via w64devkit (#5297) Neo Zhang Jianyu 2024-02-07 18:16:55 +08:00
  • 0ef46da632
    llava-cli : always tokenize special tokens (#5382) Xiao-Yong Jin 2024-02-07 02:17:25 -06:00
  • ee1628bdfe
    Basic Vulkan Multi-GPU implementation (#5321) 0cc4m 2024-02-07 07:54:50 +01:00
  • ed0bf32290
    readme : modernize (#5379) Eve 2024-02-07 06:21:30 +00:00
  • 9a697d842b
    readme : update ui list (#5354) Ben Williams 2024-02-06 22:16:48 -08:00
  • 316c7faf77
    llama : add MiniCPM support (#5346) runfuture 2024-02-07 14:15:56 +08:00
  • f3e2b4fa3f
    server : update /props with "total_slots" value (#5373) Justin Parker 2024-02-07 01:15:19 -05:00
  • f68664ac24
    convert : fix TypeError on GPT-2 vocab.json (#5288) Sang-Kil Park 2024-02-07 13:28:00 +09:00
  • 213d1439fa
    server : remove model.json endpoint (#5371) Alexey Parfenov 2024-02-06 18:08:38 +00:00
  • 17c97fb062
    CUDA: mul_mat_vec_q max. batch size 8 -> 4 (#5370) Johannes Gäßler 2024-02-06 18:43:06 +01:00
  • b08f22c882
    Update README.md (#5366) Kawrakow 2024-02-06 19:00:16 +02:00
  • f57fadc009
    Slight quantization improvement for Q4_K and Q5_K (#5361) Kawrakow 2024-02-06 17:28:02 +02:00
  • 2e9c0bd6b3
    readme : add phi, orion 14b, internlm2, and yi-VL to readme (#5362) BarfingLemurs 2024-02-06 09:06:48 -05:00
  • 2c516611f1
    CUDA: mul_mat_vec_q for batch sizes > 1 (#5351) Johannes Gäßler 2024-02-06 14:44:06 +01:00
  • 8a79c591de
    server : include total "num_slots" in props endpoint (#5349) Justin Parker 2024-02-06 04:20:59 -05:00
  • 31e7903221
    server : add dynatemp_range and dynatemp_exponent (#5352) Michael Coppola 2024-02-06 04:20:00 -05:00
  • 4ffc7a17d4
    server : various fixes for the prompt field in /completion (#5300) Niall Coates 2024-02-06 08:16:23 +00:00
  • 906cff55c2
    py : handle byte tokens in get_token_type (#5341) Georgi Gerganov 2024-02-06 07:47:22 +02:00
  • 098f6d737b
    make: Use ccache for faster compilation (#5318) Johannes Gäßler 2024-02-05 19:33:00 +01:00
  • 78b00dda6c
    README: updated introduction (#5343) Johannes Gäßler 2024-02-05 15:55:10 +01:00
  • c6b395535a
    ggml : make use of ggml-quants.h possible in C++ code (#5338) Kawrakow 2024-02-05 14:09:47 +02:00
  • abb61944a5
    ggml : avoid duplicating function calls using MIN/MAX macros (#5325) Dr. Tom Murphy VII Ph.D 2024-02-05 06:13:57 -05:00
  • 89503dcb5f
    iq3_xxs: guards for the no-imatrix situation (#5334) Kawrakow 2024-02-05 12:32:27 +02:00
  • 7e1ae372f3
    py : fix internlm2-hf convert to gguf (#5305) Guoteng 2024-02-05 17:04:06 +08:00
  • 6fdfa2ecc6
    iq2_xxs: tune quantization (#5320) Kawrakow 2024-02-05 10:46:06 +02:00
  • a2d60c9158
    server : allow retrieving default generation settings for completion (#5307) Alexey Parfenov 2024-02-05 08:10:22 +00:00
  • e6f8177532
    common : add dynamic temperature parameters to main example cli (#5295) l3utterfly 2024-02-05 17:00:47 +09:00
  • 30679d438d
    scripts : fix typos, cleanup (#5303) Georgi Gerganov 2024-02-05 09:48:03 +02:00
  • 4be04c8965
    scripts : add non-interactive server-llm.sh (#5303) Нияз Гарифзянов 2024-02-05 10:43:57 +03:00
  • 5d55b0cd82
    readme : add CodeShell models to the supported models list (#5330) chiranko 2024-02-05 15:41:38 +08:00
  • 4833ac209d
    [SYCL] Fix cpy with dims of 3 (#5289) AidanBeltonS 2024-02-05 07:08:24 +00:00
  • 9392ebd49e
    flake.lock: Update github-actions[bot] 2024-02-04 00:17:24 +00:00
  • 5ed26e1fc9
    Adding some imatrix tools (#5302) Kawrakow 2024-02-04 10:39:58 +02:00
  • 277fad30c6
    cmake : use set() for LLAMA_WIN_VER (#5298) Welby Seely 2024-02-03 23:18:51 -05:00
  • 3c0d25c475
    make: add nvcc info print (#5310) Johannes Gäßler 2024-02-03 20:15:13 +01:00
  • 3cc5ed353c
    make: fix nvcc optimization flags for host code (#5309) Johannes Gäßler 2024-02-03 20:14:59 +01:00
  • 60ecf099ed
    add Vulkan support to Nix flake Martin Schwaighofer 2024-01-28 12:59:43 +01:00
  • e920ed393d
    Vulkan Intel Fixes, Optimizations and Debugging Flags (#5301) 0cc4m 2024-02-03 18:15:00 +01:00
  • 52bb63c708
    refactor : switch to emplace_back to avoid extra object (#5291) Michael Klimenko 2024-02-03 12:23:37 +01:00
  • 1ec3332ade
    YaRN : store rope scaling type as int32_t in memory (#5285) Jared Van Bortel 2024-02-03 06:22:06 -05:00
  • 6a66c5071a
    readme : add tenere in the ui tools list (#5284) BADR 2024-02-03 12:20:26 +01:00
  • a305dba8ff
    Fix im2col with fp32 (#5286) AidanBeltonS 2024-02-03 08:11:37 +00:00
  • 191221178f
    perplexity : fix KL divergence calculations on Windows (#5273) kalomaze 2024-02-02 08:15:30 -06:00
  • e437b37fd0
    scripts : parse wtype in server-llm.sh (#5167) Georgi Gerganov 2024-02-02 14:23:40 +02:00
  • 2d40085c26
    py : add check for '.attn.masked_bias' layers to GPT2model (#5281) Mirror Azure 2024-02-02 14:39:09 +03:00
  • b05102fe8c
    Tidy ggml-sycl (#5261) AidanBeltonS 2024-02-02 08:39:48 +00:00
  • 6b91b1e0a9
    docker : add build for SYCL, Vulkan + update readme (#5228) Xuan Son Nguyen 2024-02-02 08:56:31 +01:00
  • e805f0fa99
    [SYCL] get MAX_MEM_ALLOC from device property (#5270) Meng, Hengyu 2024-02-02 15:54:14 +08:00
  • af3ba5d946
    [SYCL] update guide of SYCL backend (#5254) Neo Zhang Jianyu 2024-02-02 15:53:27 +08:00
  • e1e721094d
    llama : fix memory leak in llama_batch_free (#5252) Ian Bull 2024-02-01 23:20:13 -08:00