Commit graph

  • 31ec3993f6
    ggml : add GGML_CUDA_USE_GRAPHS option, restore GGML_CUDA_FORCE_CUBLAS (cmake) (#8140) slaren 2024-06-26 21:34:14 +02:00
  • c7ab7b612c
    make : fix missing -O3 (#8143) slaren 2024-06-26 20:20:22 +02:00
  • f2d48fffde
    sync : ggml Georgi Gerganov 2024-06-26 19:39:19 +03:00
  • 4713bf3093
    authors : regen Georgi Gerganov 2024-06-26 19:36:44 +03:00
  • 0e814dfc42
    devops : remove clblast + LLAMA_CUDA -> GGML_CUDA (#8139) Georgi Gerganov 2024-06-26 19:32:07 +03:00
  • a95631ee97
    readme : update API notes Georgi Gerganov 2024-06-26 19:26:13 +03:00
  • f3f65429c4
    llama : reorganize source code + improve CMake (#8006) Georgi Gerganov 2024-06-26 18:33:02 +03:00
  • 8854044561
    Clarify default MMQ for CUDA and LLAMA_CUDA_FORCE_MMQ flag (#8115) Isaac McFadyen 2024-06-26 02:29:28 -04:00
  • c8771ab5f8
    CUDA: fix misaligned shared memory read (#8123) Johannes Gäßler 2024-06-26 08:28:02 +02:00
  • 494165f3b6
    llama : extend llm_build_ffn() to support _scale tensors (#8103) Eddie-Wang 2024-06-26 14:27:46 +08:00
  • 9b2f16f805
    json: better support for "type" unions (e.g. nullable arrays w/ typed items) (#7863) Olivier Chafik 2024-06-26 01:46:35 +01:00
  • 6777c544bd
    json: fix additionalProperties, allow space after enum/const (#7840) Olivier Chafik 2024-06-26 01:45:58 +01:00
  • 163d50adaf
    fixes #7999 (adds control vectors to all build_XXX() functions in llama.cpp) [needs testing] (#8060) jukofyork 2024-06-25 21:47:40 +01:00
  • 6fcbf68235
    llama : implement Unigram tokenizer needed by T5 and FLAN-T5 model families (#5763) fairydreaming 2024-06-25 21:14:35 +02:00
  • e6bf007744
    llama : return nullptr from llama_grammar_init (#8093) Daniel Bevenius 2024-06-25 21:07:28 +02:00
  • 84631fe150
    json: support integer minimum, maximum, exclusiveMinimum, exclusiveMaximum (#7797) Olivier Chafik 2024-06-25 20:06:20 +01:00
  • dd047b476c
    disable docker CI on pull requests (#8110) slaren 2024-06-25 19:20:06 +02:00
  • 925c30956d
    Add healthchecks to llama-server containers (#8081) joecryptotoo 2024-06-25 08:13:27 -07:00
  • c8ad35955a
    gguf-dump : show start data offset via --data-offset, plus some extra refactoring (#8054) Brian 2024-06-25 22:03:25 +10:00
  • 49c03c79cd
    cvector: better prompt handling, add "mean vector" method (#8069) Xuan Son Nguyen 2024-06-25 13:59:54 +02:00
  • 48e6b92cc3
    Add chat template support for llama-cli (#8068) Xuan Son Nguyen 2024-06-25 13:56:49 +02:00
  • 3791ad2193
    SimpleChat v3.1: Boolean chat request options in Settings UI, cache_prompt (#7950) HanishKVC 2024-06-25 16:57:35 +05:30
  • f702a90e24
    Update control vector help (#8104) HatsuneMikuUwU33 2024-06-25 10:44:48 +02:00
  • 083bacce14
    [SYCL] Re-enabled mul_mat_batched_sycl (#8095) Meng, Hengyu 2024-06-25 10:19:20 +08:00
  • 2df373ac40
    CUDA: fix matrix multiplication algorithm choice (#8102) Johannes Gäßler 2024-06-25 01:22:33 +02:00
  • 3b099bcd9c
    CUDA: fix MMQ writeback for int8 tensor cores (#8100) Johannes Gäßler 2024-06-24 22:15:33 +02:00
  • a818f3028d
    CUDA: use MMQ instead of cuBLAS by default (#8075) Johannes Gäßler 2024-06-24 17:43:42 +02:00
  • d62e4aaa02
    gguf-py : fix tensor groups for encoder-decoder models in gguf-dump.py (#8090) fairydreaming 2024-06-24 14:13:39 +02:00
  • 9a590c8226
    CUDA: optimize MMQ int8 tensor core performance (#8062) Johannes Gäßler 2024-06-24 12:41:23 +02:00
  • 52fc8705a0
    Option to split during conversion (#6942) Christian Zhou-Zheng 2024-06-24 05:42:03 -04:00
  • 8cb508d0d5
    disable publishing the full-rocm docker image (#8083) slaren 2024-06-24 07:36:11 +02:00
  • 646ef4a9cf
    embedding : more cli arguments (#7458) Yann Follet 2024-06-24 13:30:24 +08:00
  • de0d6a68ac
    gguf-py, convert-hf : model conversion support for T5 and FLAN-T5 model variants (#5763) fairydreaming 2024-06-24 07:06:05 +02:00
  • 95f57bb5d5
    ggml : remove ggml_task_type and GGML_PERF (#8017) slaren 2024-06-24 03:07:59 +02:00
  • e112b610a1
    llama : add support for BitnetForCausalLM (#7931) Eddie-Wang 2024-06-24 02:27:57 +08:00
  • 6a2f298bd7
    server : fix "JSON-Scheme" typo (#7975) Aarni Koskela 2024-06-23 18:03:08 +03:00
  • 11318d9aa1
    Fix typo in llama_set_embeddings comment (#8077) Daniel Bevenius 2024-06-23 15:39:45 +02:00
  • b6b9a8e606
    fix CI failures (#8066) slaren 2024-06-23 13:14:45 +02:00
  • 45c0e2e4c1
    Refactor Vulkan backend to allow multiple contexts (#7961) 0cc4m 2024-06-23 10:21:25 +02:00
  • b5a5f34efa
    Removing extra blank lines that were breaking Lint (#8067) Clint Herron 2024-06-22 14:28:18 -04:00
  • 3e58b0ee35
    cvector: fix CI + correct help message (#8064) Xuan Son Nguyen 2024-06-22 18:11:30 +02:00
  • adf480c3ab
    cvector-generator: Moe Moe Fixie-Fixie for Lots of Formats~! ♡(ᐢ ᴥ ᐢ)♡ (#8052) HatsuneMikuUwU33 2024-06-22 17:19:37 +02:00
  • 3aa184a8c7
    convert-hf : change assert to exception (#8015) 0xspringtime 2024-06-22 09:37:41 -04:00
  • 5b48cd53a8
    Update llama-quantize ppl/file size output from LLaMA-v1 to Llama-3 values (#8058) ddh0 2024-06-22 07:16:10 -06:00
  • c5a8d4b749
    JSON Schema to GBNF integration tests (#7790) Clint Herron 2024-06-21 23:18:36 -04:00
  • 557b653dc9
    vulkan: detect multiple devices by deviceUUID instead of deviceID (#8022) k.h.lai 2024-06-21 16:28:20 +08:00
  • 7d5e8777ae
    ggml : AVX IQ quants (#7845) Eve 2024-06-21 05:57:36 +00:00
  • a927b0f3dd
    llama : optimize long word tokenization with WPM (#8034) Georgi Gerganov 2024-06-21 08:51:28 +03:00
  • 80ea089d77
    llama : allow pooled embeddings on any model (#7477) Douglas Hanley 2024-06-21 00:38:22 -05:00
  • 0e64591e82
    swiftui : enable stream updating (#7754) Shuichi Tsutsumi 2024-06-21 14:30:58 +09:00
  • b1ef562bc1
    requirements : Bump torch and numpy for python3.12 (#8041) Hamdoud Hakem 2024-06-20 21:01:15 +01:00
  • 17b291a6a5
    convert-hf : Fix the encoding in convert-hf-to-gguf-update.py (#8040) Hamdoud Hakem 2024-06-20 20:59:59 +01:00
  • abd894ad96
    common: fix warning (#8036) Johannes Gäßler 2024-06-20 16:40:13 +02:00
  • de391e4c80
    [SYCL] Fix windows build and inference (#8003) luoyu-intel 2024-06-20 13:19:05 +00:00
  • d50f8897a7
    CUDA: stream-k decomposition for MMQ (#8018) Johannes Gäßler 2024-06-20 14:39:21 +02:00
  • 2075a66a96
    metal : fix ggml_metal_supports_op for BF16 (#8021) Michael de Gans 2024-06-19 22:32:01 -07:00
  • ba58993152
    server : fix smart slot selection (#8020) sasha0552 2024-06-19 23:57:10 +00:00
  • a7854743c5
    un-ignore build-info.cmake and build-info.sh (#7996) Michael de Gans 2024-06-19 13:10:42 -07:00
  • 9c77ec1d74
    ggml : synchronize threads using barriers (#7993) slaren 2024-06-19 15:04:15 +02:00
  • a04a953cab
    codecov : remove (#8004) Georgi Gerganov 2024-06-19 13:04:36 +03:00
  • 623494a478
    [SYCL] refactor (#6408) Meng, Hengyu 2024-06-19 09:11:51 +08:00
  • 37bef89433
    tokenizer : BPE fixes (#7530) jaime-m-p 2024-06-18 18:40:52 +02:00
  • 91c188d6c2
    Only use FIM middle token if it exists (#7648) Sigbjørn Skjæret 2024-06-18 14:19:45 +02:00
  • 84f6de17f6
    Fix no gcc pragma on Windows (#7751) jojorne 2024-06-18 09:18:32 -03:00
  • 61665277af
    Allow compiling with CUDA without CUDA runtime installed (#7989) Ulrich Drepper 2024-06-18 14:00:14 +02:00
  • b96f9afb0d
    chore: clean useless beam search param (#7985) Frank Mai 2024-06-18 15:11:40 +08:00
  • 1193778105
    readme : update UI list (#7943) Abheek Gulati 2024-06-17 23:57:41 -07:00
  • 5326bcceeb
    ggml : sync Georgi Gerganov 2024-06-18 09:50:45 +03:00
  • e6ecc2be47
    whisper : use ggml_backend_sched (whisper/2239) Georgi Gerganov 2024-06-18 09:37:20 +03:00
  • a94e6ff877
    update: support Qwen2-57B-A14B (#7835) Ștefan-Gabriel Muscalu 2024-06-17 22:08:46 +03:00
  • 5b6da18750
    Make updates to type cast based on compiler instead of OS (#7851) Srihari-mcw 2024-06-17 23:53:17 +05:30
  • 7c26775adb
    llama : disable FA if KV head sizes do not match (#7982) Georgi Gerganov 2024-06-17 19:40:01 +03:00
  • b473e95084
    Add Nix and Flox install instructions (#7899) Bryan Honof 2024-06-17 17:37:55 +02:00
  • 99052cd227
    sched : offload_op also requires supports_op (#7977) slaren 2024-06-17 16:51:42 +02:00
  • c637fcd34d
    fix: divide-by-zero exception in mamba (#7932) Frank Mai 2024-06-17 22:11:08 +08:00
  • 6a2f0b3474
    Implement non-mapped async IO for CUDA on Windows (#7896) Markus Tavenrath 2024-06-17 16:10:15 +02:00
  • 21be9cab94
    rpc : fix load/store misaligned addresses (#7948) Georgi Gerganov 2024-06-17 11:09:20 +03:00
  • 006167aaf6
    gguf-dump.py: add --markdown dump output (#7853) Brian 2024-06-17 15:25:20 +10:00
  • df68d4fa5d
    [SYCL] Update README-sycl.md for the "Recommended release" and "News" chapters (#7946) Neo Zhang 2024-06-17 11:17:07 +08:00
  • 43b35e38ba
    Add support for sqrt on CUDA (#7953) Calvin Laurenson 2024-06-16 15:23:04 -07:00
  • 19b7a836f6
    cuda : fix bounds check for src0 rows in MMVQ kernel (whisper/2231) Georgi Gerganov 2024-06-11 17:39:01 +03:00
  • b5fcf8ef5c
    ggml : fix and optimize ppc64le (ggml/849) Hong Bo PENG 2024-06-16 16:53:11 +08:00
  • 398105ff43
    ggml : remove duplicate include of ggml-common.h (ggml/853) Daniel Bevenius 2024-06-16 10:51:18 +02:00
  • bc6c457fa3
    flake.lock: Update (#7951) Georgi Gerganov 2024-06-16 19:16:21 +03:00
  • 52399254b3
    unicode : avoid char32_t (#7957) Georgi Gerganov 2024-06-16 14:51:40 +03:00
  • 6fe1c62741
    readme : update UI list [no ci] (#7958) hopkins385 2024-06-16 13:51:18 +02:00
  • cddaf028ad
    ggml : fix handling of zero blocks in IQ quants (#7955) Georgi Gerganov 2024-06-16 14:50:12 +03:00
  • c8a82194a8
    github : update pr template Georgi Gerganov 2024-06-16 10:46:51 +03:00
  • 7c7836d9d4
    Vulkan Shader Refactor, Memory Debugging Option (#7947) 0cc4m 2024-06-16 07:17:31 +02:00
  • 0c7b3595b9
    Add cvector-generator example (#7514) Xuan Son Nguyen 2024-06-15 18:53:40 +02:00
  • 7b2f4a7d19
    [SYCL] remove global variables (#7710) Meng, Hengyu 2024-06-15 14:05:10 +08:00
  • f8ec8877b7
    ci : fix macos x86 build (#7940) olexiyb 2024-06-14 20:28:34 +03:00
  • 76d66ee0be
    CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (#7921) Johannes Gäßler 2024-06-14 18:41:49 +02:00
  • 66ef1ceedf
    metal : utilize max shared memory for mul_mat_id (#7935) Georgi Gerganov 2024-06-14 17:14:09 +03:00
  • e65bbf606c
    llama-bench : fix RPC indication (#7936) Radoslav Gerganov 2024-06-14 16:47:41 +03:00
  • 6fcd1331ef
    llama : more checks before assuming FIM tokens (#7644) Sigbjørn Skjæret 2024-06-14 12:20:04 +02:00
  • 41b9260f18
    convert : add Poro-34B-chat tokenizer support (#7713) Elaine 2024-06-14 13:16:49 +03:00
  • 172c825684
    rpc : fix ggml_backend_rpc_supports_buft() (#7918) Radoslav Gerganov 2024-06-13 15:18:44 +03:00
  • a55eb1bf0f
    readme : Remove outdated instructions from README.md (#7914) [no ci] Galunid 2024-06-13 09:42:41 +02:00
  • f578b86b21
    move BLAS to a separate backend (#6210) slaren 2024-06-13 03:11:35 +02:00