Commit graph

  • 1d41d6f7c2
    nix: static build (#5814) hutli 2024-03-05 02:33:08 +01:00
  • 29ae62d2ae
    llama : fix embeddings (#5796) Georgi Gerganov 2024-03-04 22:31:20 +02:00
  • e0843afe1b
    flake : fix Georgi Gerganov 2024-03-04 21:50:50 +02:00
  • a1c6d96ed8
    ggml : fix unknown status (#0) Georgi Gerganov 2024-03-04 20:53:27 +02:00
  • efd8533ef8
    sync : ggml Georgi Gerganov 2024-03-04 11:06:39 +02:00
  • 9fa2627347
    ggml : introduce ggml_status (ggml/750) Michael Podvitskiy 2024-03-04 10:05:42 +01:00
  • fe52be11e3
    cmake : handle cases where git index is not found in .git (#5844) Dane Madsen 2024-03-05 05:26:55 +11:00
  • 6d341ab6c5
    speculative : implement stochastic speculative sampling (#5625) Minsoo Cheong 2024-03-05 03:24:00 +09:00
  • 4ffcdce2ff
    add alias for chat template (#5858) Xuan Son Nguyen 2024-03-04 12:22:08 +01:00
  • a0fc62661f
    sync : ggml Georgi Gerganov 2024-03-04 10:40:04 +02:00
  • 7d43c585dc
    add some new ops, fix some operators and add batch operations to certain operators. (ggml/747) leejet 2024-03-03 20:23:52 +08:00
  • 82f3e668ad
    common : use LLAMA_DEFAULT_SEED (#5855) DAN™ 2024-03-04 03:08:19 -05:00
  • 5a51cc1bb4
    main : support special tokens as reverse/anti prompt (#5847) DAN™ 2024-03-04 02:57:20 -05:00
  • 67be2ce101
    cuda : fix data race in soft max (#5853) slaren 2024-03-03 14:26:18 +01:00
  • 231ae28f07
    readme : add API changes section Georgi Gerganov 2024-03-03 12:44:03 +02:00
  • 475df1d6cf
    llama : allow for user specified embedding pooling type (#5849) Douglas Hanley 2024-03-03 04:40:27 -06:00
  • 87c2e8b279
    gguf-dump : support i-quants (#5841) Nindaleth 2024-03-03 09:43:42 +01:00
  • de9692a7d2
    llama : fix llama_copy_state_data with fragmented KV cache (#5840) compilade 2024-03-03 03:41:55 -05:00
  • e6029348e8
    ci : schedule slow server tests only on Release or on demand (#5839) Pierrick Hymbert 2024-03-03 09:35:23 +01:00
  • 8ef969afce
    server : init http requests thread pool with --parallel if set (#5836) Pierrick Hymbert 2024-03-03 08:48:36 +01:00
  • fa974646e1
    flake.lock: Update (#5842) Georgi Gerganov 2024-03-03 06:11:31 +02:00
  • 9731134296
    server: tests: passkey challenge / self-extend with context shift demo (#5832) Pierrick Hymbert 2024-03-02 22:00:14 +01:00
  • 4a6e2d6142
    llama : add abort_callback to interrupt computation (#5409) Michael Podvitskiy 2024-03-02 20:52:25 +01:00
  • 494c870326
    ggml : fix IQ3_S AVX implementation (#5834) Georgi Gerganov 2024-03-02 20:00:49 +02:00
  • 4d4d2366fc
    convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (#5821) Jared Van Bortel 2024-03-02 12:27:26 -05:00
  • c7a0ad8ec9
    convert-hf : make model class definitions self-contained (#5825) Jared Van Bortel 2024-03-02 12:21:47 -05:00
  • bbde6eb256
    ggml : IQ3_S improvements (#5829) Kawrakow 2024-03-02 17:00:51 +02:00
  • ef2cd694c4
    scripts : add pod-llama.sh Georgi Gerganov 2024-03-02 16:54:08 +02:00
  • 6c32d8c7ad
    llama : refactor internal quantization functions (#5830) Xuan Son Nguyen 2024-03-02 15:19:09 +01:00
  • 802da0091b
    llama : fix segfault from unknown model arch name (#5820) compilade 2024-03-02 08:42:56 -05:00
  • 715641391d
    Support multiple GPUs (split mode) on SYCL backend (#5806) Neo Zhang Jianyu 2024-03-02 19:49:30 +08:00
  • 9bf297a02b
    workflows : remove nocleanup arg for check-requirements.sh (#5826) crasm 2024-03-02 00:11:06 -05:00
  • cb5e8f7fc4
    build(nix): Introduce flake.formatter for nix fmt (#5687) Tushar 2024-03-02 04:48:26 +05:30
  • da3b9ba2b7
    convert-hf-to-gguf : require einops for InternLM2ForCausalLM (#5792) nold 2024-03-01 22:51:12 +01:00
  • c29af7e225
    llama : add StarCoder2 support (#5795) Sourab Mangrulkar 2024-03-02 01:00:46 +05:30
  • 38d16b1426
    server : remove api_like_OAI.py proxy script (#5808) Georgi Gerganov 2024-03-01 20:00:58 +02:00
  • c2224f003b
    ggml-vulkan: fix VULKAN_CHECK_RESULTS flag, which was previously broken (#5813) ddpasa 2024-03-01 18:00:00 +01:00
  • e743386728
    gemma : fix bfloat16 -> float16 conversion issue (#5810) kunal-vaishnavi 2024-03-01 06:08:08 -08:00
  • f49a535686
    common : fix flag --logits-all to --all-logits (#5805) Miwa / Ensan 2024-03-01 22:48:56 +09:00
  • 3ab8b3a92e
    llama : cleanup unused mmq flags (#5772) Pierrick Hymbert 2024-03-01 12:39:06 +01:00
  • 9600d59e01
    unicode : switch to multimap based nfd_map (#5799) Douglas Hanley 2024-03-01 03:15:36 -06:00
  • 5cb02b4a01
    server: allow to override threads server pool with --threads-http (#5794) Pierrick Hymbert 2024-03-01 10:08:08 +01:00
  • 6ea0f010ff
    ci : add Ubuntu 22 Vulkan CI run (#5789) Eve 2024-03-01 08:54:53 +00:00
  • f105471ef6
    server : fix newlines in help (#5785) Georgi Gerganov 2024-03-01 09:59:43 +02:00
  • 38d1521608
    [SYCL] Use batched mul_mat pathway (#5591) AidanBeltonS 2024-03-01 07:36:47 +00:00
  • 052051d8ae
    Server: normalize naming (#5779) Xuan Son Nguyen 2024-02-29 21:42:11 +01:00
  • d5ab29757e
    llama : constified llama_set_state_data's src (#5774) Marcus Dunn 2024-02-29 00:17:23 -08:00
  • 87c91c0766
    ci : reduce 3b ppl chunks to 1 to avoid timeout (#5771) Georgi Gerganov 2024-02-28 21:44:21 +02:00
  • 317709b2a8
    make portability_enumeration_ext apple only (#5757) Eve 2024-02-28 19:33:37 +00:00
  • 08c5ee87e4
    llama : remove deprecated API (#5770) Georgi Gerganov 2024-02-28 18:43:38 +02:00
  • 78aacf3634
    awq-py : remove (#5768) Georgi Gerganov 2024-02-28 17:36:53 +02:00
  • 8c0e8f4e73
    sync : ggml Georgi Gerganov 2024-02-28 11:17:32 +02:00
  • 2774b0c974
    add google magika inference example (ggml/748) slaren 2024-02-25 20:41:35 +01:00
  • 5f70671856
    Introduce backend GUIDs (ggml/743) UEXTM.com 2024-02-24 11:27:36 -05:00
  • a693bea1e6
    server : hit Ctrl+C twice to exit (#5734) Xuan Son Nguyen 2024-02-28 09:55:37 +01:00
  • adcb12a9ba
    llama : fix non-quantization of expert gating tensors (#5754) compilade 2024-02-28 03:52:56 -05:00
  • 177628bfd8
    llama : improve BERT tokenization (#5740) Douglas Hanley 2024-02-28 02:51:11 -06:00
  • 6c4416868d
    readme : add link to LLaVA 1.6 models (#5758) Daniel Bevenius 2024-02-28 09:39:39 +01:00
  • efc72253f7
    server : add "/chat/completions" alias for "/v1/..." (#5722) Jorge A 2024-02-28 01:39:15 -07:00
  • 7c4263d426
    ggml : make i-quants work with super-blocks of 64 (CPU,Metal) (#5760) Kawrakow 2024-02-28 10:37:02 +02:00
  • cb49e0f8c9
    Attempt to fix android build (#5752) Kawrakow 2024-02-27 19:16:49 +02:00
  • 0becb22ac0
    IQ4_XS: a 4.25 bpw quantization (#5747) Kawrakow 2024-02-27 16:34:24 +02:00
  • c24a2a6e60
    cuda : replace remaining shfl_xor with calls to warp_reduce functions (#5744) Engininja2 2024-02-27 07:22:45 -06:00
  • 1f30b7a9f1
    ggml-quants : fix avx2 iq1_s vec_dot when compiled with gcc (#5742) Engininja2 2024-02-27 06:50:18 -06:00
  • 9d533a77d0
    llama : fix defrag bugs + add parameter (#5735) Georgi Gerganov 2024-02-27 14:35:51 +02:00
  • cbbd1efa06
    Makefile: use variables for cublas (#5689) le.chang 2024-02-27 10:03:06 +08:00
  • b11a93df41
    fix server hangs on empty prompt (#5733) Xuan Son Nguyen 2024-02-26 23:15:48 +01:00
  • a33e6a0d2a
    Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range (#5721) Kawrakow 2024-02-26 18:28:38 +02:00
  • 47bb7b48c7
    CUDA: fix DEBUG_CUDA_MALLOC (#5729) Johannes Gäßler 2024-02-26 15:36:38 +01:00
  • c4d7f81786
    readme : update ui list (#5731) Artem 2024-02-26 17:15:28 +03:00
  • e849078c6e
    [SYCL] Add support for soft_max ALiBi (#5639) AidanBeltonS 2024-02-26 14:02:11 +00:00
  • 67fd33132f
    unicode : reuse iterator (#5726) Georgi Gerganov 2024-02-26 14:02:12 +02:00
  • 4804215cb8
    server: CI fix trailing space (#5728) Pierrick Hymbert 2024-02-26 11:41:34 +01:00
  • 8a533f0d90
    server: CI tests reduce build matrix (#5725) Pierrick Hymbert 2024-02-26 09:56:10 +01:00
  • 269de86ba0
    llama : fix Gemma rope type (#5691) Georgi Gerganov 2024-02-26 08:30:17 +02:00
  • c393733988
    flake.lock: Update github-actions[bot] 2024-02-25 00:17:11 +00:00
  • e3965cf35a
    server: tests - slow inference causes timeout on the CI (#5715) Pierrick Hymbert 2024-02-25 22:48:33 +01:00
  • 8b350356b2
    server: docs - refresh and tease a little bit more the http server (#5718) Pierrick Hymbert 2024-02-25 21:46:29 +01:00
  • bf08e00643
    llama : refactor k-shift implementation + KV defragmentation (#5691) Georgi Gerganov 2024-02-25 22:12:24 +02:00
  • f7625019c5
    server : fix crash when system prompt is bigger than batch size (#5714) compilade 2024-02-25 13:43:50 -05:00
  • abbabc5e51
    ggml-quants : provide ggml_vqtbl1q_u8 for 64bit compatibility (#5711) Radosław Gryta 2024-02-25 19:43:00 +01:00
  • f1a98c5254
    make : fix nvcc version is empty (#5713) kwin1412 2024-02-26 00:46:49 +08:00
  • 7d548a1827
    readme : add Msty to UI list (#5618) Ashok Gelal 2024-02-25 10:57:34 -05:00
  • 930b178026
    server: logs - unified format and --log-format option (#5700) Pierrick Hymbert 2024-02-25 13:50:32 +01:00
  • d52d7819b8
    server: concurrency fix + monitoring - add /metrics prometheus compatible endpoint (#5708) Pierrick Hymbert 2024-02-25 13:49:43 +01:00
  • 1289408817
    cmake : fix compilation for Android armeabi-v7a (#5702) Radosław Gryta 2024-02-25 11:53:11 +01:00
  • ab336a9d5e
    code : normalize enum names (#5697) Georgi Gerganov 2024-02-25 12:09:09 +02:00
  • 69917dfa55
    py : fix StableLM conversion after config.json changes (#5703) Anas Ahouzi 2024-02-25 10:54:04 +01:00
  • 9e359a4f47
    server: continue to update other slots on embedding concurrent request (#5699) Pierrick Hymbert 2024-02-24 19:16:04 +01:00
  • 4c4cb30736
    IQ3_S: a much better alternative to Q3_K (#5676) Kawrakow 2024-02-24 16:23:52 +02:00
  • 525213d2f5
    server: init functional tests (#5566) Pierrick Hymbert 2024-02-24 12:28:55 +01:00
  • fd43d66f46
    server : add KV cache quantization options (#5684) AlpinDale 2024-02-23 19:31:54 +00:00
  • 54fbcd2ce6
    convert : fix missing ftype for gemma (#5690) Jared Van Bortel 2024-02-23 13:39:14 -05:00
  • 15499eb942
    mpt : do not duplicate token_embd.weight on disk (#5670) Jared Van Bortel 2024-02-22 17:05:23 -05:00
  • 96633eeca1
    gemma : use more bits for the token_embd.weight tensor (#5650) Georgi Gerganov 2024-02-22 23:23:46 +02:00
  • 847eedbdb2
    py : add Gemma conversion from HF models (#5647) Georgi Gerganov 2024-02-22 23:22:48 +02:00
  • 7e4f339c40
    ggml : always define ggml_fp16_t as uint16_t (#5666) Georgi Gerganov 2024-02-22 23:21:39 +02:00
  • 334f76fa38
    sync : ggml Georgi Gerganov 2024-02-22 23:21:05 +02:00
  • efd56b1c21
    ggml : 32-bit arm compat (whisper/1891) Georgi Gerganov 2024-02-22 18:31:40 +02:00
  • 201294ae17
    nix: init singularity and docker images (#5056) Someone 2024-02-22 19:44:10 +00:00