Commit graph

  • 4f6b60c776
    CUDA: Fix models with output size != 32000 (#2480) Johannes Gäßler 2023-08-02 16:48:10 +02:00
  • 220d931864
    readme : add Aquila-7B model series to supported models (#2487) ldwang 2023-08-02 16:21:11 +08:00
  • 81844fbcfd
    tests : Fix compilation warnings (Linux/GCC) (#2451) Eve 2023-08-02 04:06:19 -04:00
  • a312193e18
    readme : Add Chinese LLaMA-2 / Alpaca-2 to supported models (#2475) Yiming Cui 2023-08-02 14:18:31 +08:00
  • c574bddb36
    fix a typo in examples/server/README.md (#2478) Bono Lv 2023-08-01 20:54:28 +08:00
  • 86aeb27734
    server : Support dark mode (#2414) ebraminio 2023-08-01 01:56:23 -07:00
  • 1873ff586b
    metal : add gqa8 kernel to allow llama-2-70B on metal (#2459) Matteo Boschini 2023-08-01 09:43:12 +02:00
  • 49e7cb5bb1
    CUDA: fixed LLAMA_FAST compilation option (#2473) Johannes Gäßler 2023-07-31 21:02:19 +02:00
  • b772bba42e
    CUDA: fixed cmake F16 option (#2471) Johannes Gäßler 2023-07-31 19:52:22 +02:00
  • 0728c5a8b9
    CUDA: mmq CLI option, fixed mmq build issues (#2453) Johannes Gäßler 2023-07-31 15:44:35 +02:00
  • 1215ed7d5c
    CUDA: Implemented row flattening for non-glm RoPE (#2468) Johannes Gäßler 2023-07-31 14:32:30 +02:00
  • 2dbf518911
    CUDA: fewer memory bank conflicts for mul_mat_q (#2458) Johannes Gäßler 2023-07-31 13:18:51 +02:00
  • 9d2382b3e4
    Fix Metal backend broken from the allocator changes (#2455) slaren 2023-07-31 11:02:53 +02:00
  • a113689571
    ggml : add graph tensor allocator (#2411) slaren 2023-07-30 15:58:01 +02:00
  • 11f3ca06b8
    CUDA: Quantized matrix matrix multiplication (#2160) Johannes Gäßler 2023-07-29 23:04:44 +02:00
  • 9baf9ef304
    CUDA: faster multi GPU synchronization (#2448) Johannes Gäßler 2023-07-29 23:04:10 +02:00
  • 8a88e5855c
    perplexity : add Hellaswag calculation (#2389) klosax 2023-07-28 20:25:36 +02:00
  • a9559bf77b
    ggml : workaround for missing _mm256_setr_m128i in GCC < 8 in k_quants.c (#2405) Lee 2023-07-29 02:17:45 +08:00
  • ee1b497c98
    llama : support more diverse tokenizers? (#2420) eric8607242 2023-07-29 02:10:05 +08:00
  • d73b8d48b4
    examples : fix whitespace Georgi Gerganov 2023-07-28 21:05:08 +03:00
  • 34ae1caf7f
    examples : server chat mode with llama2 (#2400) nhamanasu 2023-07-29 03:02:10 +09:00
  • d91f3f0c55
    readme : fix the description of the Tail free sampling (TFS) method (#2431) Weird Constructor 2023-07-28 10:44:43 +02:00
  • 65cdf34bdc
    llama : use n_embd_gqa instead of n_embd to handle llama-2 70B (#2433) Rand Xie 2023-07-28 01:42:53 -07:00
  • edcc7ae7d2
    Obtaining LLaMA 2 instructions (#2308) niansa/tuxifan 2023-07-28 03:14:11 +02:00
  • 7c529cede6
    convert.py : Update to support 70B HF format model files (#2427) mj-shifu 2023-07-27 22:39:17 +02:00
  • 1a941869cb
    metal : disable graph concurrency optimization due to bug (#2413) Georgi Gerganov 2023-07-27 11:00:54 +03:00
  • b5472ea0ad
    ggml : fix assert in ggml_set_unary_op (#2410) slaren 2023-07-26 23:57:23 +02:00
  • 6df1f5940f
    make : build with -Wmissing-prototypes (#2394) Cebtenzzre 2023-07-26 14:00:04 -04:00
  • 5488fb789e
    ggml : allocate graphs in a context (#2392) slaren 2023-07-26 15:56:53 +02:00
  • eb542d3932
    Add LLAMA_DEFAULT_RMS_EPS so we can change the default (#2384) Kawrakow 2023-07-25 18:35:53 +03:00
  • 07aaa0f63f
    ggml : fix ggml_flash_attn to use op_params (#2387) slaren 2023-07-25 16:20:12 +02:00
  • fce48caf9a
    convert.py : support bpe tokenizer (#2228) ldwang 2023-07-25 21:22:09 +08:00
  • 875086bdb9
    ggml : relax contiguous constraints in activation function (#2371) Jiahao Li 2023-07-25 20:58:32 +08:00
  • da1889834a
    ggml : improve graph build time via hash table lookup (#2329) slaren 2023-07-25 14:32:20 +02:00
  • 82552b7f54
    build : fix line breaking error in build-info.sh (#2349) Hesen Peng 2023-07-25 05:24:09 -07:00
  • 0c06204fb3
    main : add --in-prefix-bos to prefix BOS to user inputs; keep EOS (#2304) Xiao-Yong Jin 2023-07-25 07:19:11 -05:00
  • 1fed755b1f
    ci : add non-AVX scalar build/test (#2356) Eve 2023-07-25 08:16:13 -04:00
  • be2301bcda
    k_quants : add AVX support to dot functions with QK_K as 64 (#2339) katsu560 2023-07-25 21:13:41 +09:00
  • 1aa18ef994
    metal : concurrently dispatch commands (#2358) Shouzheng Liu 2023-07-25 08:00:19 -04:00
  • 9a08eaf3c4
    Another speed gain for Q4_0 and Q4_1 on Metal (#2375) Kawrakow 2023-07-25 13:48:29 +03:00
  • 129d844c87
    Fix Q4_K and Q5_K for QK_K = 64 on CUDA (#2359) Kawrakow 2023-07-25 13:48:04 +03:00
  • d5512b782b
    server: add rms_norm_eps parameter (#2380) slaren 2023-07-25 11:36:17 +02:00
  • c798308e3a
    [Server] Escape HTML in webchat (#2368) Henri Vasserman 2023-07-25 10:27:34 +03:00
  • 41c674161f
    make rms_norm_eps a parameter (#2374) slaren 2023-07-24 17:57:12 +02:00
  • b3f138d058
    Chat UI extras (#2366) Aarni Koskela 2023-07-24 17:54:22 +03:00
  • 5b2b2dc6ae
    ggml : sync (unary ops refactor, static-correctness) (#2370) Georgi Gerganov 2023-07-24 14:46:21 +03:00
  • 42f70cb2f6
    Fix scalar version of Q5_K when QK_K = 64 (#2362) Kawrakow 2023-07-24 12:55:02 +03:00
  • 84e09a7d8b
    llama : add grammar-based sampling (#1773) Evan Jones 2023-07-23 23:58:10 -04:00
  • 2f9cf974a0
    Some more Q4_K and Q5_K speedup on CUDA (#2346) Kawrakow 2023-07-24 00:19:47 +03:00
  • 4f06592cc6
    Add gqa parameter support to the server (#2351) IgnacioFDM 2023-07-23 17:31:17 -03:00
  • 70d26ac388
    Fix __dp4a documentation (#2348) Johannes Gäßler 2023-07-23 17:49:06 +02:00
  • 57921ca6db
    common : n_threads == -1 uses std::thread::hardware_concurrency() (#2347) wzy 2023-07-23 21:33:02 +08:00
  • 3602ac4255
    fix n_tasks (#2342) slaren 2023-07-23 15:19:39 +02:00
  • 95a6c595e7
    ggml: move op parameters from tensors to ggml_tensor::op_params (#2333) slaren 2023-07-23 14:36:02 +02:00
  • e76d630df1
    llama : grouped-query attention + LLaMAv2 70B support (#2276) Georgi Gerganov 2023-07-23 15:09:47 +03:00
  • 1d0824b247
    llama : print help to stdout (#2338) maddes8cht 2023-07-23 13:59:48 +02:00
  • bc3ec2cdc9
    flake : support nix build '.#opencl' (#2337) wzy 2023-07-23 19:57:02 +08:00
  • a940458e48
    llama : print max tensor size to stderr (#2336) Christian Demsar 2023-07-23 07:56:34 -04:00
  • 91171b8072
    make : fix CLBLAST compile support in FreeBSD (#2331) Jose Maldonado 2023-07-23 07:52:08 -04:00
  • 355c80f49e
    examples : simplify vim plugin (#2327) AustinMroz 2023-07-23 06:16:48 -05:00
  • 83a00ce69b
    metal : support bcast add & dup & cont op (#2323) Jiahao Li 2023-07-23 19:00:37 +08:00
  • d2a43664f9
    Speed up Q4_K (#2322) Kawrakow 2023-07-23 08:49:20 +03:00
  • b9b7d94fc1
    CUDA: Fixed 7b q3_K_S with mul_mat_vec_q (#2313) Johannes Gäßler 2023-07-22 21:27:34 +02:00
  • b47b8a9cfe
    llama : optimize memory buffers (#2325) Georgi Gerganov 2023-07-22 21:17:57 +03:00
  • b5fe67f8c6
    Perplexity: Compute scores correlated to HellaSwag (#2312) klosax 2023-07-22 14:21:24 +02:00
  • 24baa54ac1
    examples : basic VIM plugin whoreson 2023-07-22 12:34:51 +02:00
  • dd6c67d3cb
    ci : fix args Georgi Gerganov 2023-07-22 12:00:56 +03:00
  • 5d500e8ccf
    ci : add 7B CUDA tests (#2319) Georgi Gerganov 2023-07-22 11:48:22 +03:00
  • 7d5f18468c
    examples : add easy python script to create quantized (k-bit support) GGML models from local HF Transformer models (#2311) Richard Roberson 2023-07-21 13:01:10 -06:00
  • d924522a46
    Custom RoPE + better memory management for CUDA (#2295) Kawrakow 2023-07-21 17:27:51 +03:00
  • 4d76a5f49b
    Faster Q3_K implementation on Metal (#2307) Kawrakow 2023-07-21 17:05:30 +03:00
  • 0db14fef06
    ggml : fix the rope fix (513f861953) Georgi Gerganov 2023-07-21 15:16:55 +03:00
  • 03e566977b
    examples : fix typo in minigpt4.py (#2298) Ikko Eltociear Ashimine 2023-07-21 20:53:07 +09:00
  • 513f861953
    ggml : fix rope args order + assert (#2054) Georgi Gerganov 2023-07-21 14:51:34 +03:00
  • 3973b25a64
    gitignore : fix final newline Georgi Gerganov 2023-07-21 14:42:41 +03:00
  • ab0e26bdfb
    llama : remove cfg smooth factor as it is only a reparameterization of the guidance scale (#2280) Guillaume "Vermeille" Sanchez 2023-07-21 12:58:36 +02:00
  • 73643f5fb1
    gitignore : changes for Poetry users + chat examples (#2284) Jose Maldonado 2023-07-21 06:53:27 -04:00
  • a814d04f81
    make : fix indentation Georgi Gerganov 2023-07-21 13:50:55 +03:00
  • 4c013bb738
    ci : fix MNT realpath usage (#2250) Georgi Gerganov 2023-07-21 13:48:18 +03:00
  • 42c7c2e2e9
    make : support customized LLAMA_CUDA_NVCC and LLAMA_CUDA_CCBIN (#2275) Sky Yan 2023-07-21 18:38:57 +08:00
  • 78a3d13424
    flake : remove intel mkl from flake.nix due to missing files (#2277) wzy 2023-07-21 18:26:34 +08:00
  • ae178ab46b
    llama : make tensor_split ptr instead of array (#2272) Georgi Gerganov 2023-07-21 13:10:51 +03:00
  • 54e3bc76fe
    make : add new target for test binaries (#2244) Jiří Podivín 2023-07-21 12:09:16 +02:00
  • 019fe257bb
    MIKU MAYHEM: Upgrading the Default Model for Maximum Fun 🎉 (#2287) Hatsune Miku 2023-07-21 08:13:18 +00:00
  • e68c96f7fe
    Faster Q2_K on Metal (#2297) Kawrakow 2023-07-21 10:44:40 +03:00
  • 9cf022a188
    make : fix embdinput library and server examples building on MSYS2 (#2235) Przemysław Pawełczyk 2023-07-21 09:42:21 +02:00
  • e782c9e735
    Faster Q5_K and Q6_K on Metal (#2294) Kawrakow 2023-07-20 18:19:45 +03:00
  • 785829dfe8
    Faster Q4_K on Metal (#2290) Kawrakow 2023-07-20 15:18:43 +03:00
  • fff0e0eafe
    llama : fix regression from #2000 - could not load no-mmap models Georgi Gerganov 2023-07-20 13:47:26 +03:00
  • 417a85a001
    metal: minor q4 optimization and reduce code size (#2248) Shouzheng Liu 2023-07-20 06:32:22 -04:00
  • 294f424554
    llama : extend API to get max devices at runtime (#2253) Rinne 2023-07-19 15:06:40 +08:00
  • 45a1b07e9b
    flake : update flake.nix (#2270) wzy 2023-07-19 15:01:55 +08:00
  • b1f4290953
    cmake : install targets (#2256) wzy 2023-07-19 15:01:11 +08:00
  • d01bccde9f
    ci : integrate with ggml-org/ci (#2250) Georgi Gerganov 2023-07-18 14:24:43 +03:00
  • 6cbf9dfb32
    llama : shorten quantization descriptions Georgi Gerganov 2023-07-18 11:50:49 +03:00
  • 7568d1a2b2
    Support dup & cont ops on CUDA (#2242) Jiahao Li 2023-07-18 01:39:29 +08:00
  • b7647436cc
    llama : fix t_start_sample_us initialization warning (#2238) Alex Klinkhamer 2023-07-16 14:01:45 -07:00
  • 672dda10e4
    ggml : fixed runtime bugs and compile errors related to GGML_PERF and GGML_DEBUG (#2219) Qingyou Meng 2023-07-17 03:57:28 +08:00
  • 27ab66e437
    py : turn verify-checksum-models.py into executable (#2245) Jiří Podivín 2023-07-16 21:54:47 +02:00
  • 6e7cca4047
    llama : add custom RoPE (#2054) Xiao-Yong Jin 2023-07-15 06:34:16 -04:00