llama.cpp

Author	SHA1	Message	Date
Abhilash Majumder	0f648573dd	ggml : add unified SYCL backend for Intel GPUs (#2690 ) * first update for migration * update init_cublas * add debug functio, commit all help code * step 1 * step 2 * step3 add fp16, slower 31->28 * add GGML_LIST_DEVICE function * step 5 format device and print * step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue * support main device is non-zero * step7 add debug for code path, rm log * step 8, rename all macro & func from cuda by sycl * fix error of select non-zero device, format device list * ren ggml-sycl.hpp -> ggml-sycl.h * clear CMAKE to rm unused lib and options * correct queue: rm dtct:get_queue * add print tensor function to debug * fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481 * summary dpct definition in one header file to replace folder:dpct * refactor device log * mv dpct definition from folder dpct to ggml-sycl.h * update readme, refactor build script * fix build with sycl * set nthread=1 when sycl, increase performance * add run script, comment debug code * add ls-sycl-device tool * add ls-sycl-device, rm unused files * rm rear space * dos2unix * Update README_sycl.md * fix return type * remove sycl version from include path * restore rm code to fix hang issue * add syc and link for sycl readme * rm original sycl code before refactor * fix code err * add know issue for pvc hang issue * enable SYCL_F16 support * align pr4766 * check for sycl blas, better performance * cleanup 1 * remove extra endif * add build&run script, clean CMakefile, update guide by review comments * rename macro to intel hardware * editor config format * format fixes * format fixes * editor format fix * Remove unused headers * skip build sycl tool for other code path * replace tab by space * fix blas matmul function * fix mac build * restore hip dependency * fix conflict * ren as review comments * mv internal function to .cpp file * export funciton print_sycl_devices(), mv class dpct definition to source file * update CI/action for sycl code, fix CI error of repeat/dup * fix action ID format issue * rm unused strategy * enable llama_f16 in ci * fix conflict * fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml * fix ci cases for unsupported data type * revert unrelated changed in cuda cmake remove useless nommq fix typo of GGML_USE_CLBLAS_SYCL * revert hip cmake changes * fix indent * add prefix in func name * revert no mmq * rm cpu blas duplicate * fix no_new_line * fix src1->type==F16 bug. * pass batch offset for F16 src1 * fix batch error * fix wrong code * revert sycl checking in test-sampling * pass void as arguments of ggml_backend_sycl_print_sycl_devices * remove extra blank line in test-sampling * revert setting n_threads in sycl * implement std::isinf for icpx with fast math. * Update ci/run.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/sycl/run-llama2.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/sycl/run-llama2.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add copyright and MIT license declare * update the cmd example --------- Co-authored-by: jianyuzh <jianyu.zhang@intel.com> Co-authored-by: luoyu-intel <yu.luo@intel.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-28 17:56:23 +02:00
Johannes Gäßler	9241c3a2ac	Apply min_p to unsorted tokens (#5115 )	2024-01-28 09:59:49 +01:00
Johannes Gäßler	b2b2bf988c	Tests for min_p, sampling queue (#5147 )	2024-01-28 09:35:14 +01:00
sharpHL	f2e69d28c0	llama : add support for Orion-14B (#5118 ) * add support for Orion-14B(https://huggingface.co/OrionStarAI/Orion-14B-Chat) * flake8 support * Update llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update llama.cpp Co-authored-by: slaren <slarengh@gmail.com> * Update llama.cpp * Update llama.cpp --------- Co-authored-by: lixiaopu <lixiaopu@cmcm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2024-01-28 10:00:30 +02:00
Kawrakow	1182cf4d4f	Another bucket sort (#5109 ) * Initial bucket sort * Bucket sort: slightly better version * Bucket sort: another minor improvement --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-26 09:14:39 +02:00
l3utterfly	5eaf9964fc	llama : dynamic temperature sampling (#4972 ) * implemented dynamic temperature sampling from koboldcpp * removed trailing whitespace * removed unused temp parameter in llama_sample_entropy * exposed exponent_val in dynamic temp sampler * added debug check for printf statements * use nullptr in llama_sample_softmax call during llama_sample_entropy this avoids counting the time taken stats twice Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * return earlier if there is only 1 candiate (i.e. max_entropy == 0) * reformat 't' case in llama_sample_queue Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * check for one or zero candidates case in llama_sample_entropy --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-01-25 22:06:22 +02:00
Kawrakow	faa3526a1e	Fix Q3_K_XS for MoE models (#5113 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-25 17:58:53 +02:00
slaren	1387ea2117	llama : pre-allocate input tensors in a separate buffer (#5100 )	2024-01-24 12:48:14 +01:00
Georgi Gerganov	89758723c7	minor : clean-up some warnings and style (#5094 ) * minor : clean-up some warnings and style ggml-ci * ggml : add comment	2024-01-23 14:12:57 +02:00
slaren	011e8ec577	llama : fix not enough space in buffer with Qwen (#5086 )	2024-01-22 23:42:41 +01:00
compilade	d6bd4d46dd	llama : support StableLM 2 1.6B (#5052 ) * llama : support StableLM 2 1.6B * convert : fix Qwen's set_vocab wrongly naming all special tokens [PAD{id}] * convert : refactor Qwen's set_vocab to use it for StableLM 2 too * nix : add tiktoken to llama-python-extra * convert : use presence of tokenizer.json to determine StableLM tokenizer loader It's a less arbitrary heuristic than the vocab size.	2024-01-22 13:21:52 +02:00
Kawrakow	66d575c45c	llama : add Q3_K_XS (#5060 ) * Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S * Q3_K_XS: quanize first 1/8 of ffn_down layers with Q4_K Together with an importance matrix, this brings perplexity for LLaMA-v2-70B below the perplexity of the former Q2_K with a 800 MB smaller quantized model size. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-22 12:43:33 +02:00
Shijie	3466c6ebcf	llama : add more qwen2 models (#5071 )	2024-01-22 09:33:19 +02:00
slaren	6df465a91d	llama : run all KQV ops on the CPU with no KV offload (#5049 ) ggml-ci	2024-01-20 17:05:49 +02:00
Shijie	9b75cb2b3c	llama : support upcoming Qwen2 (#5037 )	2024-01-19 13:53:13 +02:00
chiranko	2b3b999cac	llama : add CodeShell support (#5016 ) * llama: add codeshell support * llama.cpp: fix codeshell with NeoX rope Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-19 11:07:27 +02:00
John	57e2a7a52a	llama : fix falcon arch for tied output embeddings (#4978 ) * falcon arch fix for tied output embeddings * Update llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update llama.cpp * Update llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update llama.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-19 00:12:15 +02:00
slaren	96d7f56d29	llama : fix mlock with no-mmap with Metal (#5025 )	2024-01-18 21:12:15 +01:00
Georgi Gerganov	38566680cd	ggml : add IQ2 to test-backend-ops + refactoring (#4990 ) * ggml : add IQ2 to test-backend-ops + refactoring ggml-ci * cuda : update supports_op for IQ2 ggml-ci * ci : enable LLAMA_CUBLAS=1 for CUDA nodes ggml-ci * cuda : fix out-of-bounds-access in `mul_mat_vec_q` ggml-ci * tests : avoid creating RNGs for each Q tensor ggml-ci * tests : avoid creating RNGs for each tensor ggml-ci	2024-01-17 18:54:56 +02:00
Georgi Gerganov	44a1a4a41a	backend : add eval callback (#4935 ) * backend : add eval callback ggml-ci * backend : group nodes in a single compute when user don't need them * backend : clean-up the implementation ggml-ci * simple : do not perform tensor data copy if not needed * simple : fix * simple : no need for ggml_is_contiguous + fix bool parse * llama : fix callback placement in llama_context_params * backend : avoid double-ask callback calls * simple : restore examples, imatrix will serve as a demo	2024-01-17 18:39:41 +02:00
Kawrakow	2b3a665d39	llama : use Q4_K for attn_v for Q2_K_S when n_gqa >= 4 (#4996 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-17 12:36:37 +02:00
Kawrakow	334a835a1c	ggml : importance matrix support for legacy quants (#4969 ) * imatrix: adding support for legacy quants * imatrix: guard Q4_0/Q5_0 against ffn_down craziness --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-16 19:51:26 +02:00
David Friehs	4483396751	llama : apply classifier-free guidance to logits directly (#4951 )	2024-01-15 15:06:52 +02:00
Kawrakow	2faaef3979	llama : check for 256 divisibility for IQ2_XS, IQ2_XXS (#4950 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-15 10:09:38 +02:00
David Pflug	a836c8f534	llama : fix missing quotes (#4937 )	2024-01-14 17:46:00 +02:00
Georgi Gerganov	bb0c139247	llama : check LLAMA_TRACE env for extra logging (#4929 ) * llama : minor fix indent * llama : check LLAMA_TRACE env for extra logging ggml-ci	2024-01-14 13:26:53 +02:00
Georgi Gerganov	03c5267490	llama : use LLAMA_LOG_ macros for logging	2024-01-14 11:03:19 +02:00
Kawrakow	a128c38de8	Fix ffn_down quantization mix for MoE models (#4927 ) * Fix ffn_down quantization mix for MoE models In #4872 I did not consider the part where every third tensor is quantized with more bits. Fir MoE this leads to tensors of the same layer being quantized with different number of bits, which is not considered as a possibility in the inference implementation (it is assumed all experts use the same quantization). * Fix the fix * Review suggestion --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-14 10:53:39 +02:00
Karthik Kumar Viswanathan	ac32902a87	llama : support WinXP build with MinGW 8.1.0 (#3419 )	2024-01-14 10:41:44 +02:00
Kawrakow	147b17ac94	2-bit quantizations (#4897 ) * imatrix: load * imatrix: WIP * imatrix: Add Q2_K quantization * imatrix: also guard against Q2_K_S quantization without importance matrix * imatrix: guard even more against low-bit quantization misuse --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-14 09:45:56 +02:00
Kawrakow	807179ec58	Make Q3_K_S be the same as olf Q3_K_L for Mixtral-8x7B (#4906 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-14 09:44:30 +02:00
Georgi Gerganov	4be5ef556d	metal : remove old API (#4919 ) ggml-ci	2024-01-13 20:45:45 +02:00
Georgi Gerganov	f172de03f1	llama : fix detokenization of non-special added-tokens (#4916 ) Co-authored-by: goerch <jhr.walter@t-online.de>	2024-01-13 18:47:38 +02:00
David Friehs	df845cc982	llama : minimize size used for state save/load (#4820 ) * examples : save-load-state: save only required state * llama : only reserve n_vocab * n_batch at most for logits llama_decode asserts that only n_batch tokens are passed each call, and n_ctx is expected to be bigger than n_batch. * llama : always reserve n_vocab * n_batch for logits llama_context de-serialization breaks if the contexts have differing capacity for logits and llama_decode will at maximum resize to n_vocab * n_batch. * llama : only save and restore used logits for batch sizes of 512 this reduces save state in the best case by around 62 MB, which can be a lot if planning to save on each message to allow regenerating messages. * llama : use ostringstream and istringstream for save and load * llama : serialize rng into minimum amount of space required * llama : break session version due to serialization changes	2024-01-13 18:29:43 +02:00
Georgi Gerganov	15ebe59210	convert : update phi-2 to latest HF repo (#4903 ) * convert : update phi-2 to latest HF repo ggml-ci * py : try to fix flake stuff	2024-01-13 13:44:37 +02:00
slaren	e7e4df031b	llama : ggml-backend integration (#4766 ) * llama : ggml-backend integration * ggml-backend : add names to buffers * fix unmap after loading * batched-bench : add tensor_split param * llama : check for null tensor_split * ggml-backend : increase GGML_MAX_BACKENDS * improve graph splitting, partial fix for --no-kv-offload * cuda : add ggml-backend split buffer support * cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available) * ggml : fix null backend dereference (#4807) * ggml : fix null backend dereference * ggml : also check ggml_backend_is_cpu * test-backend-ops : check buffer allocation failures * llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row) * ggml : fix mul_mat_id work size * llama : rewrite session kv load/set without graphs * minor * llama : only initialize used backends, free backends on context free * llama : abort ctx if cuda backend init fails * llama : rewrite lora with ggml-backend and compute on CPU ggml-ci * llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer * opencl : add ggml-backend buffer type * cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf) * llama : on Metal, by default offload the full model ggml-ci * metal : page align the data ptr (#4854) * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cuda : fix split buffer free * address review comments * llama-bench : add split-mode parameter * fix whitespace * opencl : fix double initialization * server : add --split-mode parameter * use async copy and compute to improve multi-gpu performance ggml-ci * use async memcpys to copy the graph outputs to the CPU * fix opencl * use a host buffer for the cpu compute buffer for faster copies to the gpu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-01-12 20:07:38 +01:00
Georgi Gerganov	584d674be6	llama : remove redundant assert for StableLM (#4901 )	2024-01-12 20:54:12 +02:00
Georgi Gerganov	3cabe80630	llama : fix typo "imp_embd" -> "inp_embd"	2024-01-12 13:11:15 +02:00
Georgi Gerganov	f445c0e68c	llama : fix llm_build_k_shift to use correct n_rot (#4889 ) * llama : fix llm_build_k_shift to use correct n_rot ggml-ci * llama : always use hparams.n_rot for ggml_rope_custom ggml-ci * convert : fix persimmon conversion to write correct n_rot	2024-01-12 13:01:56 +02:00
Kawrakow	469e75d0a3	llama : restore intended k-quants mixes for MoE models (#4872 ) * Restore intended k-quants quantization mixes for MoE models * Update Q2_K_S values in the quantize tool Still using LLaMA-v1 PPL values in the quant description today does not make much sense. But let's leave this update for another PR. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-11 21:43:15 +02:00
Kawrakow	49662cbed3	ggml : SOTA 2-bit quants (add IQ2_XS) (#4856 ) * iq2_xs: basics * iq2_xs: this should have been in the basics * iq2_xs: CUDA and scalar CPU works * iq2_xs: WIP Metal * iq2_xs: Metal now works * iq2_xs: working, but dog slow, ARM_NEON dot product * iq2_xs: better ARM_NEON dot product We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when running on the CPU. * iq2_xs: AVX2 dot product - 19.5 t/s * iq2_xs: faster AVX2 dit product 21.4 t/s for TG-128, 59.2 t/s for PP-512. The latter is 2x compared to the previous version. * iq2_xs: had forgotten to delete iq2-data.h * Add llama enum for IQ2_XS --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-11 21:39:39 +02:00
pudepiedj	43f76bf1c3	main : print total token count and tokens consumed so far (#4874 ) * Token count changes * Add show token count * Updating before PR * Two requested changes * Move param def posn	2024-01-11 18:14:52 +02:00
Brian	57d016ba2d	llama : add additional suffixes for model params (#4834 ) * llm_load_print_meta: Add additional suffixs for model params * Update llama.cpp model param log remove unneeded comments and convert from > to >=	2024-01-10 16:09:53 +02:00
Austin	329ff61569	llama : recognize 1B phi models (#4847 ) This update categorizes models with 24 layers as MODEL_1B, ensuring compatibility with different Phi model variants without impacting existing Phi-2 model functionality.	2024-01-10 15:39:09 +02:00
Kawrakow	dd5ae06405	SOTA 2-bit quants (#4773 ) * iq2_xxs: basics * iq2_xxs: scalar and AVX2 dot products Needed to change Q8_K to have quants in the -127...127 range, else the IQ2_XXS AVX implementation becomes very awkward. The alternative would have been to use Q8_0 instead. Perhaps I'll change later, for now this is what we have. * iq2_xxs: ARM_NEON dot product Somehow strangely slow (112 ms/token). * iq2_xxs: WIP Metal Dequantize works, something is still wrong with the dot product. * iq2_xxs: Metal dot product now works We have PP-512 = 475 t/s TG-128 = 47.3 t/s Not the greatest performance, but not complete garbage either. * iq2_xxs: slighty faster dot product TG-128 is now 48.4 t/s * iq2_xxs: slighty faster dot product TG-128 is now 50.9 t/s * iq2_xxs: even faster Metal dot product TG-128 is now 54.1 t/s. Strangely enough, putting the signs lookup table into shared memory has a bigger impact than the grid values being in shared memory. * iq2_xxs: dequantize CUDA kernel - fix conflict with master * iq2_xxs: quantized CUDA dot product (MMVQ) We get TG-128 = 153.1 t/s * iq2_xxs: slightly faster CUDA dot product TG-128 is now at 155.1 t/s. * iq2_xxs: add to llama ftype enum * iq2_xxs: fix MoE on Metal * Fix missing MMQ ops when on hipBLAS I had put the ggml_supports_mmq call at the wrong place. * Fix bug in qequantize_row_iq2_xxs The 0.25f factor was missing. Great detective work by @ggerganov! * Fixing tests * PR suggestion --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-08 16:02:32 +01:00
Georgi Gerganov	b0034d93ce	examples : add passkey test (#3856 ) * examples : add passkey test * passkey : better prints * passkey : select pass key pos from CLI * passkey : simplify n_past logic * make : add passkey target * passkey : add "self-extend"-like context extension (#4810) * llama : "self-extend"-like context extension * passkey : add comment * passkey : add readme	2024-01-08 11:14:04 +02:00
Georgi Gerganov	9dede37d81	llama : remove unused vars (#4796 )	2024-01-07 14:29:36 +02:00
Georgi Gerganov	3c36213df8	llama : remove redundant GQA check (#4796 )	2024-01-07 11:21:53 +02:00
Georgi Gerganov	d117d4dc5d	llama : print tensor meta for debugging	2024-01-07 09:51:12 +02:00
Georgi Gerganov	540938f890	llama : llama_model_desc print number of experts	2024-01-02 16:26:45 +02:00

1 2 3 4 5 ...

512 commits