Commit graph

18 commits

Howard Su
32c5411631
Revert "Support using mmap when applying LoRA ()" ()
This has a performance regression when mlock is used.

This reverts commit 2347463201.
2023-07-13 21:58:25 +08:00
Howard Su
2347463201
Support using mmap when applying LoRA ()
* Support using mmap when applying LoRA

* Fix Linux

* Update comment to reflect LoRA support with mmap
2023-07-11 22:37:01 +08:00
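The patch above maps the LoRA file instead of reading it into a heap buffer (the revert above it pulls this back out because of the perf regression with mlock). A minimal sketch of the idea, assuming POSIX `mmap` and a hypothetical adapter path, not the actual patch:

```cpp
// Sketch: memory-map an adapter file read-only (POSIX).
// The path here is hypothetical, not from the actual patch.
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char * path = "lora-adapter.bin"; // hypothetical file name
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); close(fd); return 1; }

    // Map the whole file; the kernel pages data in on demand instead of
    // the loader copying everything into a heap buffer first.
    void * data = mmap(nullptr, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    printf("mapped %lld bytes at %p\n", (long long) st.st_size, data);

    munmap(data, (size_t) st.st_size);
    close(fd);
    return 0;
}
```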
Howard Su
b8c8dda75f
Use unsigned for random seed ()
* Use unsigned for random seed. Keep -1 as the value that selects a time-based seed.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-29 06:15:15 -07:00
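A sketch of the sentinel scheme this describes: the seed is stored unsigned, `-1` parses to all ones and selects a time-based seed, and `0` remains a legitimate seed (see commit 2bb992f034 further down). Names here are illustrative:

```cpp
// Sketch: treat an all-ones seed value as "use the current time".
// The sentinel name is illustrative, not the actual constant.
#include <cstdint>
#include <cstdio>
#include <ctime>

static const uint32_t SEED_DEFAULT = 0xFFFFFFFFu; // what "-1" becomes as unsigned

uint32_t resolve_seed(uint32_t seed) {
    // 0 is a perfectly valid seed (commit 2bb992f034 below);
    // only the -1 sentinel triggers a time-based seed.
    return seed == SEED_DEFAULT ? (uint32_t) time(nullptr) : seed;
}

int main() {
    printf("seed: %u\n", resolve_seed(SEED_DEFAULT)); // time-based
    printf("seed: %u\n", resolve_seed(0));            // stays 0
    return 0;
}
```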
zrm
b853d45601
ggml : add NUMA support ()
* detect NUMA systems and pin work threads to nodes (linux)

* disable mmap prefetch/readahead for NUMA systems

* avoid sending finalize op to thread pool if it does nothing

* silence robot

* fix args

* make --numa a param

* the recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement

* lower synchronization overhead

* statically allocate

* move numa state to g_state

* add description for --numa

* ggml : minor style changes

* ggml : minor style + try fix sanitizer build

* llama : allow initializing the backend with NUMA support

* llama : avoid ggml include in llama-util.h

* ggml : style / formatting

* ggml : fix handling of ops with n_threads > n_tasks > 1

* server : utilize numa parameter

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-26 20:57:59 +03:00
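A minimal sketch of the thread-pinning idea from the first bullet, using libnuma on Linux (link with `-lnuma`); this is illustrative, not ggml's actual scheduler code:

```cpp
// Sketch: round-robin worker threads across NUMA nodes with libnuma.
// Compile with -lnuma on Linux; illustrative only.
#include <cstdio>
#include <thread>
#include <vector>
#include <numa.h>

int main() {
    if (numa_available() < 0) {
        printf("no NUMA support on this system\n");
        return 0;
    }
    const int n_nodes   = numa_num_configured_nodes();
    const int n_threads = 8; // ideally a multiple of n_nodes, per the commit

    std::vector<std::thread> workers;
    for (int i = 0; i < n_threads; ++i) {
        workers.emplace_back([i, n_nodes] {
            // Pin this worker to one node; memory it touches is then
            // allocated locally, avoiding cross-node traffic.
            numa_run_on_node(i % n_nodes);
            numa_set_localalloc();
            // ... do work ...
        });
    }
    for (auto & w : workers) w.join();
    return 0;
}
```

Pinning keeps each worker's allocations node-local, which is presumably also why the commit disables mmap prefetch/readahead on NUMA systems: eager readahead would fault pages in before it is known which node will use them.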
Johannes Gäßler
254a7a7a5f
CUDA full GPU acceleration, KV cache in VRAM ()
* Fixed CUDA RoPE

* ggml_cuda_mul_mat_vec_p021

* ggml_cuda_scale

* ggml_cuda_diag_mask_inf

* ggml_is_permuted

* ggml_cuda_cpy

* flatten rows for ggml_cuda_op

* Added a --low-vram option

* Fixed Windows performance

* Fixed LLAMA_CUDA_DMMV_Y > 1 for WizardLM
2023-06-14 19:47:19 +02:00
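Of the helpers listed, `ggml_is_permuted` is a pure stride check and easy to sketch: a tensor with ggml-style `nb[0..3]` byte strides counts as permuted when the strides are out of their canonical non-decreasing order. The struct below is simplified for illustration:

```cpp
// Sketch: detect a permuted tensor from its byte strides.
// ggml stores one stride per dimension in nb[0..3]; layout simplified here.
#include <cstdio>
#include <cstddef>

struct tensor {
    size_t nb[4]; // stride in bytes for each dimension
};

bool is_permuted(const tensor & t) {
    // Contiguous (non-permuted) tensors have non-decreasing strides.
    return t.nb[0] > t.nb[1] || t.nb[1] > t.nb[2] || t.nb[2] > t.nb[3];
}

int main() {
    tensor contiguous = {{4, 16, 64, 256}}; // strides grow with the dims
    tensor transposed = {{16, 4, 64, 256}}; // first two dims swapped
    printf("%d %d\n", is_permuted(contiguous), is_permuted(transposed)); // 0 1
    return 0;
}
```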
Johannes Gäßler
17366df842
Multi GPU support, CUDA refactor, CUDA scratch buffer ()
* CUDA multi GPU + scratch

* ggml_cuda_compute_forward

* Tensor parallelism

* ggml_cuda_add

* ggml_cuda_rms_norm

* ggml_cuda_silu

* CUDA scratch buffer

* --main-gpu CLI option
2023-06-06 21:33:23 +02:00
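A sketch of the row-partitioning arithmetic behind tensor parallelism across devices: each GPU receives a fraction of a weight matrix's rows, with the last device taking the remainder. The fractions and names here are hypothetical:

```cpp
// Sketch: partition the rows of a weight matrix across GPUs by
// per-device fractions (illustrative values and names).
#include <cstdio>
#include <vector>

int main() {
    const int n_rows = 4096;
    const std::vector<float> split = {0.6f, 0.4f}; // fraction of rows per GPU

    int row = 0;
    for (size_t gpu = 0; gpu < split.size(); ++gpu) {
        int n = (gpu + 1 == split.size())
              ? n_rows - row                 // last GPU takes the remainder
              : (int)(split[gpu] * n_rows);
        printf("GPU %zu: rows [%d, %d)\n", gpu, row, row + n);
        row += n;
    }
    return 0;
}
```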
Kerfuffle
1b78ed2081
Only show -ngl option when relevant + other doc/arg handling updates ()
1. Add a `LLAMA_SUPPORTS_GPU_OFFLOAD` define to `llama.h` (defined when compiled with CLBlast or cuBLAS)
2. Update the argument handling in the common example code to only show the `-ngl`, `--n-gpu-layers` option when GPU offload is possible.
3. Add an entry for the `-ngl`, `--n-gpu-layers` option to the `main` and `server` examples documentation
4. Update `main` and `server` examples documentation to use the new style dash separator argument format
5. Update the `server` example to use dash separators for its arguments and adds `-ngl` to `--help` (only shown when compiled with appropriate support). It will still support `--memory_f32` and `--ctx_size` for compatibility.
6. Add a warning discouraging use of `--memory-f32` for the `main` and `server` examples `--help` text as well as documentation. Rationale: https://github.com/ggerganov/llama.cpp/discussions/1593#discussioncomment-6004356
2023-05-28 11:48:57 -06:00
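Item 2 amounts to compile-time gating of the argument handling; a minimal sketch using the `LLAMA_SUPPORTS_GPU_OFFLOAD` define named in item 1:

```cpp
// Sketch: only print the -ngl option when the binary was built with
// GPU offload support (define named in the entry above).
#include <cstdio>

// #define LLAMA_SUPPORTS_GPU_OFFLOAD // set by the build with cuBLAS/CLBlast

void print_usage() {
    printf("  -h, --help            show this help\n");
#ifdef LLAMA_SUPPORTS_GPU_OFFLOAD
    printf("  -ngl N, --n-gpu-layers N\n");
    printf("                        number of layers to store in VRAM\n");
#endif
}

int main() { print_usage(); return 0; }
```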
Kerfuffle
66874d4fbc
Some improvements to loading the session with --prompt-cache ()
Improvements to loading the session with `--prompt-cache` in the `main` example.

1. Fix an issue where the `--seed` parameter was ignored when loading a cached prompt.
2. When loading a cached prompt, you previously had to specify the saved prompt (or a prefix of it) again. This pull changes that behavior to default to the prompt that was cached if a prompt wasn't specified by the user.
2023-05-25 20:18:01 -06:00
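A sketch of the fallback described in item 2, assuming illustrative variable names rather than the actual `main` example code:

```cpp
// Sketch: with --prompt-cache and no --prompt, fall back to the
// tokens saved in the cached session (names illustrative).
#include <cstdio>
#include <string>
#include <vector>

int main() {
    std::string user_prompt = "";                       // no --prompt given
    std::vector<int> session_tokens = {1, 15043, 3186}; // loaded from the cache

    std::vector<int> embd_inp;
    if (!user_prompt.empty()) {
        // embd_inp = tokenize(user_prompt);  // normal path
    } else if (!session_tokens.empty()) {
        embd_inp = session_tokens;            // reuse the cached prompt
    }
    printf("prompt tokens: %zu\n", embd_inp.size());
    return 0;
}
```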
Evan Jones
cf348a60e0
main : add option to save full output to session ()
* main : add option to save full output to session

* split behavior into --session and --prompt-cache

* restore original implementation with new names

* PR comments

* move the check for incompatible parameters to gpt_params_parse

* Fix whitespace

Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>

---------

Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
2023-05-10 11:37:14 -04:00
44670
2edbdb0f99
main : add --in-suffix option ()
* add --in-suffix option

* print input suffix before generation
2023-05-04 18:41:12 +03:00
DannyDaemonic
db1080876a
Only escape prompts when used with -e () 2023-05-04 05:08:25 -07:00
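A sketch of the kind of escape processing that `-e` gates: literal `\n`, `\t`, and similar sequences in the prompt are translated only when the flag is passed (illustrative, not the exact function from the patch):

```cpp
// Sketch: process backslash escapes in the prompt string only when -e
// is passed (illustrative implementation).
#include <cstdio>
#include <string>

std::string process_escapes(const std::string & in) {
    std::string out;
    for (size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size()) {
            switch (in[++i]) {
                case 'n':  out += '\n'; break;
                case 't':  out += '\t'; break;
                case '\\': out += '\\'; break;
                case '"':  out += '"';  break;
                default:   out += '\\'; out += in[i]; break; // leave unknown escapes alone
            }
        } else {
            out += in[i];
        }
    }
    return out;
}

int main() {
    printf("%s", process_escapes("Hello\\nWorld\\n").c_str());
    return 0;
}
```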
DannyDaemonic
c65a7fbfa9
Update main's README.md with new features () 2023-05-04 03:02:59 -07:00
Robert Brisita
2bb992f034
llama : allow 0 as a seed number. () 2023-05-02 19:23:44 +03:00
mgroeber9110
9b0a4d4214
examples/main README improvements and some light refactoring () 2023-04-24 15:45:32 +00:00
slaren
1d78fecdab
Fix LoRA acronym () 2023-04-23 23:03:44 +02:00
DannyDaemonic
edce63baa9
Added README.md for main with examples and explanations () 2023-04-23 15:37:02 +00:00
Pavol Rusnak
8b679987cd
Fix whitespace, add .editorconfig, add GitHub workflow () 2023-04-11 19:45:44 +00:00
Georgi Gerganov
a316a425d0
Overhaul the examples structure
- main -> examples
- utils -> examples (renamed to "common")
- quantize -> examples
- separate tools for "perplexity" and "embedding"

Hope I didn't break something!
2023-03-25 20:26:40 +02:00