parallel : add option for non-shared and larger prompts (#13598)

* parallel : add option for non-shared and larger prompts

* parallel : update readme [no ci]

* cont : add note about base models [no ci]

* parallel : better var name

ggml-ci
Author: Georgi Gerganov, 2025-05-17 12:58:55 +03:00 (committed by GitHub)
parent 2f5a4e1e09
commit 518329b2d4
3 changed files with 99 additions and 16 deletions


@@ -1,3 +1,14 @@
# llama.cpp/example/parallel

Simplified simulation of serving incoming requests in parallel

## Example

Generate 128 client requests (`-ns 128`), simulating 8 concurrent clients (`-np 8`). The system prompt is shared (`-pps`), meaning that it is computed once at the start. The client requests consist of 10 junk questions (`--junk 10`) followed by the actual question.
```bash
llama-parallel -m model.gguf -np 8 -ns 128 --top-k 1 -pps --junk 10 -c 16384
```
> [!NOTE]
> It's recommended to use base models with this example. Instruction-tuned models might not be able to properly follow the custom chat template specified here, so the results might not be as expected.
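This change also adds the non-shared case: omitting `-pps` gives each client its own system prompt, which is then processed per request instead of once at the start. A minimal sketch of such an invocation, reusing the flags from the example above (the `--junk 64` and `-c 32768` values here are illustrative choices, not prescribed settings):

```bash
# Non-shared system prompts: without -pps, each client's prompt is
# evaluated individually rather than once at startup. A larger --junk
# count makes each prompt bigger, so the context size (-c) is raised
# accordingly (both values are illustrative).
llama-parallel -m model.gguf -np 8 -ns 128 --top-k 1 --junk 64 -c 32768
```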