repo : update links to new url (#11886)
* repo : update links to new url

ggml-ci

* cont : more urls

ggml-ci
Parent: f355229692
Commit: 68ff663a04

66 changed files with 192 additions and 202 deletions
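The change is mechanical: every link to the old `ggerganov/llama.cpp` GitHub namespace (and the matching `ghcr.io/ggerganov` container images) is rewritten to the new `ggml-org` organization. A bulk rename of this kind can be scripted; the snippet below is a hypothetical sketch using `git grep` and GNU `sed`, not the tooling actually used for this commit:

```bash
# Hypothetical sketch of the bulk rename (GNU sed; on BSD/macOS use `sed -i ''`).
# Find every tracked file that still references the old organization...
git grep -lE '(github\.com|ghcr\.io)/ggerganov/llama\.cpp' \
  | xargs sed -i \
      -e 's|github.com/ggerganov/llama.cpp|github.com/ggml-org/llama.cpp|g' \
      -e 's|ghcr.io/ggerganov/llama.cpp|ghcr.io/ggml-org/llama.cpp|g'
# ...then review the result; the diff should contain URL changes only.
git diff --stat
```

The hunks that follow are the result of exactly this kind of substitution across docs, build scripts, and source comments.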
@@ -3,9 +3,9 @@
 This example demonstrates how to generate a control vector using gguf models.

 Related PRs:
-- [Add support for control vectors](https://github.com/ggerganov/llama.cpp/pull/5970)
-- (Issue) [Generate control vector using llama.cpp](https://github.com/ggerganov/llama.cpp/issues/6880)
-- [Add cvector-generator example](https://github.com/ggerganov/llama.cpp/pull/7514)
+- [Add support for control vectors](https://github.com/ggml-org/llama.cpp/pull/5970)
+- (Issue) [Generate control vector using llama.cpp](https://github.com/ggml-org/llama.cpp/issues/6880)
+- [Add cvector-generator example](https://github.com/ggml-org/llama.cpp/pull/7514)

 ## Examples

@@ -1,7 +1,7 @@
 # llama.cpp/examples/imatrix

 Compute an importance matrix for a model and given text dataset. Can be used during quantization to enchance the quality of the quantized models.
-More information is available here: https://github.com/ggerganov/llama.cpp/pull/4861
+More information is available here: https://github.com/ggml-org/llama.cpp/pull/4861

 ## Usage

@@ -100,7 +100,7 @@ bool IMatrixCollector::collect_imatrix(struct ggml_tensor * t, bool ask, void *
     const float * data = is_host ? (const float *) src1->data : m_src1_data.data();

     // this has been adapted to the new format of storing merged experts in a single 3d tensor
-    // ref: https://github.com/ggerganov/llama.cpp/pull/6387
+    // ref: https://github.com/ggml-org/llama.cpp/pull/6387
     if (t->op == GGML_OP_MUL_MAT_ID) {
         // ids -> [n_experts_used, n_tokens]
         // src1 -> [cols, n_expert_used, n_tokens]
@@ -14,7 +14,7 @@ project("llama-android")
 #include(FetchContent)
 #FetchContent_Declare(
 # llama
-# GIT_REPOSITORY https://github.com/ggerganov/llama.cpp
+# GIT_REPOSITORY https://github.com/ggml-org/llama.cpp
 # GIT_TAG master
 #)

@@ -3,9 +3,9 @@
 Local inference of llama.cpp on an iPhone. This is a sample app that can be used as a starting
 point for more advanced projects.

-For usage instructions and performance stats, check the following discussion: https://github.com/ggerganov/llama.cpp/discussions/4508
+For usage instructions and performance stats, check the following discussion: https://github.com/ggml-org/llama.cpp/discussions/4508

-![image](https://github.com/ggerganov/llama.cpp/assets/1991296/2b40284c-876e-41a2-bc76-71ab7e2b613a)
+![image](https://github.com/ggml-org/llama.cpp/assets/1991296/2b40284c-876e-41a2-bc76-71ab7e2b613a)

 Video demonstration:

@@ -39,7 +39,7 @@
 "
 " :call llama#init()
 "
-" more info: https://github.com/ggerganov/llama.cpp/pull/9787
+" more info: https://github.com/ggml-org/llama.cpp/pull/9787
 "

 " colors (adjust to your liking)
@@ -26,7 +26,7 @@ python ./convert_hf_to_gguf.py ../MiniCPM-o-2_6/model
 ```

 Build llama.cpp using `CMake`:
-https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md
+https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md

 ```bash
 cmake -B build
@@ -6,7 +6,7 @@ Download [MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-

 Clone llama.cpp:
 ```bash
-git clone https://github.com/ggerganov/llama.cpp
+git clone https://github.com/ggml-org/llama.cpp
 cd llama.cpp
 ```

@@ -4,4 +4,4 @@ Demonstration of lookahead decoding technique:

 https://lmsys.org/blog/2023-11-21-lookahead-decoding/

-More info: https://github.com/ggerganov/llama.cpp/pull/4207
+More info: https://github.com/ggml-org/llama.cpp/pull/4207
@@ -8,5 +8,5 @@ The key parameters for lookup decoding are `ngram_min`, `ngram_max` and `n_draft

 More info:

-https://github.com/ggerganov/llama.cpp/pull/4484
-https://github.com/ggerganov/llama.cpp/issues/4226
+https://github.com/ggml-org/llama.cpp/pull/4484
+https://github.com/ggml-org/llama.cpp/issues/4226
@@ -1,6 +1,6 @@
 # llama.cpp/examples/main

-This example program allows you to use various LLaMA language models easily and efficiently. It is specifically designed to work with the [llama.cpp](https://github.com/ggerganov/llama.cpp) project, which provides a plain C/C++ implementation with optional 4-bit quantization support for faster, lower memory inference, and is optimized for desktop CPUs. This program can be used to perform various inference tasks with LLaMA models, including generating text based on user-provided prompts and chat-like interactions with reverse prompts.
+This example program allows you to use various LLaMA language models easily and efficiently. It is specifically designed to work with the [llama.cpp](https://github.com/ggml-org/llama.cpp) project, which provides a plain C/C++ implementation with optional 4-bit quantization support for faster, lower memory inference, and is optimized for desktop CPUs. This program can be used to perform various inference tasks with LLaMA models, including generating text based on user-provided prompts and chat-like interactions with reverse prompts.

 ## Table of Contents

@@ -121,7 +121,7 @@ When --in-prefix or --in-suffix options are enabled the chat template ( --chat-t

 ### Chat templates

-`--chat-template JINJA_TEMPLATE`: This option sets a custom jinja chat template. It accepts a string, not a file name. Default: template taken from model's metadata. Llama.cpp only supports [some pre-defined templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template). These include llama2, llama3, gemma, monarch, chatml, orion, vicuna, vicuna-orca, deepseek, command-r, zephyr. When --in-prefix or --in-suffix options are enabled the chat template ( --chat-template ) is disabled.
+`--chat-template JINJA_TEMPLATE`: This option sets a custom jinja chat template. It accepts a string, not a file name. Default: template taken from model's metadata. Llama.cpp only supports [some pre-defined templates](https://github.com/ggml-org/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template). These include llama2, llama3, gemma, monarch, chatml, orion, vicuna, vicuna-orca, deepseek, command-r, zephyr. When --in-prefix or --in-suffix options are enabled the chat template ( --chat-template ) is disabled.

 Example usage: `--chat-template gemma`

@@ -5,8 +5,8 @@ models ability to recall information from long contexts.

 See the following PRs for more info:

-- https://github.com/ggerganov/llama.cpp/pull/3856
-- https://github.com/ggerganov/llama.cpp/pull/4810
+- https://github.com/ggml-org/llama.cpp/pull/3856
+- https://github.com/ggml-org/llama.cpp/pull/4810

 ### Usage

@@ -23,7 +23,7 @@ def create_completion(host, prompt, gbnf_grammar):
     """Calls the /completion API on llama-server.

     See
-    https://github.com/ggerganov/llama.cpp/tree/HEAD/examples/server#api-endpoints
+    https://github.com/ggml-org/llama.cpp/tree/HEAD/examples/server#api-endpoints
     """
     print(f" Request:\n Grammar:\n{textwrap.indent(gbnf_grammar, ' ')}\n Prompt:\n{textwrap.indent(prompt.rstrip(), ' ')}")
     headers = {"Content-Type": "application/json"}
@@ -69,22 +69,22 @@ Several quantization methods are supported. They differ in the resulting model d
 | 13B | ms/tok @ 8th | - | 73 | 82 | 98 | 105 | 128 |
 | 13B | bits/weight | 16.0 | 4.5 | 5.0 | 5.5 | 6.0 | 8.5 |

-- [k-quants](https://github.com/ggerganov/llama.cpp/pull/1684)
+- [k-quants](https://github.com/ggml-org/llama.cpp/pull/1684)
 - recent k-quants improvements and new i-quants
-  - [#2707](https://github.com/ggerganov/llama.cpp/pull/2707)
-  - [#2807](https://github.com/ggerganov/llama.cpp/pull/2807)
-  - [#4773 - 2-bit i-quants (inference)](https://github.com/ggerganov/llama.cpp/pull/4773)
-  - [#4856 - 2-bit i-quants (inference)](https://github.com/ggerganov/llama.cpp/pull/4856)
-  - [#4861 - importance matrix](https://github.com/ggerganov/llama.cpp/pull/4861)
-  - [#4872 - MoE models](https://github.com/ggerganov/llama.cpp/pull/4872)
-  - [#4897 - 2-bit quantization](https://github.com/ggerganov/llama.cpp/pull/4897)
-  - [#4930 - imatrix for all k-quants](https://github.com/ggerganov/llama.cpp/pull/4930)
-  - [#4951 - imatrix on the GPU](https://github.com/ggerganov/llama.cpp/pull/4957)
-  - [#4969 - imatrix for legacy quants](https://github.com/ggerganov/llama.cpp/pull/4969)
-  - [#4996 - k-quants tuning](https://github.com/ggerganov/llama.cpp/pull/4996)
-  - [#5060 - Q3_K_XS](https://github.com/ggerganov/llama.cpp/pull/5060)
-  - [#5196 - 3-bit i-quants](https://github.com/ggerganov/llama.cpp/pull/5196)
-  - [quantization tuning](https://github.com/ggerganov/llama.cpp/pull/5320), [another one](https://github.com/ggerganov/llama.cpp/pull/5334), and [another one](https://github.com/ggerganov/llama.cpp/pull/5361)
+  - [#2707](https://github.com/ggml-org/llama.cpp/pull/2707)
+  - [#2807](https://github.com/ggml-org/llama.cpp/pull/2807)
+  - [#4773 - 2-bit i-quants (inference)](https://github.com/ggml-org/llama.cpp/pull/4773)
+  - [#4856 - 2-bit i-quants (inference)](https://github.com/ggml-org/llama.cpp/pull/4856)
+  - [#4861 - importance matrix](https://github.com/ggml-org/llama.cpp/pull/4861)
+  - [#4872 - MoE models](https://github.com/ggml-org/llama.cpp/pull/4872)
+  - [#4897 - 2-bit quantization](https://github.com/ggml-org/llama.cpp/pull/4897)
+  - [#4930 - imatrix for all k-quants](https://github.com/ggml-org/llama.cpp/pull/4930)
+  - [#4951 - imatrix on the GPU](https://github.com/ggml-org/llama.cpp/pull/4957)
+  - [#4969 - imatrix for legacy quants](https://github.com/ggml-org/llama.cpp/pull/4969)
+  - [#4996 - k-quants tuning](https://github.com/ggml-org/llama.cpp/pull/4996)
+  - [#5060 - Q3_K_XS](https://github.com/ggml-org/llama.cpp/pull/5060)
+  - [#5196 - 3-bit i-quants](https://github.com/ggml-org/llama.cpp/pull/5196)
+  - [quantization tuning](https://github.com/ggml-org/llama.cpp/pull/5320), [another one](https://github.com/ggml-org/llama.cpp/pull/5334), and [another one](https://github.com/ggml-org/llama.cpp/pull/5361)

 **Llama 2 7B**

@@ -3,7 +3,7 @@
 Demonstration of simple retrieval technique based on cosine similarity

 More info:
-https://github.com/ggerganov/llama.cpp/pull/6193
+https://github.com/ggml-org/llama.cpp/pull/6193

 ### How to use

@@ -5,7 +5,7 @@ option(LLAMA_SERVER_SSL "Build SSL support for the server" OFF)
 include_directories(${CMAKE_CURRENT_SOURCE_DIR} ${CMAKE_CURRENT_BINARY_DIR})

 if (MINGW)
-    # fix: https://github.com/ggerganov/llama.cpp/actions/runs/9651004652/job/26617901362?pr=8006
+    # fix: https://github.com/ggml-org/llama.cpp/actions/runs/9651004652/job/26617901362?pr=8006
     add_compile_definitions(_WIN32_WINNT=${GGML_WIN_VER})
 endif()

@@ -7,14 +7,14 @@ Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
 **Features:**
 * LLM inference of F16 and quantized models on GPU and CPU
 * [OpenAI API](https://github.com/openai/openai-openapi) compatible chat completions and embeddings routes
-* Reranking endoint (WIP: https://github.com/ggerganov/llama.cpp/pull/9510)
+* Reranking endoint (WIP: https://github.com/ggml-org/llama.cpp/pull/9510)
 * Parallel decoding with multi-user support
 * Continuous batching
 * Multimodal (wip)
 * Monitoring endpoints
 * Schema-constrained JSON response format

-The project is under active development, and we are [looking for feedback and contributors](https://github.com/ggerganov/llama.cpp/issues/4216).
+The project is under active development, and we are [looking for feedback and contributors](https://github.com/ggml-org/llama.cpp/issues/4216).

 ## Usage

@@ -65,7 +65,7 @@ The project is under active development, and we are [looking for feedback and co
 | `-np, --parallel N` | number of parallel sequences to decode (default: 1)<br/>(env: LLAMA_ARG_N_PARALLEL) |
 | `--mlock` | force system to keep model in RAM rather than swapping or compressing<br/>(env: LLAMA_ARG_MLOCK) |
 | `--no-mmap` | do not memory-map model (slower load but may reduce pageouts if not using mlock)<br/>(env: LLAMA_ARG_NO_MMAP) |
-| `--numa TYPE` | attempt optimizations that help on some NUMA systems<br/>- distribute: spread execution evenly over all nodes<br/>- isolate: only spawn threads on CPUs on the node that execution started on<br/>- numactl: use the CPU map provided by numactl<br/>if run without this previously, it is recommended to drop the system page cache before using this<br/>see https://github.com/ggerganov/llama.cpp/issues/1437<br/>(env: LLAMA_ARG_NUMA) |
+| `--numa TYPE` | attempt optimizations that help on some NUMA systems<br/>- distribute: spread execution evenly over all nodes<br/>- isolate: only spawn threads on CPUs on the node that execution started on<br/>- numactl: use the CPU map provided by numactl<br/>if run without this previously, it is recommended to drop the system page cache before using this<br/>see https://github.com/ggml-org/llama.cpp/issues/1437<br/>(env: LLAMA_ARG_NUMA) |
 | `-dev, --device <dev1,dev2,..>` | comma-separated list of devices to use for offloading (none = don't offload)<br/>use --list-devices to see a list of available devices<br/>(env: LLAMA_ARG_DEVICE) |
 | `--list-devices` | print list of available devices and exit |
 | `-ngl, --gpu-layers, --n-gpu-layers N` | number of layers to store in VRAM<br/>(env: LLAMA_ARG_N_GPU_LAYERS) |
@@ -178,7 +178,7 @@ Example usage of docker compose with environment variables:
 ```yml
 services:
   llamacpp-server:
-    image: ghcr.io/ggerganov/llama.cpp:server
+    image: ghcr.io/ggml-org/llama.cpp:server
     ports:
     - 8080:8080
     volumes:
@@ -273,10 +273,10 @@ You can consume the endpoints with Postman or NodeJS with axios library. You can
 ### Docker

 ```bash
-docker run -p 8080:8080 -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:server -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080
+docker run -p 8080:8080 -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:server -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080

 # or, with CUDA:
-docker run -p 8080:8080 -v /path/to/models:/models --gpus all ghcr.io/ggerganov/llama.cpp:server-cuda -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080 --n-gpu-layers 99
+docker run -p 8080:8080 -v /path/to/models:/models --gpus all ghcr.io/ggml-org/llama.cpp:server-cuda -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080 --n-gpu-layers 99
 ```

 ## Testing with CURL

@@ -1066,7 +1066,7 @@ print(completion.choices[0].text)

 ### POST `/v1/chat/completions`: OpenAI-compatible Chat Completions API

-Given a ChatML-formatted json description in `messages`, it returns the predicted completion. Both synchronous and streaming mode are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with OpenAI API spec is being made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.
+Given a ChatML-formatted json description in `messages`, it returns the predicted completion. Both synchronous and streaming mode are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with OpenAI API spec is being made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggml-org/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.

 *Options:*

@@ -1120,7 +1120,7 @@ curl http://localhost:8080/v1/chat/completions \

 *Tool call support*

-[Function calling](https://platform.openai.com/docs/guides/function-calling) is supported for all models (see https://github.com/ggerganov/llama.cpp/pull/9639):
+[Function calling](https://platform.openai.com/docs/guides/function-calling) is supported for all models (see https://github.com/ggml-org/llama.cpp/pull/9639):

 - Requires `--jinja` flag
 - Native tool call formats supported:
@@ -1599,7 +1599,7 @@ Apart from error types supported by OAI, we also have custom types that are spec

 ### Legacy completion web UI

-A new chat-based UI has replaced the old completion-based since [this PR](https://github.com/ggerganov/llama.cpp/pull/10175). If you want to use the old completion, start the server with `--path ./examples/server/public_legacy`
+A new chat-based UI has replaced the old completion-based since [this PR](https://github.com/ggml-org/llama.cpp/pull/10175). If you want to use the old completion, start the server with `--path ./examples/server/public_legacy`

 For example:

@@ -42,7 +42,7 @@ enum stop_type {
     STOP_TYPE_LIMIT,
 };

-// state diagram: https://github.com/ggerganov/llama.cpp/pull/9283
+// state diagram: https://github.com/ggml-org/llama.cpp/pull/9283
 enum slot_state {
     SLOT_STATE_IDLE,
     SLOT_STATE_STARTED, // TODO: this state is only used for setting up the initial prompt processing; maybe merge it with launch_slot_with_task in the future
@@ -367,10 +367,10 @@ inline std::string format_chat(const common_chat_template & tmpl, const std::vec
                    }
                }
            } else {
-                throw std::runtime_error("Invalid 'content' type (ref: https://github.com/ggerganov/llama.cpp/issues/8367)");
+                throw std::runtime_error("Invalid 'content' type (ref: https://github.com/ggml-org/llama.cpp/issues/8367)");
            }
        } else {
-            throw std::runtime_error("Missing 'content' (ref: https://github.com/ggerganov/llama.cpp/issues/8367)");
+            throw std::runtime_error("Missing 'content' (ref: https://github.com/ggml-org/llama.cpp/issues/8367)");
        }

        chat.push_back({role, content, /* tool_calls= */ {}});
@@ -1,6 +1,6 @@
 # llama.cpp/example/simple-cmake-pkg

-This program builds [simple](../simple) using a relocatable CMake package. It serves as an example of using the `find_package()` CMake command to conveniently include [llama.cpp](https://github.com/ggerganov/llama.cpp) in projects which live outside of the source tree.
+This program builds [simple](../simple) using a relocatable CMake package. It serves as an example of using the `find_package()` CMake command to conveniently include [llama.cpp](https://github.com/ggml-org/llama.cpp) in projects which live outside of the source tree.

 ## Building

@@ -13,7 +13,7 @@ When hardware acceleration libraries are used (e.g. CUDA, Metal, Vulkan, etc.),
 ### Build llama.cpp and install to llama.cpp/inst

 ```sh
-git clone https://github.com/ggerganov/llama.cpp
+git clone https://github.com/ggml-org/llama.cpp
 cd llama.cpp
 cmake -S . -B build
 cmake --build build
@@ -4,6 +4,6 @@ Demonstration of speculative decoding and tree-based speculative decoding techni

 More info:

-- https://github.com/ggerganov/llama.cpp/pull/2926
-- https://github.com/ggerganov/llama.cpp/pull/3624
-- https://github.com/ggerganov/llama.cpp/pull/5625
+- https://github.com/ggml-org/llama.cpp/pull/2926
+- https://github.com/ggml-org/llama.cpp/pull/3624
+- https://github.com/ggml-org/llama.cpp/pull/5625
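As a follow-up sanity check (not part of this commit), one could confirm that no stale links to the old organization remain in the tree:

```bash
# Expect no matches once the migration is complete.
git grep -nE '(github\.com|ghcr\.io)/ggerganov/llama\.cpp' || echo "no stale links found"
```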