This restores the behavior from #491. This does not affect Ctrl+D's ability to
terminate --multiline-input lines (#1040).
This also actually implements #587: "If the user wants the text to end in a
newline, this should be accomplished by explicitly adding a newline by using
\ followed by return, then returning control by pressing return again."
Fixes#12949
* server : use std::move whenever possible
* use r-value ref
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* make task creation scoped
* restore std::move
* fix task_id not set correctly
* apply changes from suggestion
Co-authored-by: ggerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* mtmd : add more api around mtmd_image_tokens
* mtmd : ability to calc image hash
* shared_ptr for mtmd_image_tokens
* move hash to user-define ID (fixed)
* fix prompt_modified
* rm redundant data member
Add RPC_CMD_HELLO for getting the version of the protocol implemend by
the server. Follow the semantic versioning rules at https://semver.org
Hopefully this bring better user experience when we make breaking
changes at the protocol level and avoid issues like #12465
* graph : make mla compatible with FA
* metal : add exp FA kernels for DeepSeek models
ggml-ci
* llama : minor naming updates
ggml-ci
* ggml : disable FA for DS head sizes
* tests : add FA tests for MLA shapes
ggml-ci
Submit operators using asynchronous threads to improve performance.
Use the environment variable GGML_CANN_ASYNC_MODE to control whether
asynchronous submission is enabled. It is disabled by default.
Testing shows a 10%–20% performance improvement in scenarios with
small parameter sizes, especially in quantized models.
The Granite's FIM tokens are very similar to Qwen's; it's just that
they use underscore instead of a dash. So <fim_middle> for example
instead of <fim-middle>.
Opening up tokenizer_config.json in ibm-granite/granite-3.3-8b-base
shows:
```
"<fim_prefix>",
"<fim_middle>",
"<fim_suffix>",
"<fim_pad>",
...
"<reponame>",
```
The grouped query attention optmization doesn't require a power of two ratio,
the only thing relying on it was the modulo operation written as bitwise &.
split_k need not depend on gqa_ratio - enable it any time there's only one
workgroup in the X dimension. The shader gets the split index from the x coord,
and multiple workgroups in the X dimension (pre-split) indicates a larger
FA operation that wouldn't need splitting.
* opencl: refactor - split the kernel files
---------
Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com>
* opencl: split more kernels into separate files
* opencl: specify subgroup size instead of querying it
* opencl: refine Adreno cl compiler version parsing
* opencl: skip some kernels not used by Adreno on old compilers
* opencl: refine logic for selecting Adreno kernels
* opencl: refine Adreno cl compiler version
* opencl: cleanup preprocessor for kernels
* opencl: consider Adreno CL compiler on Windows
* opencl: add final newline for `mul_mv_f16_f16.cl`
---------
Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com>
Replace compile-time `GGML_HIP_UMA` with environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY`. This unifies the usage on NVIDIA and AMD GPUs, and allows a single binary to be shared between integrated and dedicated GPUs.
* Merged using squash to remove all noise commit messages
* Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large
* Removed 3 conts (2x RoPE and 1x RMS-norm)
* Changed to use `<cmath>` instead of `<math.h>`
* Reverted removal of the 3 conts
* Used `reshape` in `llm_graph_context::build_attn_mha()`
* Use `k_pe = ggml_reshape`
* Removed the 3 conts again
* Removed the 3D views of `wk_b` and `wv_b`, and just save and 3D in GGUF
* Removed MQA optimisation from `build_attn_mha()` as no gains now
* Simplified `is_mla` branch in `llm_build_deepseek2()`
* Removed `build_attn_mla` and added `nullptr` to all `build_atnn` calls
* Fixed call to `build_attn` in `llm_build_t5_enc`
Multiple optional memory pools are provided for CANN, including VMM,
priority queue-based, and traditional memory pools.
1.When the memory pool is available and GGML_CANN_DISABLE_VMM_POOL
is not defined, the VMM pool is selected by default.
2.Otherwise, if GGML_CANN_ENABLE_BUF_PRIO_POOL is defined,
the priority queue-based memory pool is used.
3.If neither condition is met, the default memory pool is used.
* Add llama_model_quantize_params parameters
* Add new quantize parameters parsing and validation
* Update usage
* Add new parameters defaults
* Add new quantization parameters logic
* Add llama_model_quantize_params parameters
* Add new quantize parameters parsing and validation
* Update usage
* Add new parameters defaults
* Add new quantization parameters logic
* Minor refactoring as per the contributors' coding guidelines
* Update descriptions to match existing style
* Add llama_model_quantize_params parameters
* Add new quantize parameters parsing and validation
* Update usage
* Add new parameters defaults
* Add new quantization parameters logic
* Minor refactoring as per the contributors' guidelines
* Implement general --tensor-type instead of tensor-specific command option
* Fix implied type bug
* Restore missing #includes
* Add regex capability for tensor selection
* Refactor function name and update ALLOWED_TENSOR_TYPE
* Add missing #include
* Handle edge case when tensor name is cls.output
* Minor logging improvement
The current usage of the SYCL-Graph extension checks for
the `sycl_ext_oneapi_graph` device aspect. However, it is also
possible to support `sycl_ext_oneapi_limied_graph` devices that
don't support update
* SYCL: Add fp16 support to some elementwise OP kernels
* remove comment
ggml-ci
* Use static_cast directly
* remove not needed cast from tanh
* Use static cast and remove unneeded castings
* Adjust device_support_op for unary OPs
* Use cast_data and typed_data struct to deduplicate casting code
This commit adds a check for the visionos build version used with vtool
in build-xcframework.sh. The script now checks the Xcode version and
determines whether to use "xros" or "visionos" for the build version.
This commit also uses xcrun for the vtool so that the version of vtool
in xcode command line tools is used instead of the one in the system
path.
Refs: https://github.com/ggml-org/whisper.cpp/pull/2994#issuecomment-2773292223