llama.cpp/ggml/src/ggml-cuda/fattn-wmma-f16.cuh at 658987cfc9d752dca7758987390d5fb1a7a0a54a - ver4a/llama.cpp - git.uncontrol.me


Johannes Gäßler 864a0b67a6

CUDA: use mma PTX instructions for FlashAttention (#11583)

* CUDA: use mma PTX instructions for FlashAttention

* __shfl_sync workaround for movmatrix

* add __shfl_sync to HIP

Co-authored-by: Diego Devesa <slarengh@gmail.com>

2025-02-02 19:31:09 +01:00

3 lines

115 B

Text


	`#include "common.cuh"`

	`void ggml_cuda_flash_attn_ext_wmma_f16(ggml_backend_cuda_context & ctx, ggml_tensor * dst);`