CUDA: use mma PTX instructions for FlashAttention (#11583)

* CUDA: use mma PTX instructions for FlashAttention

* __shfl_sync workaround for movmatrix (see the sketch below)

* add __shfl_sync to HIP

Co-authored-by: Diego Devesa <slarengh@gmail.com>
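
The movmatrix workaround mentioned above refers to the movmatrix.sync.aligned.m8n8.trans.b16 PTX instruction, which transposes an 8x8 tile of 16-bit values spread across a warp; on targets that lack the instruction, the same transpose can be emulated with warp shuffles. The following is only a rough sketch of that idea, assuming the standard mma fragment layout and a made-up helper name, not the commit's actual code:

static __device__ __forceinline__ int movmatrix_via_shfl(const int x) {
    // Assumed input layout (standard mma fragment layout): lane l holds row l/4
    // of an 8x8 b16 matrix A, columns 2*(l%4) (low 16 bits) and 2*(l%4)+1 (high 16 bits).
    const int lane = threadIdx.x % 32;
    const int row  = lane / 4;        // row this lane owns in the input fragment
    const int col  = 2 * (lane % 4);  // first column this lane owns

    // After the transpose this lane must hold T[row][col]   = A[col][row]
    // and                                     T[row][col+1] = A[col+1][row].
    const int src_lo = 4*(col + 0) + row/2;  // lane holding A[col  ][row]
    const int src_hi = 4*(col + 1) + row/2;  // lane holding A[col+1][row]

    const int v_lo = __shfl_sync(0xFFFFFFFF, x, src_lo, 32);
    const int v_hi = __shfl_sync(0xFFFFFFFF, x, src_hi, 32);

    // Even source rows sit in the low 16 bits of the fetched register, odd rows in the high 16 bits.
    const int e_lo = (row % 2 == 0) ? (v_lo & 0xFFFF) : ((v_lo >> 16) & 0xFFFF);
    const int e_hi = (row % 2 == 0) ? (v_hi & 0xFFFF) : ((v_hi >> 16) & 0xFFFF);

    return e_lo | (e_hi << 16);
}

On HIP, __shfl_sync is not provided by default, which is presumably why the commit also adds a __shfl_sync shim for HIP builds.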
Johannes Gäßler, 2025-02-02 19:31:09 +01:00, committed by GitHub
parent 84ec8a58f7
commit 864a0b67a6
29 changed files with 2058 additions and 998 deletions

@@ -132,7 +132,7 @@ bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11) {
         return false;
     }
-    if (int8_mma_available(cc)) {
+    if (new_mma_available(cc)) {
         return true;
     }
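
For context, ggml_cuda_should_use_mmq() now gates its early exit on new_mma_available() instead of int8_mma_available(). The name comes from the diff above, but the body below is only an assumed sketch of what such a predicate could check; the real helper is defined elsewhere in the CUDA backend and also has to account for HIP/AMD targets:

// Hypothetical sketch only: the mma-PTX-based kernels need Turing (sm_75)
// or newer, where the ldmatrix/mma.sync instructions they rely on exist.
static bool new_mma_available(const int cc) {
    return cc >= 750;  // 750 == compute capability 7.5 (Turing); assumed cutoff
}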