llama.cpp/ggml-cuda

Latest commit: 76d66ee0be by Johannes Gäßler, 2024-06-14 18:41:49 +02:00
CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (#7921)

* CUDA: faster q2_K, q3_K MMQ + int8 tensor cores
* try CI fix
* fix data race
* revert q2_K precision related changes
template-instances/
acc.cu
acc.cuh
arange.cu
arange.cuh
argsort.cu            CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (#7921)       2024-06-14 18:41:49 +02:00
argsort.cuh
binbcast.cu
binbcast.cuh
clamp.cu
clamp.cuh
common.cuh            CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (#7921)       2024-06-14 18:41:49 +02:00
concat.cu
concat.cuh
convert.cu
convert.cuh
cpy.cu
cpy.cuh
dequantize.cuh
diagmask.cu
diagmask.cuh
dmmv.cu
dmmv.cuh
fattn-common.cuh      CUDA: use tensor cores for MMQ (#7676)                        2024-06-10 11:45:13 +02:00
fattn-tile-f16.cu     CUDA: use tensor cores for MMQ (#7676)                        2024-06-10 11:45:13 +02:00
fattn-tile-f16.cuh
fattn-tile-f32.cu
fattn-tile-f32.cuh
fattn-vec-f16.cuh     CUDA: use tensor cores for MMQ (#7676)                        2024-06-10 11:45:13 +02:00
fattn-vec-f32.cuh     CUDA: fix broken oob check for FA vec f32 kernel (#7904)      2024-06-12 17:41:51 +02:00
fattn-wmma-f16.cuh    CUDA: use tensor cores for MMQ (#7676)                        2024-06-10 11:45:13 +02:00
fattn.cu
fattn.cuh
getrows.cu
getrows.cuh
im2col.cu
im2col.cuh
mma.cuh               CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) (#7860)    2024-06-11 08:26:07 +02:00
mmq.cu
mmq.cuh               CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (#7921)       2024-06-14 18:41:49 +02:00
mmvq.cu
mmvq.cuh
norm.cu
norm.cuh
pad.cu
pad.cuh
pool2d.cu
pool2d.cuh
quantize.cu
quantize.cuh
rope.cu
rope.cuh
scale.cu
scale.cuh
softmax.cu            CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (#7921)       2024-06-14 18:41:49 +02:00
softmax.cuh
sumrows.cu
sumrows.cuh
tsembd.cu
tsembd.cuh
unary.cu              tests : add non-cont unary tests (#7857)                      2024-06-12 16:00:22 +03:00
unary.cuh
upscale.cu
upscale.cuh
vecdotq.cuh           CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (#7921)       2024-06-14 18:41:49 +02:00