metal : matrix-matrix multiplication kernel (#2615)
* metal: matrix-matrix multiplication kernel This commit removes MPS and uses custom matrix-matrix multiplication kernels for all quantization types. This commit also adds grouped-query attention to support llama2 70B. * metal: fix performance degradation from gqa Integers are slow on the GPU, and 64-bit divides are extremely slow. In the context of GQA, we introduce a 64-bit divide that cannot be optimized out by the compiler, which results in a decrease of ~8% in inference performance. This commit fixes that issue by calculating a part of the offset with a 32-bit divide. Naturally, this limits the size of a single matrix to ~4GB. However, this limitation should suffice for the near future. * metal: fix bugs for GQA and perplexity test. I mixed up ne02 and nb02 in previous commit.
This commit is contained in:
parent
b5ffb2849d
commit
bf83bff674
6 changed files with 528 additions and 636 deletions
2
Makefile
2
Makefile
|
@ -283,7 +283,7 @@ endif # LLAMA_CLBLAST
|
|||
ifdef LLAMA_METAL
|
||||
CFLAGS += -DGGML_USE_METAL -DGGML_METAL_NDEBUG
|
||||
CXXFLAGS += -DGGML_USE_METAL
|
||||
LDFLAGS += -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
|
||||
LDFLAGS += -framework Foundation -framework Metal -framework MetalKit
|
||||
OBJS += ggml-metal.o
|
||||
endif # LLAMA_METAL
|
||||
|
||||
|
|
Loading…
Add table
Add a link
Reference in a new issue