Introduction of CUDA Graphs to LLama.cpp (#6766)

* DRAFT: Introduction of CUDA Graphs to LLama.cpp

* Fix issues raised in comments

* Tidied to now only use CUDA runtime (not mixed with driver calls)

* disable for multi-gpu and batch size > 1

* Disable CUDA graphs for old GPU arch and with env var

* added missing CUDA_CHECKs

* Addressed comments

* further addressed comments

* limit to GGML_ALLOW_CUDA_GRAPHS defined in llama.cpp cmake

* Added more comprehensive graph node checking

* With mechanism to fall back if graph capture fails

* Revert "With mechanism to fall back if graph capture fails"

This reverts commit eb9f15fb6fcb81384f732c4601a5b25c016a5143.

* Fall back if graph capture fails and address other comments

* - renamed GGML_ALLOW_CUDA_GRAPHS to GGML_CUDA_USE_GRAPHS

- rename env variable to disable CUDA graphs to GGML_CUDA_DISABLE_GRAPHS

- updated Makefile build to enable CUDA graphs

- removed graph capture failure checking in ggml_cuda_error
  using a global variable to track this is not thread safe, but I am also not satisfied with checking an error by string
  if this is necessary to work around some issues with graph capture with e.g. cuBLAS, we can pass the ggml_backend_cuda_context to the error checking macro and store the result in the context

- fixed several resource leaks

- fixed issue with zero node graphs

- changed fixed size arrays to vectors

- removed the count of number of evaluations before start capturing, and instead changed the capture mode to relaxed

- removed the check for multiple devices so that it is still possible to use a single device, instead checks for split buffers to disable cuda graphs with -sm row

- changed the op for checking batch size to GGML_OP_ADD, should be more reliable than GGML_OP_SOFT_MAX

- code style fixes

- things to look into
  - VRAM usage of the cudaGraphExec_t, if it is significant we may need to make it optional
  - possibility of using cudaStreamBeginCaptureToGraph to keep track of which ggml graph nodes correspond to which cuda graph nodes

* fix build without cuda graphs

* remove outdated comment

* replace minimum cc value with a constant
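The opt-outs listed above (environment variable, old GPU architectures, batch size > 1, split buffers) can be pictured as one host-side gate. The following is only a sketch under stated assumptions: `GraphEligibilityInput`, `cuda_graphs_allowed`, and `kMinGraphComputeCapability` are hypothetical names, and the threshold value 800 is an assumption, since the commit message only says the minimum compute capability was replaced with a constant. Only the environment variable name `GGML_CUDA_DISABLE_GRAPHS` comes from the commit message.

```cpp
#include <cstdlib>  // std::getenv

// Hypothetical threshold: the commit only states that the minimum compute
// capability became a named constant. The value 800 is an assumption.
constexpr int kMinGraphComputeCapability = 800;

// Hypothetical summary of the conditions named in the commit message.
struct GraphEligibilityInput {
    int  compute_capability;  // assumed encoding: 100*major + 10*minor, e.g. 860
    bool batch_size_gt_one;   // the commit infers this from GGML_OP_ADD nodes
    bool uses_split_buffers;  // the case the commit associates with -sm row
};

// Returns true only when none of the documented opt-outs applies.
inline bool cuda_graphs_allowed(const GraphEligibilityInput & in) {
    // Setting the variable to any value disables CUDA graphs in this sketch.
    if (std::getenv("GGML_CUDA_DISABLE_GRAPHS") != nullptr) {
        return false;
    }
    if (in.compute_capability < kMinGraphComputeCapability) {
        return false;
    }
    if (in.batch_size_gt_one) {
        return false;
    }
    if (in.uses_split_buffers) {
        return false;
    }
    return true;
}
```

Keeping the checks in one pure function makes it easy to fall back to regular stream launches whenever any single condition fails; the actual backend may order or combine these checks differently.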

---------

Co-authored-by: slaren <slarengh@gmail.com>
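The "more comprehensive graph node checking" item suggests comparing the incoming ggml graph against what was previously captured before reusing a CUDA graph. A minimal sketch of that idea follows, assuming a per-node snapshot of the data pointer, op, shape, strides, and source pointers. `NodeSnapshot` and `graph_update_required` are hypothetical names chosen for illustration, and the set of compared fields is an assumption, not the commit's actual list.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical snapshot of the properties of one graph node.
struct NodeSnapshot {
    const void *              data;  // address of the node's output buffer
    int                       op;    // operation id
    std::array<int64_t, 4>    ne;    // number of elements per dimension
    std::array<std::size_t, 4> nb;   // stride in bytes per dimension
    std::vector<const void *> src;   // addresses of the source buffers

    bool operator==(const NodeSnapshot & o) const {
        return data == o.data && op == o.op && ne == o.ne && nb == o.nb && src == o.src;
    }
};

// Returns true when the captured graph no longer matches the current one
// (different node count or any differing property). In that case the cache
// is refreshed so that the next identical graph is reported as unchanged.
inline bool graph_update_required(std::vector<NodeSnapshot> &       cached,
                                  const std::vector<NodeSnapshot> & current) {
    const bool changed = !(cached == current);
    if (changed) {
        cached = current;
    }
    return changed;
}
```

A caller would re-capture or update the CUDA graph when this returns true and simply relaunch the existing executable graph otherwise.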

agray3, 2024-05-08 21:55:49 +01:00 • committed by GitHub

parent c12452c7ae

commit bc4bba364f

No known key found for this signature in database

GPG key ID: B5690EEEBB952194

11 changed files with 372 additions and 44 deletions

ggml-cuda/mmvq.cu (6 lines changed)

@@ -89,8 +89,7 @@ static void mul_mat_vec_q_cuda(
     GGML_ASSERT(ncols_x % qk == 0);
     GGML_ASSERT(ncols_y <= MMVQ_MAX_BATCH_SIZE);
 
-    int id;
-    CUDA_CHECK(cudaGetDevice(&id));
+    int id = ggml_cuda_get_device();
 
     int64_t nwarps = 1;
     int64_t rows_per_cuda_block = 1;
@@ -328,8 +327,7 @@ void ggml_cuda_op_mul_mat_vec_q(
 
     const int64_t ne0 = dst->ne[0];
 
-    int id;
-    CUDA_CHECK(cudaGetDevice(&id));
+    int id = ggml_cuda_get_device();
 
     // the main device has a larger memory buffer to hold the results from all GPUs
     // nrows_dst == nrows of the matrix that the kernel writes into
