Make the mul_mat_vec shaders support N>1 (as a specialization constant, NUM_COLS), with the batch_strides overloaded to hold the row strides. Put the loads from the B matrix in the innermost loop, since that should cache better. Share the code that reduces the result values to memory in mul_mat_vec_base.
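
The loop structure described in the commit message can be illustrated with a small CPU-side reference sketch. This is not the shader code: it is plain Python, and the names `mul_mat_vec`, `reduce_result`, `stride_b`, `stride_d`, and `lanes` are hypothetical stand-ins chosen for illustration. It assumes B and D store their N columns one after another, separated by the strides that the shaders take from the overloaded batch strides.

```python
NUM_COLS = 2  # stands in for the shader's specialization constant (assumed value)


def reduce_result(temps, d, row, stride_d):
    """Shared write-back step: sum each column's partial sums and store the result."""
    for j, partials in enumerate(temps):
        d[j * stride_d + row] = sum(partials)


def mul_mat_vec(a, b, nrows, ncols_a, stride_b, stride_d, lanes=4):
    """Compute D = A * B for NUM_COLS columns of B, all stored as flat lists.

    a is row-major (nrows x ncols_a); column j of B starts at b[j * stride_b],
    and column j of D starts at d[j * stride_d].
    """
    d = [0.0] * (NUM_COLS * stride_d)
    for row in range(nrows):
        # temps[j][lane]: partial sum held by simulated invocation `lane` for column j
        temps = [[0.0] * lanes for _ in range(NUM_COLS)]
        for k in range(ncols_a):
            a_val = a[row * ncols_a + k]  # A is loaded once per (row, k)
            lane = k % lanes
            for j in range(NUM_COLS):  # B loads sit in the innermost loop
                temps[j][lane] += a_val * b[j * stride_b + k]
        reduce_result(temps, d, row, stride_d)
    return d


if __name__ == "__main__":
    a = [1, 2, 3,
         4, 5, 6]          # 2 x 3 matrix A
    b = [1, 0, 1,          # column 0 of B
         2, 1, 0]          # column 1 of B
    print(mul_mat_vec(a, b, nrows=2, ncols_a=3, stride_b=3, stride_d=2))
```

Each value of A is read once and reused for every column of B, which is the reuse the commit message is after. The single `reduce_result` helper mirrors the idea of sharing the write-back code across shader variants.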