Add quantize script for batch quantization (#92)

* Add quantize script for batch quantization

* Indentation

* README for new quantize.sh

* Fix script name

* Fix file list on Mac OS

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Pavol Rusnak authored 2023-03-13 17:15:20 +01:00, committed by GitHub
parent 1808ee0500
commit d1f224712d
2 changed files with 18 additions and 31 deletions

@@ -145,44 +145,16 @@ python3 -m pip install torch numpy sentencepiece
python3 convert-pth-to-ggml.py models/7B/ 1
# quantize the model to 4-bits
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
./quantize.sh 7B
# run the inference
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128
```
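The new `./quantize.sh 7B` call above replaces the per-file `./quantize` invocations with a single batch step. The script itself is the second file changed by this commit and is not shown in this hunk, so the following is only a minimal sketch of how such a batch wrapper could look, assuming the `models/<SIZE>/ggml-model-f16.bin*` layout and the `./quantize <f16-in> <q4_0-out> 2` invocation used elsewhere in this README:
```bash
#!/usr/bin/env bash
# Sketch of a batch quantizer -- not the actual quantize.sh added by this commit.
# Usage: ./quantize.sh 7B|13B|30B|65B
set -e

if [ -z "$1" ]; then
    echo "Usage: $0 MODEL_SIZE (e.g. 7B, 13B, 30B, 65B)"
    exit 1
fi

# Quantize every f16 part of the chosen model (larger models are split
# into ggml-model-f16.bin, ggml-model-f16.bin.1, ... as described below).
for f in ./models/"$1"/ggml-model-f16.bin*; do
    ./quantize "$f" "${f/f16/q4_0}" 2
done
```
Invoked as `./quantize.sh 7B`, a script along these lines would produce `./models/7B/ggml-model-q4_0.bin`, matching the inference command above.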
For the larger models, there are a few extra quantization steps. For example, converting LLaMA-13B to FP16 format
creates two ggml files instead of one:
```bash
ggml-model-f16.bin
ggml-model-f16.bin.1
```
You need to quantize each of them separately like this:
```bash
./quantize ./models/13B/ggml-model-f16.bin ./models/13B/ggml-model-q4_0.bin 2
./quantize ./models/13B/ggml-model-f16.bin.1 ./models/13B/ggml-model-q4_0.bin.1 2
```
Everything else is the same. Simply run:
```bash
./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128
```
The number of files generated for each model is as follows:
```
7B -> 1 file
13B -> 2 files
30B -> 4 files
65B -> 8 files
```
When running the larger models, make sure you have enough disk space to store all the intermediate files.
TODO: add model disk/mem requirements
### Interactive mode
If you want a more ChatGPT-like experience, you can run in interactive mode by passing `-i` as a parameter.
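For example, reusing the 7B invocation from the quantization steps above, interactive mode would look something like this:
```bash
# same command as above, with -i added to enable interactive mode
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128 -i
```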