llama : lookup word in vocab before doing BPE merges (#7193)

* fix: llama-3 ignore_merges

* test: add test for llama-3 bpe ignore_merges

* fix: set ignore_merges only for llama-3

* fix: test-tokenizer-1-bpe --ingore-merges detection

* fix: copy to fix fallthrough

* fix: change ignore_merges to bool

* fix: add ignore merges tests to cmake

* llama : alternative merge ignore logic

---------

Co-authored-by: Haoxiang Fei <feihaoxiang@idea.edu.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit is contained in:
Haoxiang Fei 2024-05-11 16:12:06 +08:00 committed by GitHub
parent 5ae3426b0b
commit f99e1e456e
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
5 changed files with 44 additions and 5 deletions

View file

@ -104,3 +104,5 @@ __ggml_vocab_test__
🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天 ------======= нещо на Български ''''''```````""""......!!!!!!?????? I've been 'told he's there, 'RE you sure? 'M not sure I'll make it, 'D you like some tea? We'Ve a'lL
__ggml_vocab_test__
Việt
__ggml_vocab_test__