Description
The SentencePiece algorithm should be added to Microsoft.ML.Tokenizers. It is a dependency of LLaMATokenizer, which we also wish to enable.
Reference implementations and related documentation:
https://github.com/microsoft/BlingFire (MIT license)
https://github.com/google/sentencepiece (Apache license)
https://github.com/huggingface/tokenizers (Apache license)
https://huggingface.co/docs/transformers/main/en/model_doc/llama
Hugging Face also has Llama 2. It may be worth checking whether to include it now or to design for later inclusion.
LLaMA Tokenizer:
https://arxiv.org/abs/2203.13474
https://arxiv.org/pdf/2203.13474.pdf
SentencePiece:
https://arxiv.org/abs/1808.06226
https://arxiv.org/pdf/1808.06226.pdf