
[Tokenizers] Port LLaMA Tokenizer and SentencePiece algorithm #6987

@ericstj

Description


The SentencePiece algorithm should be added to Microsoft.ML.Tokenizers. It is a dependency of LLaMATokenizer, which we also wish to enable.

We can see reference implementations in:
https://github.com/microsoft/BlingFire (MIT license)
https://github.com/google/sentencepiece (Apache license)
https://github.com/huggingface/tokenizers (Apache license)
https://huggingface.co/docs/transformers/main/en/model_doc/llama
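For orientation, the LLaMA tokenizer uses a SentencePiece model in BPE mode: the input is treated as raw text, whitespace is mapped to the U+2581 ("▁") marker, and character-level pieces are merged pairwise by vocabulary score. The following is a minimal illustrative sketch of that encoding loop only; it is not a proposal for the Microsoft.ML.Tokenizers API, and the caller-supplied score dictionary stands in for data a real tokenizer would load from the .model file.

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative only: a stand-in for SentencePiece BPE encoding, with the
// piece scores supplied by the caller instead of being read from a model.
static class SentencePieceBpeSketch
{
    public static List<string> Encode(string text, IReadOnlyDictionary<string, double> pieceScores)
    {
        // SentencePiece preserves whitespace information by mapping spaces to U+2581 ("▁").
        string normalized = "\u2581" + text.Replace(' ', '\u2581');

        // Start from single characters; a real implementation adds byte/char
        // fallback for anything outside the vocabulary.
        List<string> pieces = normalized.Select(c => c.ToString()).ToList();

        while (true)
        {
            int bestIndex = -1;
            double bestScore = double.NegativeInfinity;

            // Pick the adjacent pair whose concatenation is a known piece
            // with the highest score.
            for (int i = 0; i + 1 < pieces.Count; i++)
            {
                if (pieceScores.TryGetValue(pieces[i] + pieces[i + 1], out double score) && score > bestScore)
                {
                    bestScore = score;
                    bestIndex = i;
                }
            }

            if (bestIndex < 0)
                break; // no merge applies; the remaining pieces are the output

            pieces[bestIndex] += pieces[bestIndex + 1];
            pieces.RemoveAt(bestIndex + 1);
        }

        return pieces;
    }
}
```

The rescanning loop above is O(n²) per merge; production implementations typically drive the merges from a priority queue over candidate pairs instead, but the merge rule is the same.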

Hugging Face also has Llama 2; it would be worth understanding whether that should be included now, or whether we should design for its later inclusion.

LLaMA Tokenizer:
https://arxiv.org/abs/2203.13474
https://arxiv.org/pdf/2203.13474.pdf

SentencePiece:
https://arxiv.org/abs/1808.06226
https://arxiv.org/pdf/1808.06226.pdf
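The SentencePiece paper's default model is a unigram LM, where encoding is a Viterbi search for the segmentation with the highest total piece log-probability. A minimal sketch of that search, again assuming the pieces and their log-probabilities are already loaded, and that every single character has an entry (e.g. via character/byte fallback) so a segmentation always exists:

```csharp
using System.Collections.Generic;

// Illustrative only: Viterbi segmentation for a SentencePiece unigram model,
// with piece log-probabilities supplied by the caller.
static class UnigramViterbiSketch
{
    public static List<string> Segment(string text, IReadOnlyDictionary<string, double> pieceLogProbs)
    {
        int n = text.Length;
        var bestScore = new double[n + 1]; // bestScore[e] = best log-prob of any segmentation of text[0..e)
        var bestStart = new int[n + 1];    // start index of the last piece in that best segmentation
        for (int i = 1; i <= n; i++)
            bestScore[i] = double.NegativeInfinity;

        for (int end = 1; end <= n; end++)
        {
            for (int start = 0; start < end; start++)
            {
                // Assumes every single character is in the vocabulary (fallback),
                // so every prefix stays reachable.
                if (!pieceLogProbs.TryGetValue(text.Substring(start, end - start), out double logProb))
                    continue;

                double score = bestScore[start] + logProb;
                if (score > bestScore[end])
                {
                    bestScore[end] = score;
                    bestStart[end] = start;
                }
            }
        }

        // Walk back from the end of the text to recover the best segmentation.
        var pieces = new List<string>();
        for (int end = n; end > 0; end = bestStart[end])
            pieces.Insert(0, text.Substring(bestStart[end], end - bestStart[end]));

        return pieces;
    }
}
```

The reference implementation restricts the inner loop to vocabulary matches found with a double-array trie rather than probing every substring, but the recurrence is the same.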
