Implement Sentencepiece Unigram tokenizer #7186

@arthurvb

Description

Is your feature request related to a problem? Please describe.
I want to use a multilingual model from Hugging Face (https://huggingface.co/intfloat/multilingual-e5-large). Its tokenizer is a SentencePiece Unigram tokenizer, which Microsoft.ML.Tokenizers does not support, so I am unable to port the model to C#/ONNX.

Describe the solution you'd like
Support for the SentencePiece Unigram tokenizer in the Microsoft.ML.Tokenizers package.

Describe alternatives you've considered
BlingFire, but it no longer seems to be maintained, and it is unclear whether it would return exactly the same token IDs.

Thank you for your time and effort (the library in general is great!)
