Open
Labels
P2 (Priority of the issue for triage purpose: Needs to be fixed at some point.)
enhancement (New feature or request)
Description
The WordPiece algorithm should be added to Microsoft.ML.Tokenizers. WordPiece is the basis for BERTTokenizer-based models and is needed for E5.
Reference implementations are available in:
https://github.com/microsoft/BlingFire (MIT license)
https://github.com/huggingface/tokenizers (Apache license)
The paper it is based on:
https://arxiv.org/abs/1609.08144
https://arxiv.org/pdf/1609.08144.pdf
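For orientation, here is a minimal sketch of the tokenization (inference) step as it is commonly done in BERT-style WordPiece tokenizers: greedy longest-match-first lookup against a fixed vocabulary. It is written in Python only for brevity and is not the proposed Microsoft.ML.Tokenizers API. The function name `wordpiece_tokenize`, its parameters, and the toy vocabulary are hypothetical; the `##` continuation prefix and the `[UNK]` token follow the BERT convention and are assumptions here. Vocabulary training is not covered.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##", max_chars=100):
    """Split one pre-tokenized word into WordPiece tokens (illustrative sketch)."""
    # Very long words are mapped to the unknown token (assumed limit).
    if len(word) > max_chars:
        return [unk_token]

    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the continuation prefix.
                piece = prefix + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No vocabulary entry covers this position: the whole word is unknown.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Toy vocabulary, made up for this example.
vocab = {"un", "##aff", "##able", "[UNK]"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```

A full tokenizer would additionally need the normalization and pre-tokenization steps (lowercasing, punctuation and whitespace splitting) that run before this per-word lookup.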