
[Tokenizers] Implement WordPiece algorithm #6988

@ericstj


The WordPiece algorithm should be added to Microsoft.ML.Tokenizers. WordPiece is the basis for BERTTokenizer-based models and is needed for E5.

Reference implementations are available in:
https://github.com/microsoft/BlingFire (MIT license)
https://github.com/huggingface/tokenizers (Apache license)

The paper it is based on:
https://arxiv.org/abs/1609.08144
https://arxiv.org/pdf/1609.08144.pdf
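
For orientation, here is a minimal sketch of the tokenization (inference) side of WordPiece as it is commonly used by BERT-style tokenizers: greedy longest-match-first lookup over a fixed vocabulary. It is not the API proposed for Microsoft.ML.Tokenizers, and it does not cover vocabulary training. The function name `wordpiece_tokenize`, its parameters, and the toy vocabulary are hypothetical; the `##` continuation prefix and the `[UNK]` token follow the BERT convention and are assumptions here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##", max_chars=100):
    """Split one pre-tokenized word into pieces using greedy longest match."""
    # Very long inputs are mapped to the unknown token instead of being searched.
    if len(word) > max_chars:
        return [unk_token]

    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the continuation prefix.
                piece = prefix + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # Nothing in the vocabulary matches here: the whole word is unknown.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only.
vocab = {"un", "##aff", "##able", "play", "##ing", "[UNK]"}

print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
print(wordpiece_tokenize("xyz", vocab))        # ['[UNK]']
```

A real implementation would also need the pre-tokenization step (whitespace and punctuation splitting, optional lowercasing) and a mapping from tokens to vocabulary ids, which this sketch leaves out.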


Labels: P2 (priority of the issue for triage purposes: needs to be fixed at some point), enhancement (new feature or request)
