Removing tokens from the tokenizer #15032

@snoop2head

Description

Are there any methods to remove unwanted tokens from the tokenizer?

Referring to #4827, I tried to remove tokens from the tokenizer with the code below.

First, I fetch the tokenizer from the Hugging Face Hub.

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")
print(len(tokenizer.vocab))  # 32000
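
As a sanity check, the placeholder tokens are indeed in the vocab at this point:

print("[unused363]" in tokenizer.vocab)  # True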

From the fetched tokenizer, I tried to remove tokens such as [unused363], so I first collected every token containing 'unused' and then deleted them.

# get all tokens with "unused" in target_tokenizer
unwanted_words = []
for word in tokenizer.vocab:
    if "unused" in word:
        unwanted_words.append(word)

# remove all unwanted tokens from target_tokenizer
for word in unwanted_words:
    del tokenizer.vocab[word]

print(len(tokenizer.vocab))  # still 32000

Apparently, del didn't do its job: the list unwanted_words has 500 elements, but none of them were removed from the tokenizer.
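
If I understand the fast tokenizer correctly, this happens because the vocab property rebuilds a plain Python dict from the underlying tokenizer on every access, so del only mutates a throwaway copy. A minimal check of that assumption:

v1 = tokenizer.vocab
v2 = tokenizer.vocab
print(v1 is v2)  # False: each access builds a fresh dict

del v1["[unused363]"]
print("[unused363]" in tokenizer.vocab)  # True: the tokenizer itself is untouched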

Are there any other methods I can refer to?
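
For reference, the one direction that seems plausible to me (a minimal sketch, not a verified solution) is to filter the checkpoint's vocab.txt on disk and rebuild a slow WordPiece tokenizer from the pruned file. Note that every token id after the first removal shifts down, so a pretrained model's embedding matrix would need the same reordering. The local directory and file names below are made up for illustration:

from transformers import AutoTokenizer, BertTokenizer

tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")
tokenizer.save_pretrained("klue-roberta-base-local")  # hypothetical local directory

# drop every "[unused...]" line from the WordPiece vocab file
src = "klue-roberta-base-local/vocab.txt"
dst = "klue-roberta-base-local/vocab-pruned.txt"  # hypothetical file name
with open(src, encoding="utf-8") as f:
    kept = [word for word in f.read().splitlines() if "unused" not in word]
with open(dst, "w", encoding="utf-8") as f:
    f.write("\n".join(kept) + "\n")

# rebuild a slow tokenizer from the pruned vocab; special-token settings
# should be copied from the original tokenizer_config.json
pruned = BertTokenizer(dst, do_lower_case=False)
print(len(pruned.vocab))  # 31500, if all 500 unused tokens were dropped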
