Is there any method I can use to remove unwanted tokens from the tokenizer?
Referring to #4827, I tried to remove tokens from the tokenizer with the following code.
First, I fetch the tokenizer from the Hugging Face Hub.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")
print(len(tokenizer.vocab))  # 32000
From the fetched tokenizer, I tried to remove tokens such as [unused363].
So I first extracted all tokens containing "unused" and then deleted them.
# get all tokens with "unused" in target_tokenizer
unwanted_words = []
for word in tokenizer.vocab:
    if "unused" in word:
        unwanted_words.append(word)

# remove all unwanted tokens from target_tokenizer
for word in unwanted_words:
    del tokenizer.vocab[word]

print(len(tokenizer.vocab))  # still 32000
Apparently, del didn't do its job.
The list unwanted_words has 500 elements, but none of them were actually removed from the tokenizer.
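As far as I can tell, this is because tokenizer.vocab on a fast tokenizer (the AutoTokenizer default) is rebuilt from the underlying Rust tokenizer on every access, so del only mutates a throwaway copy. A minimal sketch of the check I did, reusing the [unused363] token from above:

vocab_snapshot = tokenizer.vocab
print(vocab_snapshot is tokenizer.vocab)  # False: each access returns a fresh dict
del vocab_snapshot["[unused363]"]         # only the snapshot changes
print(len(tokenizer.vocab))               # still 32000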
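One workaround I can think of (an untested sketch, assuming klue/roberta-base ships a BERT-style vocab.txt) is to dump the vocab, filter it, and rebuild a slow tokenizer from the filtered file:

from transformers import BertTokenizer

# keep the original id order, drop the "unused" tokens
kept = [tok for tok, idx in sorted(tokenizer.vocab.items(), key=lambda kv: kv[1])
        if "unused" not in tok]
with open("filtered_vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(kept) + "\n")

slim_tokenizer = BertTokenizer("filtered_vocab.txt")
print(len(slim_tokenizer.vocab))  # 31500 if exactly 500 tokens were dropped

But this shifts every token id after the removed ones, so the new tokenizer would no longer line up with the model's embedding matrix unless the embeddings are shrunk to match.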
Are there any other methods I can refer to?