Environment info
- `transformers` version: 3.3.1
- Platform: Darwin-19.6.0-x86_64-i386-64bit
- Python version: 3.7.9
- PyTorch version (GPU?): 1.6.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help
@mfuntowicz, this seems to be an issue related to tokenization, so I hope you are the right person to ping here.
Information
Model I am using (Bert, XLNet ...): GPT-2 (`GPT2Tokenizer`)
The problem arises when using:
- my own modified scripts (minimal working example below)
The task I am working on is:
- my own task or dataset (see details below)
To reproduce
Minimal working example:
```python
import os

from tokenizers import ByteLevelBPETokenizer
from transformers import GPT2Tokenizer

os.makedirs('models/', exist_ok=True)

# Training data that contains the special token.
with open('train.txt', 'w') as f:
    f.write('Training data including a <special> token.')

special_tokens = ['<special>']

# Train a byte-level BPE tokenizer with the special token registered.
bpe_tokenizer = ByteLevelBPETokenizer()
bpe_tokenizer.train(files=['train.txt'], special_tokens=special_tokens)
bpe_tokenizer.save_model('models/')

# Load the trained vocabulary with the slow GPT-2 tokenizer,
# passing the same special token.
gpt2_tokenizer = GPT2Tokenizer.from_pretrained(
    'models/',
    additional_special_tokens=special_tokens,
)
```
When encoding the text below, the two tokenizers yield different outputs:
```python
>>> text = 'A <special> token.'
>>> bpe_tokenizer.encode(text).tokens
['A', 'Ġ', '<special>', 'Ġ', 't', 'o', 'k', 'e', 'n', '.']
>>> gpt2_tokenizer.tokenize(text)
['A', '<special>', 't', 'o', 'k', 'e', 'n', '.']  # <----- note the missing spaces (`Ġ`) around `<special>`
```
Expected behavior
I would expect both tokenizers to produce the same output when encoding the sentence. Furthermore, because `GPT2Tokenizer` appears to strip the spaces surrounding the special token, `decode(encode(text))` does not round-trip back to the original string (the sketch after the assertions below illustrates this).
```python
assert bpe_tokenizer.encode(text).tokens == ['A', 'Ġ', '<special>', 'Ġ', 't', 'o', 'k', 'e', 'n', '.']
assert gpt2_tokenizer.tokenize(text) == ['A', 'Ġ', '<special>', 'Ġ', 't', 'o', 'k', 'e', 'n', '.']  # fails
```
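A minimal round-trip check illustrating the lost spaces (a sketch; the failure is inferred from the tokenization shown above, which drops the `Ġ` markers around the special token):

```python
text = 'A <special> token.'

# Encode to ids with the slow GPT-2 tokenizer, then decode back.
# Because the spaces (`Ġ`) around '<special>' are dropped during
# tokenization, the decoded string no longer matches the input.
roundtrip = gpt2_tokenizer.decode(gpt2_tokenizer.encode(text))
assert roundtrip == text  # fails: spaces around '<special>' are lost
```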
It is possible that I misunderstand the `GPT2Tokenizer` API. Please advise if I should pass `special_tokens` in a different way. Thank you in advance.