
GPT2Tokenizer strips spaces surrounding special tokens #7901

@jantrienes

Description

Environment info

  • transformers version: 3.3.1
  • Platform: Darwin-19.6.0-x86_64-i386-64bit
  • Python version: 3.7.9
  • PyTorch version (GPU?): 1.6.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@mfuntowicz, this seems to be a tokenization-related issue, so I hope you are the right person to ping here.

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Minimal working example:

import os

from transformers import GPT2Tokenizer
from tokenizers import ByteLevelBPETokenizer

os.makedirs('models/', exist_ok=True)

# Write a tiny training file that contains the special token.
with open('train.txt', 'w') as f:
    f.write('Training data including a <special> token.')

special_tokens = ['<special>']

# Train a byte-level BPE tokenizer and save its vocab/merges to models/.
bpe_tokenizer = ByteLevelBPETokenizer()
bpe_tokenizer.train(
    files=['train.txt'],
    special_tokens=special_tokens,
)
bpe_tokenizer.save_model('models/')

# Load the trained files with GPT2Tokenizer and register the special token.
gpt2_tokenizer = GPT2Tokenizer.from_pretrained(
    'models/',
    additional_special_tokens=special_tokens,
)

When encoding the text below, the two tokenizers yield different outputs:

>>> text = 'A <special> token.'
>>> bpe_tokenizer.encode(text).tokens
['A', 'Ġ', '<special>', 'Ġ', 't', 'o', 'k', 'e', 'n', '.']
>>> gpt2_tokenizer.tokenize(text)
['A', '<special>', 't', 'o', 'k', 'e', 'n', '.'] # <----- Note the missing space (`Ġ`) around `<special>` 

Expected behavior

I would expect both tokenizers to produce the same output when encoding the sentence. Furthermore, because GPT2Tokenizer appears to strip the spaces surrounding the special token, decode(encode(text)) does not return the original string.

assert bpe_tokenizer.encode(text).tokens == ['A', 'Ġ', '<special>', 'Ġ', 't', 'o', 'k', 'e', 'n', '.']
assert gpt2_tokenizer.tokenize(text) == ['A', 'Ġ', '<special>', 'Ġ', 't', 'o', 'k', 'e', 'n', '.']
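
For completeness, a minimal round-trip check of the decode(encode()) point above, reusing gpt2_tokenizer and text from the snippets earlier (I am not asserting an exact decoded string, only that it differs from the input):

ids = gpt2_tokenizer.encode(text)
roundtrip = gpt2_tokenizer.decode(ids)

print(repr(text))       # 'A <special> token.'
print(repr(roundtrip))  # differs from the original: the spaces around '<special>' are dropped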

It is possible that I misunderstand the GPT2Tokenizer API. Please advise if I should pass special_tokens in a different way. Thank you in advance.
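
For reference, the only other way I am aware of to register the token is via add_special_tokens after loading; a minimal sketch of that variant (same models/ directory, special_tokens, and text as above), in case the registration path matters:

gpt2_tokenizer = GPT2Tokenizer.from_pretrained('models/')
# Register the special token after loading rather than via the constructor argument.
gpt2_tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})
print(gpt2_tokenizer.tokenize(text))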
