
GPT2Tokenizer strips spaces surrounding special tokens #7901

@jantrienes

Description

Environment info

  • transformers version: 3.3.1
  • Platform: Darwin-19.6.0-x86_64-i386-64bit
  • Python version: 3.7.9
  • PyTorch version (GPU?): 1.6.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@mfuntowicz, this seems to be a tokenization-related issue, so I hope you are the right person to ping here.

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Minimal working example:

import os

from transformers import GPT2Tokenizer
from tokenizers import ByteLevelBPETokenizer

os.makedirs('models/', exist_ok=True)

# Write a tiny training file that contains the special token.
with open('train.txt', 'w') as f:
    f.write('Training data including a <special> token.')

special_tokens = ['<special>']

# Train a byte-level BPE tokenizer and save its vocab/merges to models/.
bpe_tokenizer = ByteLevelBPETokenizer()
bpe_tokenizer.train(
    files=['train.txt'],
    special_tokens=special_tokens,
)
bpe_tokenizer.save_model('models/')

# Load the trained files with GPT2Tokenizer and register the special token.
gpt2_tokenizer = GPT2Tokenizer.from_pretrained(
    'models/',
    additional_special_tokens=special_tokens,
)

When encoding the text below, the two tokenizers yield different outputs:

>>> text = 'A <special> token.'
>>> bpe_tokenizer.encode(text).tokens
['A', 'Ġ', '<special>', 'Ġ', 't', 'o', 'k', 'e', 'n', '.']
>>> gpt2_tokenizer.tokenize(text)
['A', '<special>', 't', 'o', 'k', 'e', 'n', '.'] # <----- Note the missing space (`Ġ`) around `<special>` 

Expected behavior

I would expect both tokenizers to produce the same output when encoding the sentence. Furthermore, because GPT2Tokenizer appears to strip the spaces surrounding the special token, decode(encode(text)) does not return the original string.

assert bpe_tokenizer.encode(text).tokens == ['A', 'Ġ', '<special>', 'Ġ', 't', 'o', 'k', 'e', 'n', '.']
assert gpt2_tokenizer.tokenize(text) == ['A', 'Ġ', '<special>', 'Ġ', 't', 'o', 'k', 'e', 'n', '.']
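
For completeness, a minimal round-trip check of the decode(encode()) point above, reusing gpt2_tokenizer and text from the snippets earlier (I am not asserting an exact decoded string, only that it differs from the input):

ids = gpt2_tokenizer.encode(text)
roundtrip = gpt2_tokenizer.decode(ids)

print(repr(text))       # 'A <special> token.'
print(repr(roundtrip))  # differs from the original: the spaces around '<special>' are dropped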

It is possible that I misunderstand the GPT2Tokenizer API. Please advise if I should pass special_tokens in a different way. Thank you in advance.
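
For reference, the only other way I am aware of to register the token is via add_special_tokens after loading; a minimal sketch of that variant (same models/ directory, special_tokens, and text as above), in case the registration path matters:

gpt2_tokenizer = GPT2Tokenizer.from_pretrained('models/')
# Register the special token after loading rather than via the constructor argument.
gpt2_tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})
print(gpt2_tokenizer.tokenize(text))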
