System Info
- transformers version: 4.36.2
- Platform: Linux-6.2.0-25-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.20.1
- Safetensors version: 0.4.1
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): not installed (NA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Load a non-fast Tokenizer for mBART
- Add an additional special token to it
- Encode and then decode an input containing the previously added special token
from transformers import MBart50Tokenizer
tk = MBart50Tokenizer.from_pretrained('facebook/mbart-large-50')
tk.add_tokens('<token>', special_tokens=True)
print(tk.decode(tk("This is my example sentence with a special <token> token")["input_ids"]))
>>> 'en_XXThis is my example sentence with a special <token> token</s>'
This differs from the fast tokenizer's decoding, which correctly produces a space after en_XX. I believe this is due to the handling of legacy_added_tokens in transformers/src/transformers/tokenization_utils.py, lines 1002 to 1022 at commit 3cefac1:
legacy_added_tokens = set(self._added_tokens_encoder.keys()) - set(self.all_special_tokens) | {
    token for token in self.additional_special_tokens if self.convert_tokens_to_ids(token) >= self.vocab_size
}
# To avoid mixing byte-level and unicode for byte-level BPT
# we need to build string separately for added tokens and byte-level tokens
# cf. https://github.com/huggingface/transformers/issues/1133
sub_texts = []
current_sub_text = []
# TODO @ArthurZ in version 5, special tokens should be handled in convert_tokens_to_string, while _convert_tokens_to_string
for token in filtered_tokens:
    if skip_special_tokens and token in self.all_special_ids:
        continue
    if token in legacy_added_tokens:
        if current_sub_text:
            string = self.convert_tokens_to_string(current_sub_text)
            if len(string) > 0:
                sub_texts.append(string)
            current_sub_text = []
        sub_texts.append(token)
    else:
        current_sub_text.append(token)
More specifically, I suspect the second part of the set definition for legacy_added_tokens, which seems to account for special tokens that have been added manually after loading. When I disable the special handling for legacy_added_tokens, the decoded output is correct, so I was primarily wondering why this handling was added and whether removing it would break other tokenizers.
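For reference, here is a minimal diagnostic sketch (not from the original report) that rebuilds the same set outside of _decode to confirm where the manually added token ends up. It relies on the private _added_tokens_encoder attribute as it exists in 4.36, so treat it as an inspection aid rather than supported API:

from transformers import MBart50Tokenizer

tk = MBart50Tokenizer.from_pretrained('facebook/mbart-large-50')
tk.add_tokens('<token>', special_tokens=True)

# Rebuild the set the same way _decode does; if the analysis above is right,
# '<token>' should land in it, which explains the missing space on decode.
legacy_added_tokens = set(tk._added_tokens_encoder.keys()) - set(tk.all_special_tokens) | {
    token for token in tk.additional_special_tokens
    if tk.convert_tokens_to_ids(token) >= tk.vocab_size
}
print('<token>' in legacy_added_tokens)  # expected: True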
Expected behavior
from transformers import MBart50TokenizerFast
fast_tk = MBart50TokenizerFast.from_pretrained('facebook/mbart-large-50')
fast_tk.add_tokens('<token>', special_tokens=True)
print(fast_tk.decode(fast_tk("This is my example sentence with a special <token> token")["input_ids"]))
>>> 'en_XX This is my example sentence with a special <token> token</s>'
The slow tokenizer's decoding should match the fast tokenizer's output above; at least, that is what I would assume.
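For completeness, a small side-by-side sketch combining the two snippets above; the output comments reflect the results reported in this issue:

from transformers import MBart50Tokenizer, MBart50TokenizerFast

text = "This is my example sentence with a special <token> token"

slow_tk = MBart50Tokenizer.from_pretrained('facebook/mbart-large-50')
fast_tk = MBart50TokenizerFast.from_pretrained('facebook/mbart-large-50')
for tk in (slow_tk, fast_tk):
    tk.add_tokens('<token>', special_tokens=True)
    print(type(tk).__name__, repr(tk.decode(tk(text)["input_ids"])))

# MBart50Tokenizer      'en_XXThis is my example sentence with a special <token> token</s>'
# MBart50TokenizerFast  'en_XX This is my example sentence with a special <token> token</s>'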