System Info
- transformers version: 4.36.2
- Platform: Linux-6.2.0-25-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.20.1
- Safetensors version: 0.4.1
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): not installed (NA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Load a non-fast Tokenizer for mBART
- Add an additional special token to it
- Encode and then decode an input containing the previously added special token
from transformers import MBart50Tokenizer
tk = MBart50Tokenizer.from_pretrained('facebook/mbart-large-50')
tk.add_tokens('<token>', special_tokens=True)
print(tk.decode(tk("This is my example sentence with a special <token> token")["input_ids"]))
>>> 'en_XXThis is my example sentence with a special <token> token</s>'
This differs from the fast tokenizer's decoding, which correctly produces a space after en_XX. I believe this is due to the handling of legacy_added_tokens in transformers/src/transformers/tokenization_utils.py, lines 1002 to 1022 at commit 3cefac1:
legacy_added_tokens = set(self._added_tokens_encoder.keys()) - set(self.all_special_tokens) | {
    token for token in self.additional_special_tokens if self.convert_tokens_to_ids(token) >= self.vocab_size
}
# To avoid mixing byte-level and unicode for byte-level BPT
# we need to build string separately for added tokens and byte-level tokens
# cf. https://github.com/huggingface/transformers/issues/1133
sub_texts = []
current_sub_text = []
# TODO @ArthurZ in version 5, special tokens should be handled in convert_tokens_to_string, while _convert_tokens_to_string
for token in filtered_tokens:
    if skip_special_tokens and token in self.all_special_ids:
        continue
    if token in legacy_added_tokens:
        if current_sub_text:
            string = self.convert_tokens_to_string(current_sub_text)
            if len(string) > 0:
                sub_texts.append(string)
            current_sub_text = []
        sub_texts.append(token)
    else:
        current_sub_text.append(token)
More specifically, I suspect the second part of the set definition for legacy_added_tokens, which seems to account for special tokens that have been added manually after loading. When I disable the special handling for legacy_added_tokens, the decoded output is correct, so I was primarily wondering why this handling was added and whether removing it would break other tokenizers.
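For reference, here is a minimal diagnostic sketch (not from the original report) that rebuilds the same set outside of _decode to confirm where the manually added token ends up. It relies on the private _added_tokens_encoder attribute as it exists in 4.36, so treat it as an inspection aid rather than supported API:

from transformers import MBart50Tokenizer

tk = MBart50Tokenizer.from_pretrained('facebook/mbart-large-50')
tk.add_tokens('<token>', special_tokens=True)

# Rebuild the set the same way _decode does; if the analysis above is right,
# '<token>' should land in it, which explains the missing space on decode.
legacy_added_tokens = set(tk._added_tokens_encoder.keys()) - set(tk.all_special_tokens) | {
    token for token in tk.additional_special_tokens
    if tk.convert_tokens_to_ids(token) >= tk.vocab_size
}
print('<token>' in legacy_added_tokens)  # expected: True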
Expected behavior
from transformers import MBart50TokenizerFast
fast_tk = MBart50TokenizerFast.from_pretrained('facebook/mbart-large-50')
fast_tk.add_tokens('<token>', special_tokens=True)
print(fast_tk.decode(fast_tk("This is my example sentence with a special <token> token")["input_ids"]))
>>> 'en_XX This is my example sentence with a special <token> token</s>'
The slow tokenizer's decoding should match the fast tokenizer's output above; at least, that is what I would assume.
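For completeness, a small side-by-side sketch combining the two snippets above; the output comments reflect the results reported in this issue:

from transformers import MBart50Tokenizer, MBart50TokenizerFast

text = "This is my example sentence with a special <token> token"

slow_tk = MBart50Tokenizer.from_pretrained('facebook/mbart-large-50')
fast_tk = MBart50TokenizerFast.from_pretrained('facebook/mbart-large-50')
for tk in (slow_tk, fast_tk):
    tk.add_tokens('<token>', special_tokens=True)
    print(type(tk).__name__, repr(tk.decode(tk(text)["input_ids"])))

# MBart50Tokenizer      'en_XXThis is my example sentence with a special <token> token</s>'
# MBart50TokenizerFast  'en_XX This is my example sentence with a special <token> token</s>'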