Work on the BPE tokenizer #3252
Changes from all commits: bfaab6f, 89e74c6, 7770423, 37cf135, 91a527a, 208d3d7, 407f76d, 311fcf1, c85cb29, 048e659, c0990bb, 1b7c369, a4e9448, 17ca832, 4abbfb5, 59a30b7, a6070b7, 16c06fe, c09330e, 9cfb714, 607e3bf, fad8a77, 6a16c36, a2ddaad, 3fa8c55, d6d7d0f, 37af613, 2117e23, 28778f8, a9a2af9, dccd1db, 02b9ccf, 3d162cc, 5aee498, 3e518e2
```diff
@@ -20,28 +20,6 @@
 import gguf
 
 
-def bytes_to_unicode():
-    # ref: https://github.com/openai/gpt-2/blob/master/src/encoder.py
-    """
-    Returns list of utf-8 byte and a corresponding list of unicode strings.
-    The reversible bpe codes work on unicode strings.
-    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
-    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
-    This is a significant percentage of your normal, say, 32K bpe vocab.
-    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
-    And avoids mapping to whitespace/control characters the bpe code barfs on.
-    """
-    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
-    cs = bs[:]
-    n = 0
-    for b in range(2**8):
-        if b not in bs:
-            bs.append(b)
-            cs.append(2**8+n)
-            n += 1
-    return dict(zip(bs, (chr(n) for n in cs)))
-
-
 def count_model_parts(dir_model: Path) -> int:
     num_parts = 0
     for filename in os.listdir(dir_model):
```
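For context on the helper being deleted here: it builds the GPT-2 byte-to-unicode table, mapping every possible byte value to a printable unicode character so that byte-level BPE pieces can be stored as plain strings. A minimal round-trip sketch, not part of the diff; the piece "Ġhello" and the variable names are only illustrative:

```python
# Illustration only: round-trip a byte-level BPE piece through the GPT-2 mapping.
# Assumes bytes_to_unicode() as defined in the removed code above.
byte_encoder = bytes_to_unicode()                       # byte value -> printable unicode char
byte_decoder = {v: k for k, v in byte_encoder.items()}  # printable unicode char -> byte value

piece = "Ġhello"                             # "Ġ" is how the table spells the space byte 0x20
raw = bytes(byte_decoder[c] for c in piece)  # b' hello': the real bytes behind the piece
assert "".join(byte_encoder[b] for b in raw) == piece   # the mapping is exactly reversible
```

This is the mapping the old conversion loop reversed with `byte_decoder` below; the PR drops it and stores the tokenizer's piece strings directly.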
```diff
@@ -133,6 +111,8 @@ def parse_args() -> argparse.Namespace:
 print("gguf: get tokenizer metadata")
 
 tokens: list[bytearray] = []
+scores: list[float] = []
+toktypes: list[int] = []
 
 tokenizer_json_file = dir_model / 'tokenizer.json'
 if not tokenizer_json_file.is_file():
```
```diff
@@ -155,28 +135,15 @@ def parse_args() -> argparse.Namespace:
 tokenizer = AutoTokenizer.from_pretrained(dir_model)
 
 reverse_vocab = {id: encoded_tok for encoded_tok, id in tokenizer.vocab.items()}
-byte_encoder = bytes_to_unicode()
-byte_decoder = {v: k for k, v in byte_encoder.items()}
 
 for i in range(vocab_size):
-    if i in reverse_vocab:
-        try:
-            text = bytearray([byte_decoder[c] for c in reverse_vocab[i]])
-        except KeyError:
-            text = bytearray()
-            for c in reverse_vocab[i]:
-                if ord(c) < 256:  # single byte character
-                    text.append(byte_decoder[ord(c)])
-                else:  # multibyte special token character
-                    text.extend(c.encode('utf-8'))
-    else:
-        print(f"Key {i} not in tokenizer vocabulary. Padding with an arbitrary token.")
-        pad_token = f"[PAD{i}]".encode("utf8")
-        text = bytearray(pad_token)
-
-    tokens.append(text)
+    tokens.append(reverse_vocab[i])
+    scores.append(0.0)  # dummy
+    toktypes.append(gguf.TokenType.NORMAL)
```
Comments on lines +140 to +142:

- This doesn't work as-is, because scores and toktypes were removed from this file in a previous PR. Also, won't this throw KeyError now if the model's vocab_size is larger than reverse_vocab?
- And in order to satisfy the needs specified in #3405, we will need to at least provide a way (via toktypes) to differentiate between added tokens […]
- Until now I've only thought about the one-to-one correspondence between piece and token-id in […]
- I wrote about my current understanding of added tokens here. In short: all added tokens looked like CONTROL tokens to me for now (maybe because I have to see […])
- One more remark: as far as I remember, scores are needed for BPE merges. Token types NORMAL and CONTROL are used in this PR.
- The scores and toktypes were only removed because they were placeholder values, and the GGUF spec says they are optional. If they are used for something meaningful in the future they can certainly be added back. Should these lines be removed, if they do not apply to Falcon? They necessarily imply that […]

  ```python
  # The number of tokens in tokenizer.json can differ from the expected vocab size.
  # This causes downstream issues with mismatched tensor sizes when running the inference
  vocab_size = hparams["vocab_size"] if "vocab_size" in hparams else len(tokenizer_json["model"]["vocab"])
  ```

- I removed unused code from the conversion scripts but would not touch that code until we agree that […]
- Oops, missed the impact of this completely when merging […]
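To make the quoted fallback concrete: `hparams["vocab_size"]` comes from the model's config.json, while `tokenizer_json["model"]["vocab"]` is the BPE vocabulary in tokenizer.json, and the two counts can disagree. A small sketch of checking for that mismatch, under the usual Hugging Face file layout; the directory path is hypothetical:

```python
import json
from pathlib import Path

dir_model = Path("models/my-model")  # hypothetical model directory

hparams        = json.loads((dir_model / "config.json").read_text(encoding="utf-8"))
tokenizer_json = json.loads((dir_model / "tokenizer.json").read_text(encoding="utf-8"))

# Same fallback as the quoted lines: prefer config.json, else count the BPE vocab entries.
vocab_size = hparams["vocab_size"] if "vocab_size" in hparams else len(tokenizer_json["model"]["vocab"])

n_vocab_json = len(tokenizer_json["model"]["vocab"])
if vocab_size != n_vocab_json:
    # The conversion loop runs over range(vocab_size), so any ids the tokenizer
    # does not define have to be padded (or the loop raises KeyError).
    print(f"vocab_size {vocab_size} != {n_vocab_json} tokens in tokenizer.json")
```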
```diff
 
 gguf_writer.add_token_list(tokens)
+gguf_writer.add_token_scores(scores)
+gguf_writer.add_token_types(toktypes)
 
 special_vocab = gguf.SpecialVocab(dir_model, load_merges = True)
 special_vocab.add_to_gguf(gguf_writer)
```
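As the review thread points out, `tokens.append(reverse_vocab[i])` raises `KeyError` whenever `vocab_size` exceeds the number of ids the tokenizer actually defines, and #3405 asks for a way to tell added tokens apart via toktypes. Below is a hedged sketch of one possible shape for the loop, not the PR's final implementation; `get_added_vocab()` is the Hugging Face tokenizer accessor, treating added tokens as `CONTROL` follows the opinion voiced in the thread, and using `UNUSED` for padding slots is purely an assumption:

```python
# Sketch only. Assumes tokens, scores, toktypes, reverse_vocab, vocab_size,
# tokenizer (a Hugging Face AutoTokenizer) and gguf are set up as in the diff.
added_vocab = tokenizer.get_added_vocab()  # str -> id for tokens added on top of the base BPE vocab

for i in range(vocab_size):
    if i not in reverse_vocab:
        # Same fallback the removed code used: invent a placeholder for the hole.
        print(f"Key {i} not in tokenizer vocabulary. Padding with an arbitrary token.")
        tokens.append(f"[PAD{i}]".encode("utf-8"))
        scores.append(0.0)
        toktypes.append(gguf.TokenType.UNUSED)    # assumption: mark pure padding as UNUSED
    elif reverse_vocab[i] in added_vocab:
        tokens.append(reverse_vocab[i])
        scores.append(0.0)
        toktypes.append(gguf.TokenType.CONTROL)   # assumption: treat added tokens as CONTROL
    else:
        tokens.append(reverse_vocab[i])
        scores.append(0.0)                        # dummy, as in the diff
        toktypes.append(gguf.TokenType.NORMAL)
```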