
Bug: tokenizer is missing merges section when converting using convert_hf_to_gguf.py #9309

@a8nova

What happened?

I ran into this issue while working on a PR on Hugging Face where I was adding GGUF support for the Phi-3 model.

When using gguf-my-repo (or convert_hf_to_gguf.py) to convert from Hugging Face to GGUF, the merges section is missing from the resulting GGUF file.

Below is an already-converted TinyLlama-1.1B-Chat-v1.0-GGUF; you can see there is a merges section in the GGUF tokenizer:

(screenshot: older_tinyllama_has_merges)

Here is a TinyLlama I converted a few days ago via gguf-my-repo, and it is missing merges from the tokenizer:

(screenshot: missing_merges)

I was able to check out llama.cpp and reproduce via:

python3.10 ./convert_hf_to_gguf.py TinyLlama-1.1B-Chat-v1.0 --outtype f16 --outfile TinyLlama-1.1B-Chat-v1.0-fp16.gguf
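
To confirm whether merges made it into a converted file, here is a minimal sketch using the gguf-py package that ships with llama.cpp (the file name is the output of the command above); it lists the tokenizer.* metadata keys and checks for tokenizer.ggml.merges:

    # Minimal sketch: list the tokenizer metadata keys of a converted GGUF file
    # and check whether the merges section made it in.
    from gguf import GGUFReader

    reader = GGUFReader("TinyLlama-1.1B-Chat-v1.0-fp16.gguf")
    for name in reader.fields:
        if name.startswith("tokenizer."):
            print(name)
    print("merges present:", "tokenizer.ggml.merges" in reader.fields)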

I am not familiar with the conversion script, but I investigated and I think I understand the issue, and I also have a fix:

  • Case where tokenizer.model is present:
    This bug can happen for any model class that calls _set_vocab_sentencepiece(). When a tokenizer.model is present, _create_vocab_sentencepiece() never throws an exception, and back in _set_vocab_sentencepiece() load_merges is not passed as True when the SpecialVocab is constructed, so this is one place the fix would go (see the sketch after this list).

  • Case where tokenizer.model is not present and tokenizer.json is present:
    This happens for Llama-family models only when _set_vocab_llama_hf() is invoked: self._set_vocab_sentencepiece(), which is wrapped in a try/except inside the LlamaModel class, fails (as it does in my case, since there is no tokenizer.model file for the Llama model or Phi-3, only a tokenizer.json), and the fallback builds the SpecialVocab without merges. For this case we can fix it in convert_hf_to_gguf.py#L806 by passing load_merges=True on that line, like:

special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True, n_vocab=len(tokens))
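
For the first case, here is a minimal sketch of what the equivalent change could look like inside _set_vocab_sentencepiece(); the surrounding lines are paraphrased from the script rather than copied, so treat it as an approximation, not the exact upstream code:

    # Sketch of the proposed case-1 fix in Model._set_vocab_sentencepiece()
    # (surrounding lines paraphrased, not the exact upstream code)
    def _set_vocab_sentencepiece(self):
        tokens, scores, toktypes = self._create_vocab_sentencepiece()

        self.gguf_writer.add_tokenizer_model("llama")
        self.gguf_writer.add_token_list(tokens)
        self.gguf_writer.add_token_scores(scores)
        self.gguf_writer.add_token_types(toktypes)

        # Proposed change: pass load_merges=True so the merges from tokenizer.json
        # are written into the GGUF metadata (the default is False, which drops them).
        special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True, n_vocab=len(tokens))
        special_vocab.add_to_gguf(self.gguf_writer)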

If the above fixes make sense, I can create a PR!

Name and Version

version: 3660 (b69a480)
built with Apple clang version 15.0.0 (clang-1500.0.40.1) for arm64-apple-darwin23.1.0

What operating system are you seeing the problem on?

Mac

Relevant log output

No response
