
Bug: tokenizer is missing merges section when converting using convert_hf_to_gguf.py #9309

@a8nova

What happened?

I ran into this issue while working on a PR on Hugging Face where I was adding GGUF support for the Phi-3 model.

When using gguf-my-repo (or convert_hf_to_gguf.py) to convert from Hugging Face to GGUF, the merges section is missing from the resulting GGUF file.

Below is an already-converted TinyLlama-1.1B-Chat-v1.0-GGUF; you can see there is a merges section in the GGUF tokenizer:

(screenshot: older_tinyllama_has_merges)

Here is a TinyLlama I converted a few days ago via gguf-my-repo, and it is missing merges from the tokenizer:

(screenshot: missing_merges)

I was able to check out llama.cpp and reproduce via:

python3.10 ./convert_hf_to_gguf.py TinyLlama-1.1B-Chat-v1.0 --outtype f16 --outfile TinyLlama-1.1B-Chat-v1.0-fp16.gguf
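
To confirm whether merges made it into a converted file, here is a minimal sketch using the gguf-py package that ships with llama.cpp (the file name is the output of the command above); it lists the tokenizer.* metadata keys and checks for tokenizer.ggml.merges:

    # Minimal sketch: list the tokenizer metadata keys of a converted GGUF file
    # and check whether the merges section made it in.
    from gguf import GGUFReader

    reader = GGUFReader("TinyLlama-1.1B-Chat-v1.0-fp16.gguf")
    for name in reader.fields:
        if name.startswith("tokenizer."):
            print(name)
    print("merges present:", "tokenizer.ggml.merges" in reader.fields)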

I am not familiar with the conversion script, but I investigated and I think I understand the issue, and I also have a fix:

  • Case where tokenizer.model is present:
    This bug can happen for any model class that calls _set_vocab_sentencepiece(). When a tokenizer.model is present, _create_vocab_sentencepiece() never throws an exception, and back in _set_vocab_sentencepiece() load_merges is not passed as True when the SpecialVocab is constructed, so this is one place the fix would go (see the sketch after this list).

  • Case where tokenizer.model is not present and tokenizer.json is present:
    This happens for Llama-family models only when _set_vocab_llama_hf() is invoked: self._set_vocab_sentencepiece(), which is wrapped in a try/except inside the LlamaModel class, fails (as it does in my case, since there is no tokenizer.model file for the Llama model or Phi-3, only a tokenizer.json), and the fallback builds the SpecialVocab without merges. For this case we can fix it in convert_hf_to_gguf.py#L806 by passing load_merges=True on that line, like:

special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True, n_vocab=len(tokens))
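
For the first case, here is a minimal sketch of what the equivalent change could look like inside _set_vocab_sentencepiece(); the surrounding lines are paraphrased from the script rather than copied, so treat it as an approximation, not the exact upstream code:

    # Sketch of the proposed case-1 fix in Model._set_vocab_sentencepiece()
    # (surrounding lines paraphrased, not the exact upstream code)
    def _set_vocab_sentencepiece(self):
        tokens, scores, toktypes = self._create_vocab_sentencepiece()

        self.gguf_writer.add_tokenizer_model("llama")
        self.gguf_writer.add_token_list(tokens)
        self.gguf_writer.add_token_scores(scores)
        self.gguf_writer.add_token_types(toktypes)

        # Proposed change: pass load_merges=True so the merges from tokenizer.json
        # are written into the GGUF metadata (the default is False, which drops them).
        special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True, n_vocab=len(tokens))
        special_vocab.add_to_gguf(self.gguf_writer)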

If the above fixes make sense, I can create a PR!

Name and Version

version: 3660 (b69a480)
built with Apple clang version 15.0.0 (clang-1500.0.40.1) for arm64-apple-darwin23.1.0

What operating system are you seeing the problem on?

Mac

Relevant log output

No response
