How to use the gguf-split / Model sharding demo
              
              #6404
            
            Replies: 4 comments 20 replies
| Maybe I am wrong, but I couldn't make  | 
| On Windows you can compile llama.cpp by opening a VS native tools command prompt (i.e.   | 
| @dranger003 @phymbert may I ask how to compile gguf-split on MAC? (llamacpp) taozhiyu@603e5f4a42f1 llama.cpp-master % gguf-split | 
| Please also include more clear and specific instructions for --merge? | 
Context
Distributing and storing GGUFs is difficult for 70B+ models, especially in F16. Many issues can occur during file transfers.
Typically, GGUFs need to be transferred from Hugging Face to internal storage such as S3, MinIO, Git LFS, Nexus or Artifactory, then downloaded by the inference server and stored locally (or on a k8s PVC, for example).
Storage solutions and filesystems often handle large GGUF files poorly; for example, HF does not support files larger than 50GB.
Such limits also exist on Artifactory.
Solution
We recently introduced the gguf-split CLI and support for loading sharded GGUF models in llama.cpp:
Download a model
Convert to GGUF F16
python -u convert-hf-to-gguf.py \
    ~/.cache/huggingface/hub/models--keyfan--grok-1-hf/snapshots/64e7373053c1bc7994ce427827b78ec11c181b3e/ \
    --outfile grok-1-f16.gguf \
    --outtype f16
NOTE: Follow the llama.cpp build instructions to generate all tools/CLIs: make.
Quantize (optional)
Build model shards
Different sharding strategies are available:
--split-max-tensors 256
--split-max-size 48G
In this example, --split-max-tensors 256 produces 9 files with at most 256 tensors in each.
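To make the tensor-count strategy more concrete, here is a minimal Python sketch of how a --split-max-tensors limit could translate into a shard count and shard file names. It is only an illustration, not gguf-split's actual code: the tensor count (2100) and the grok-1-f16 prefix are made-up values, and the `<prefix>-00001-of-0000N.gguf` naming pattern is an assumption that you should check against the files your own run produces.

```python
import math


def plan_shards(n_tensors: int, max_tensors: int, prefix: str) -> list[tuple[str, int]]:
    """Return (file name, tensor count) for each shard under a max-tensors limit."""
    if n_tensors <= 0 or max_tensors <= 0:
        raise ValueError("n_tensors and max_tensors must be positive")

    # Every shard is filled up to the limit; the last one takes the remainder.
    n_shards = math.ceil(n_tensors / max_tensors)
    plan = []
    for i in range(n_shards):
        count = min(max_tensors, n_tensors - i * max_tensors)
        # Assumed naming pattern: <prefix>-<shard no>-of-<shard count>.gguf
        name = f"{prefix}-{i + 1:05d}-of-{n_shards:05d}.gguf"
        plan.append((name, count))
    return plan


if __name__ == "__main__":
    # Hypothetical model with 2100 tensors, split with a 256-tensor limit.
    for name, count in plan_shards(2100, 256, "grok-1-f16"):
        print(name, count)
```

A size-based limit such as --split-max-size would need per-tensor byte sizes instead of a simple count, so it is not covered by this sketch.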
You can then upload the sharded model to your HF Repo:
Files produced by gguf-split are valid GGUFs, so you can visualize them in HF.
Load sharded model
llama_load_model_from_file will detect the number of files and will load additional tensors from the rest of the files.
You may notice:
Load sharded model from a remote URL
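Whether the first shard is a local path or a remote URL, the remaining shards have to be located from it. The Python sketch below shows one way to derive the sibling shard names from the first one. It assumes the `<prefix>-00001-of-0000N.gguf` naming pattern and a URL without a query string; the example.com URL is hypothetical, and llama.cpp itself may rely on metadata stored in the first file rather than on the file name.

```python
import re

# Assumed shard naming pattern: <prefix>-<5-digit no>-of-<5-digit count>.gguf
SHARD_RE = re.compile(r"^(?P<prefix>.+)-(?P<no>\d{5})-of-(?P<count>\d{5})\.gguf$")


def sibling_shards(first_shard: str) -> list[str]:
    """List every shard path or URL implied by the name of one shard."""
    match = SHARD_RE.match(first_shard)
    if match is None:
        # Name does not follow the pattern: treat it as a single-file model.
        return [first_shard]

    prefix = match.group("prefix")
    count = int(match.group("count"))
    return [f"{prefix}-{i:05d}-of-{count:05d}.gguf" for i in range(1, count + 1)]


if __name__ == "__main__":
    # Hypothetical URL, used only to show the derived names.
    for url in sibling_shards("https://example.com/models/grok-1-f16-00001-of-00009.gguf"):
        print(url)
```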