tools/server/README.md (25 additions, 16 deletions)
@@ -30,9 +30,10 @@ The project is under active development, and we are [looking for feedback and co
 | -------- | ----------- |
 |`-h, --help, --usage`| print usage and exit |
 |`--version`| show version and build info |
+|`-cl, --cache-list`| show list of models in cache |
 |`--completion-bash`| print source-able bash completion script for llama.cpp |
 |`--verbose-prompt`| print a verbose prompt before generation (default: false) |
-|`-t, --threads N`| number of threads to use during generation (default: -1)<br/>(env: LLAMA_ARG_THREADS) |
+|`-t, --threads N`| number of CPU threads to use during generation (default: -1)<br/>(env: LLAMA_ARG_THREADS) |
 |`-tb, --threads-batch N`| number of threads to use during batch and prompt processing (default: same as --threads) |
 |`-C, --cpu-mask M`| CPU affinity mask: arbitrarily long hex. Complements cpu-range (default: "") |
 |`-Cr, --cpu-range lo-hi`| range of CPUs for affinity. Complements --cpu-mask |
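For orientation, a minimal invocation using a couple of the flags above — a sketch only, assuming the server binary is invoked as `llama-server` and using an illustrative model path:

```sh
# list models already present in the local cache
llama-server --cache-list

# run with 8 CPU threads for generation and 16 for batch/prompt processing
llama-server -m models/7B/ggml-model-f16.gguf -t 8 -tb 16
```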
@@ -51,7 +52,7 @@ The project is under active development, and we are [looking for feedback and co
 |`--keep N`| number of tokens to keep from the initial prompt (default: 0, -1 = all) |
 |`--swa-full`| use full-size SWA cache (default: false)<br/>[(more info)](https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)<br/>(env: LLAMA_ARG_SWA_FULL) |
 |`--kv-unified, -kvu`| use single unified KV buffer for the KV cache of all sequences (default: false)<br/>[(more info)](https://github.com/ggml-org/llama.cpp/pull/14363)<br/>(env: LLAMA_ARG_KV_SPLIT) |
@@ -78,7 +80,7 @@ The project is under active development, and we are [looking for feedback and co
 |`--override-tensor, -ot <tensor name pattern>=<buffer type>,...`| override tensor buffer type |
 |`--cpu-moe, -cmoe`| keep all Mixture of Experts (MoE) weights in the CPU<br/>(env: LLAMA_ARG_CPU_MOE) |
 |`--n-cpu-moe, -ncmoe N`| keep the Mixture of Experts (MoE) weights of the first N layers in the CPU<br/>(env: LLAMA_ARG_N_CPU_MOE) |
-|`-ngl, --gpu-layers, --n-gpu-layers N`| number of layers to store in VRAM<br/>(env: LLAMA_ARG_N_GPU_LAYERS) |
+|`-ngl, --gpu-layers, --n-gpu-layers N`| max. number of layers to store in VRAM (default: -1)<br/>(env: LLAMA_ARG_N_GPU_LAYERS) |
 |`-sm, --split-mode {none,layer,row}`| how to split the model across multiple GPUs, one of:<br/>- none: use one GPU only<br/>- layer (default): split layers and KV across GPUs<br/>- row: split rows across GPUs<br/>(env: LLAMA_ARG_SPLIT_MODE) |
 |`-ts, --tensor-split N0,N1,N2,...`| fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1<br/>(env: LLAMA_ARG_TENSOR_SPLIT) |
 |`-mg, --main-gpu INDEX`| the GPU to use for the model (with split-mode = none), or for intermediate results and KV (with split-mode = row) (default: 0)<br/>(env: LLAMA_ARG_MAIN_GPU) |
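A hedged example combining the GPU offload flags above (binary name and model path are assumptions for illustration; the split ratio comes from the table's own example):

```sh
# offload up to 99 layers to VRAM, split layers 3:1 across two GPUs, GPU 0 primary
llama-server -m models/7B/ggml-model-f16.gguf -ngl 99 -sm layer -ts 3,1 -mg 0
```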
@@ -92,6 +94,7 @@ The project is under active development, and we are [looking for feedback and co
 |`--control-vector-layer-range START END`| layer range to apply the control vector(s) to, start and end inclusive |
 |`-m, --model FNAME`| model path (default: `models/$filename` with filename from `--hf-file` or `--model-url` if set, otherwise models/7B/ggml-model-f16.gguf)<br/>(env: LLAMA_ARG_MODEL) |
 |`-mu, --model-url MODEL_URL`| model download url (default: unused)<br/>(env: LLAMA_ARG_MODEL_URL) |
+|`-dr, --docker-repo [<repo>/]<model>[:quant]`| Docker Hub model repository. repo is optional, default to ai/. quant is optional, default to :latest.<br/>example: gemma3<br/>(default: unused)<br/>(env: LLAMA_ARG_DOCKER_REPO) |
 |`-hf, -hfr, --hf-repo <user>/<model>[:quant]`| Hugging Face model repository; quant is optional, case-insensitive, default to Q4_K_M, or falls back to the first file in the repo if Q4_K_M doesn't exist.<br/>mmproj is also downloaded automatically if available. to disable, add --no-mmproj<br/>example: unsloth/phi-4-GGUF:q4_k_m<br/>(default: unused)<br/>(env: LLAMA_ARG_HF_REPO) |
 |`-hfd, -hfrd, --hf-repo-draft <user>/<model>[:quant]`| Same as --hf-repo, but for the draft model (default: unused)<br/>(env: LLAMA_ARG_HFD_REPO) |
 |`-hff, --hf-file FILE`| Hugging Face model file. If specified, it will override the quant in --hf-repo (default: unused)<br/>(env: LLAMA_ARG_HF_FILE) |
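A sketch of the download flags, reusing the examples the table itself gives (binary name assumed):

```sh
# fetch a quantized model from Hugging Face (quant tag optional, Q4_K_M by default)
llama-server -hf unsloth/phi-4-GGUF:q4_k_m

# or pull from Docker Hub (repo defaults to ai/, quant tag defaults to :latest)
llama-server -dr gemma3
```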
@@ -100,7 +103,7 @@ The project is under active development, and we are [looking for feedback and co
 |`-hft, --hf-token TOKEN`| Hugging Face access token (default: value from HF_TOKEN environment variable)<br/>(env: HF_TOKEN) |
+|`--log-colors [on\|off\|auto]`| Set colored logging ('on', 'off', or 'auto', default: 'auto')<br/>'auto' enables colors when output is to a terminal<br/>(env: LLAMA_LOG_COLORS) |
 |`-v, --verbose, --log-verbose`| Set verbosity level to infinity (i.e. log all messages, useful for debugging) |
 |`--offline`| Offline mode: forces use of cache, prevents network access<br/>(env: LLAMA_OFFLINE) |
 |`-lv, --verbosity, --log-verbosity N`| Set the verbosity threshold. Messages with a higher verbosity will be ignored.<br/>(env: LLAMA_LOG_VERBOSITY) |
@@ -151,7 +154,8 @@ The project is under active development, and we are [looking for feedback and co
 
 | Argument | Explanation |
 | -------- | ----------- |
-|`--swa-checkpoints N`| max number of SWA checkpoints per slot to create (default: 3)<br/>[(more info)](https://github.com/ggml-org/llama.cpp/pull/15293)<br/>(env: LLAMA_ARG_SWA_CHECKPOINTS) |
+|`--ctx-checkpoints, --swa-checkpoints N`| max number of context checkpoints to create per slot (default: 8)<br/>[(more info)](https://github.com/ggml-org/llama.cpp/pull/15293)<br/>(env: LLAMA_ARG_CTX_CHECKPOINTS) |
+|`--cache-ram, -cram N`| set the maximum cache size in MiB (default: 8192, -1 - no limit, 0 - disable)<br/>[(more info)](https://github.com/ggml-org/llama.cpp/pull/16391)<br/>(env: LLAMA_ARG_CACHE_RAM) |
 |`--no-context-shift`| disables context shift on infinite text generation (default: enabled)<br/>(env: LLAMA_ARG_NO_CONTEXT_SHIFT) |
 |`--context-shift`| enables context shift on infinite text generation (default: disabled)<br/>(env: LLAMA_ARG_CONTEXT_SHIFT) |
 |`-r, --reverse-prompt PROMPT`| halt generation at PROMPT, return control in interactive mode<br/> |
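A small sketch of how the new checkpoint and cache limits might be set together (values are illustrative, not recommendations; binary name and model path assumed):

```sh
# allow up to 4 context checkpoints per slot and cap the prompt cache at 2 GiB
llama-server -m models/7B/ggml-model-f16.gguf --ctx-checkpoints 4 --cache-ram 2048
```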
@@ -165,6 +169,8 @@ The project is under active development, and we are [looking for feedback and co
 |`--mmproj-url URL`| URL to a multimodal projector file. see tools/mtmd/README.md<br/>(env: LLAMA_ARG_MMPROJ_URL) |
 |`--no-mmproj`| explicitly disable multimodal projector, useful when using -hf<br/>(env: LLAMA_ARG_NO_MMPROJ) |
 |`--no-mmproj-offload`| do not offload multimodal projector to GPU<br/>(env: LLAMA_ARG_NO_MMPROJ_OFFLOAD) |
+|`--image-min-tokens N`| minimum number of tokens each image can take, only used by vision models with dynamic resolution (default: read from model)<br/>(env: LLAMA_ARG_IMAGE_MIN_TOKENS) |
+|`--image-max-tokens N`| maximum number of tokens each image can take, only used by vision models with dynamic resolution (default: read from model)<br/>(env: LLAMA_ARG_IMAGE_MAX_TOKENS) |
 |`--override-tensor-draft, -otd <tensor name pattern>=<buffer type>,...`| override tensor buffer type for draft model |
 |`--cpu-moe-draft, -cmoed`| keep all Mixture of Experts (MoE) weights in the CPU for the draft model<br/>(env: LLAMA_ARG_CPU_MOE_DRAFT) |
 |`--n-cpu-moe-draft, -ncmoed N`| keep the Mixture of Experts (MoE) weights of the first N layers in the CPU for the draft model<br/>(env: LLAMA_ARG_N_CPU_MOE_DRAFT) |
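For the new image-token limits, a hedged sketch (the repository name is a placeholder; this only applies to vision models with dynamic resolution):

```sh
# cap each image at 1024 tokens and keep the multimodal projector off the GPU
llama-server -hf <user>/<vision-model>:Q4_K_M --image-max-tokens 1024 --no-mmproj-offload
```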
@@ -189,13 +195,14 @@ The project is under active development, and we are [looking for feedback and co
 |`--slot-save-path PATH`| path to save slot kv cache (default: disabled) |
-|`--jinja`| use jinja template for chat (default: disabled)<br/>(env: LLAMA_ARG_JINJA) |
-|`--reasoning-format FORMAT`| controls whether thought tags are allowed and/or extracted from the response, and in which format they're returned; one of:<br/>- none: leaves thoughts unparsed in `message.content`<br/>- deepseek: puts thoughts in `message.reasoning_content`<br/>- deepseek-legacy: keeps `<think>` tags in `message.content` while also populating `message.reasoning_content`<br/>(default: deepseek)<br/>(env: LLAMA_ARG_THINK) |
+|`--jinja`| use jinja template for chat (default: enabled)<br/><br/>(env: LLAMA_ARG_JINJA) |
+|`--no-jinja`| disable jinja template for chat (default: enabled)<br/><br/>(env: LLAMA_ARG_NO_JINJA) |
+|`--reasoning-format FORMAT`| controls whether thought tags are allowed and/or extracted from the response, and in which format they're returned; one of:<br/>- none: leaves thoughts unparsed in `message.content`<br/>- deepseek: puts thoughts in `message.reasoning_content`<br/>- deepseek-legacy: keeps `<think>` tags in `message.content` while also populating `message.reasoning_content`<br/>(default: auto)<br/>(env: LLAMA_ARG_THINK) |
 |`--reasoning-budget N`| controls the amount of thinking allowed; currently only one of: -1 for unrestricted thinking budget, or 0 to disable thinking (default: -1)<br/>(env: LLAMA_ARG_THINK_BUDGET) |
-|`--chat-template JINJA_TEMPLATE`| set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, phi3, phi4, rwkv-world, seed_oss, smolvlm, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
-|`--chat-template-file JINJA_TEMPLATE_FILE`| set custom jinja chat template file (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, phi3, phi4, rwkv-world, seed_oss, smolvlm, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE_FILE) |
+|`--chat-template JINJA_TEMPLATE`| set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, grok-2, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
+|`--chat-template-file JINJA_TEMPLATE_FILE`| set custom jinja chat template file (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, grok-2, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE_FILE) |
 |`--no-prefill-assistant`| whether to prefill the assistant's response if the last message is an assistant message (default: prefill enabled)<br/>when this flag is set, if the last message is an assistant message then it will be treated as a full message and not prefilled<br/><br/>(env: LLAMA_ARG_NO_PREFILL_ASSISTANT) |
-|`-sps, --slot-prompt-similarity SIMILARITY`| how much the prompt of a request must match the prompt of a slot in order to use that slot (default: 0.50, 0.0 = disabled)<br/> |
+|`-sps, --slot-prompt-similarity SIMILARITY`| how much the prompt of a request must match the prompt of a slot in order to use that slot (default: 0.10, 0.0 = disabled)<br/> |
 |`--lora-init-without-apply`| load LoRA adapters without applying them (apply later via POST /lora-adapters) (default: disabled) |
 |`-td, --threads-draft N`| number of threads to use during generation (default: same as --threads) |
 |`-tbd, --threads-batch-draft N`| number of threads to use during batch and prompt processing (default: same as --threads-draft) |
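One possible combination of the template and reasoning flags above, as a sketch (the template name is one of the built-ins listed in the table; binary name and model path assumed):

```sh
# force the built-in chatml template and disable thinking output entirely
llama-server -m models/7B/ggml-model-f16.gguf --jinja --chat-template chatml --reasoning-budget 0
```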
@@ -209,15 +216,17 @@ The project is under active development, and we are [looking for feedback and co
 |`--spec-replace TARGET DRAFT`| translate the string in TARGET into DRAFT if the draft model and main model are not compatible |
 |`-mv, --model-vocoder FNAME`| vocoder model for audio generation (default: unused) |
 |`--tts-use-guide-tokens`| Use guide tokens to improve TTS word recall |
-|`--embd-bge-small-en-default`| use default bge-small-en-v1.5 model (note: can download weights from the internet) |
-|`--embd-e5-small-en-default`| use default e5-small-v2 model (note: can download weights from the internet) |
-|`--embd-gte-small-default`| use default gte-small model (note: can download weights from the internet) |
+|`--embd-gemma-default`| use default EmbeddingGemma model (note: can download weights from the internet) |
 |`--fim-qwen-1.5b-default`| use default Qwen 2.5 Coder 1.5B (note: can download weights from the internet) |
 |`--fim-qwen-3b-default`| use default Qwen 2.5 Coder 3B (note: can download weights from the internet) |
 |`--fim-qwen-7b-default`| use default Qwen 2.5 Coder 7B (note: can download weights from the internet) |
 |`--fim-qwen-7b-spec`| use Qwen 2.5 Coder 7B + 0.5B draft for speculative decoding (note: can download weights from the internet) |
 |`--fim-qwen-14b-spec`| use Qwen 2.5 Coder 14B + 0.5B draft for speculative decoding (note: can download weights from the internet) |
 |`--fim-qwen-30b-default`| use default Qwen 3 Coder 30B A3B Instruct (note: can download weights from the internet) |
+|`--gpt-oss-20b-default`| use gpt-oss-20b (note: can download weights from the internet) |
+|`--gpt-oss-120b-default`| use gpt-oss-120b (note: can download weights from the internet) |
+|`--vision-gemma-4b-default`| use Gemma 3 4B QAT (note: can download weights from the internet) |
+|`--vision-gemma-12b-default`| use Gemma 3 12B QAT (note: can download weights from the internet) |
Note: if both the command-line argument and the environment variable are set for the same parameter, the argument takes precedence over the env var.
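A hedged illustration of that precedence rule together with one of the preset flags above (binary name assumed; flag and env var names are taken from the table):

```sh
# the command-line flag overrides the environment variable when both are set
LLAMA_ARG_N_GPU_LAYERS=0 llama-server --gpt-oss-20b-default -ngl 99
```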