Misc. bug: llama-server eval speed only 1/3 of benchmark and llama-cli #15297

Description

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
version: 6139 (f4586ee)
built with clang version 19.1.5 for x86_64-pc-windows-msvc

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server.exe --prio 2 ^
-c 16384 --keep -1 -fa -kvu ^
--log-file log-server.txt ^
--port 8000 ^
--temp 1.0 --min-p 0.05 --top-p 1.0 --top-k 0 --gpu-layers 99 --swa-full ^
-m E:\Models\gpt-oss-20b-F16.gguf

Problem description & steps to reproduce

llama-server produces a much slower eval speed than both llama-bench and llama-cli.
For gpt-oss-20b-F16.gguf it reaches only around 50 tps on the 5070 Ti with < 40% GPU usage, while llama-cli produces over 140 tps with > 90% GPU usage.
I also tried other models, e.g. gemma-3-12b-it-UD-Q6_K.gguf, which runs at around 30 tps on llama-server; the relative slowdown is at the same level.
I also tried adding/removing/modifying some of the parameters that might influence performance, but it made only a minor difference. Windows might schedule the process on efficiency cores, which could affect speed, but that is handled by --prio 2 and works fine.
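For comparison, the llama-cli run used roughly the following settings (reconstructed to match the server command above; this is not the exact original invocation):

llama-cli.exe --prio 2 ^
-c 16384 -fa ^
--temp 1.0 --min-p 0.05 --top-p 1.0 --top-k 0 --gpu-layers 99 ^
-cnv -m E:\Models\gpt-oss-20b-F16.gguf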

benchmark

| model          |      size |  params | backend  | ngl | fa |   test |              t/s |
| -------------- | --------: | ------: | -------- | --: | -: | -----: | ---------------: |
| gpt-oss ?B F16 | 12.83 GiB | 20.91 B | CUDA,RPC |  99 |  1 |  pp512 | 5238.90 ± 53.42 |
| gpt-oss ?B F16 | 12.83 GiB | 20.91 B | CUDA,RPC |  99 |  1 | tg2048 |   143.33 ± 1.00 |
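
The table comes from an llama-bench run along these lines (flags inferred from the table columns, so this may not be the verbatim command):

llama-bench.exe -m E:\Models\gpt-oss-20b-F16.gguf -ngl 99 -fa 1 -p 512 -n 2048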

First Bad Commit

I use the binary releases directly. I tried several older versions; for builds newer than b5000 the problem remains, so I don't know exactly when it started. When I first used llama.cpp I already noticed a performance difference, but back then it was only around 10%, which I assumed was server overhead. It should not be as large as 60%.
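
To narrow down the first bad release, a small batch loop can run the same tg2048 test against several extracted release builds (the builds\b* folder layout below is hypothetical, one folder per unpacked release zip):

@echo off
rem Hypothetical layout: builds\b5000, builds\b5500, builds\b6000, ...
for /d %%B in (builds\b*) do (
    echo === %%~nxB ===
    "%%B\llama-bench.exe" -m E:\Models\gpt-oss-20b-F16.gguf -ngl 99 -fa 1 -n 2048
)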

Relevant log output

llama-server.exe -t 20 -tb 28 --prio 2 -c 16384 --keep -1 -fa -kvu --log-file log-server.txt --port 8000 --temp 1.0 --min-p 0.05 --top-p 1.0 --top-k 0 --gpu-layers 99 --swa-full -m E:\Models\gpt-oss-20b-F16.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
build: 6139 (f4586ee5) with clang version 19.1.5 for x86_64-pc-windows-msvc
system info: n_threads = 20, n_threads_batch = 28, total_threads = 28

system_info: n_threads = 20 (n_threads_batch = 28) / 28 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8000, http threads: 27
main: loading model
srv    load_model: loading model 'E:\Models\gpt-oss-20b-F16.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5070 Ti) - 14921 MiB free
llama_model_loader: loaded meta data with 37 key-value pairs and 459 tensors from E:\Models\gpt-oss-20b-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gpt-oss
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gpt-Oss-20B
llama_model_loader: - kv   3:                           general.basename str              = Gpt-Oss-20B
llama_model_loader: - kv   4:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   5:                         general.size_label str              = 20B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   8:                               general.tags arr[str,2]       = ["vllm", "text-generation"]
llama_model_loader: - kv   9:                        gpt-oss.block_count u32              = 24
llama_model_loader: - kv  10:                     gpt-oss.context_length u32              = 131072
llama_model_loader: - kv  11:                   gpt-oss.embedding_length u32              = 2880
llama_model_loader: - kv  12:                gpt-oss.feed_forward_length u32              = 2880
llama_model_loader: - kv  13:               gpt-oss.attention.head_count u32              = 64
llama_model_loader: - kv  14:            gpt-oss.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                     gpt-oss.rope.freq_base f32              = 150000.000000
llama_model_loader: - kv  16:   gpt-oss.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                       gpt-oss.expert_count u32              = 32
llama_model_loader: - kv  18:                  gpt-oss.expert_used_count u32              = 4
llama_model_loader: - kv  19:               gpt-oss.attention.key_length u32              = 64
llama_model_loader: - kv  20:             gpt-oss.attention.value_length u32              = 64
llama_model_loader: - kv  21:                          general.file_type u32              = 1
llama_model_loader: - kv  22:           gpt-oss.attention.sliding_window u32              = 128
llama_model_loader: - kv  23:         gpt-oss.expert_feed_forward_length u32              = 2880
llama_model_loader: - kv  24:                  gpt-oss.rope.scaling.type str              = yarn
llama_model_loader: - kv  25:                gpt-oss.rope.scaling.factor f32              = 32.000000
llama_model_loader: - kv  26: gpt-oss.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - kv  28:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  29:                         tokenizer.ggml.pre str              = gpt-4o
llama_model_loader: - kv  30:                      tokenizer.ggml.tokens arr[str,201088]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,201088]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  32:                      tokenizer.ggml.merges arr[str,446189]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  33:                tokenizer.ggml.bos_token_id u32              = 199998
llama_model_loader: - kv  34:                tokenizer.ggml.eos_token_id u32              = 200002
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 200017
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {# Chat template fixes by Unsloth #}\n...
llama_model_loader: - type  f32:  289 tensors
llama_model_loader: - type  f16:   98 tensors
llama_model_loader: - type mxfp4:   72 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 12.83 GiB (5.27 BPW)
load: printing all EOG tokens:
load:   - 199999 ('<|endoftext|>')
load:   - 200002 ('<|return|>')
load:   - 200007 ('<|end|>')
load:   - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch             = gpt-oss
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2880
print_info: n_layer          = 24
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 128
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 2880
print_info: n_expert         = 32
print_info: n_expert_used    = 4
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = yarn
print_info: freq_base_train  = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: model type       = ?B
print_info: model params     = 20.91 B
print_info: general.name     = Gpt-Oss-20B
print_info: n_ff_exp         = 2880
print_info: vocab type       = BPE
print_info: n_vocab          = 201088
print_info: n_merges         = 446189
print_info: BOS token        = 199998 '<|startoftext|>'
print_info: EOS token        = 200002 '<|return|>'
print_info: EOT token        = 199999 '<|endoftext|>'
print_info: PAD token        = 200017 '<|reserved_200017|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 199999 '<|endoftext|>'
print_info: EOG token        = 200002 '<|return|>'
print_info: EOG token        = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors:        CUDA0 model buffer size = 12036.68 MiB
load_tensors:   CPU_Mapped model buffer size =  1104.61 MiB
....................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: kv_unified    = true
llama_context: freq_base     = 150000.0
llama_context: freq_scale    = 0.03125
llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.77 MiB
llama_kv_cache_unified_iswa: using full-size SWA cache (ref: https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 16384 cells
llama_kv_cache_unified:      CUDA0 KV buffer size =   384.00 MiB
llama_kv_cache_unified: size =  384.00 MiB ( 16384 cells,  12 layers,  1/1 seqs), K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_kv_cache_unified_iswa: creating     SWA KV cache, size = 16384 cells
llama_kv_cache_unified:      CUDA0 KV buffer size =   384.00 MiB
llama_kv_cache_unified: size =  384.00 MiB ( 16384 cells,  12 layers,  1/1 seqs), K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_context:      CUDA0 compute buffer size =   398.38 MiB
llama_context:  CUDA_Host compute buffer size =    69.65 MiB
llama_context: graph nodes  = 1352
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16384
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 16384
main: model loaded
main: server is listening on http://127.0.0.1:8000 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET / 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /favicon.ico 127.0.0.1 404
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 16384, n_keep = -1, n_prompt_tokens = 16
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 16, n_tokens = 16, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 16, n_tokens = 16
slot      release: id  0 | task 0 | stop processing: n_past = 873, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =      78.67 ms /    16 tokens (    4.92 ms per token,   203.37 tokens per second)
       eval time =   17640.10 ms /   858 tokens (   20.56 ms per token,    48.64 tokens per second)
      total time =   17718.77 ms /   874 tokens
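
The GET / and /props requests in the log suggest the generation above was triggered from the built-in web UI. The same measurement can be reproduced with a plain /completion request, e.g. (prompt text and n_predict here are arbitrary); the JSON response should also contain a timings object with the same tokens-per-second figures the server prints above:

curl.exe http://127.0.0.1:8000/completion ^
  -H "Content-Type: application/json" ^
  -d "{\"prompt\": \"Write a short story about a robot.\", \"n_predict\": 512}"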
