Releases · ggml-org/llama.cpp
b6586
model : add GroveMoE support (#15510)

* add GroveMoE support
* remove constexpr that fails on certain compilers
* revert crude scalar div implementation, use cast
* build_attn_inp_kv_unified -> build_attn_inp_kv
* fix build_attn
* re-apply ffn_exps regex changes
b6585
vendors: update miniaudio version (#16212)

* vendor: update miniaudio.h
* vendor: update miniaudio.h

Signed-off-by: Aaron Teo <[email protected]>
b6583
CUDA: add a fused top-K MoE kernel (#16130)

* CUDA: add a fused top-K MoE kernel

  This kernel does the following:
  1. softmax over the logits per token [n_experts, n_tokens]
  2. argmax reduce over the top-k (n_experts_used) logits
  3. write weights + ids to global memory

  It is intended as a fusion of the softmax->top-k->get_rows pipeline for MoE models.

* Refactor into ggml_cuda_should_use_topk_moe
* Review: use better coalescing pattern, use WARP_SIZE, store logits into registers before
* Review: format + micro-optimizations
* Fix bug: fix tie breakers
* Add optional norm + clean-up code
* Use smem for final write
* Add bounds check
* Use better memory pattern for writeback
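For reference, here is a minimal scalar C++ sketch of the three fused steps described above (softmax, top-k selection, weight/id writeback with optional renormalization). The function name, the row-major [n_tokens, n_experts] logits layout, and the tie-break-by-lower-id rule are assumptions for illustration, not the actual CUDA kernel code.

```cpp
// Hypothetical scalar reference of the fused top-k MoE routing steps; names and layout
// are assumptions, not the kernel's implementation.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

void topk_moe_reference(const float * logits,   // [n_tokens * n_experts], row-major per token
                        float       * weights,  // [n_tokens * n_expert_used]
                        int32_t     * ids,      // [n_tokens * n_expert_used]
                        int n_tokens, int n_experts, int n_expert_used, bool norm) {
    std::vector<float>   p(n_experts);
    std::vector<int32_t> idx(n_experts);
    for (int t = 0; t < n_tokens; ++t) {
        const float * row = logits + (size_t) t * n_experts;
        // 1. softmax over this token's expert logits (max-subtracted for stability)
        const float mx = *std::max_element(row, row + n_experts);
        float sum = 0.0f;
        for (int e = 0; e < n_experts; ++e) { p[e] = std::exp(row[e] - mx); sum += p[e]; }
        for (int e = 0; e < n_experts; ++e) { p[e] /= sum; }
        // 2. top-k expert indices: higher probability first, lower expert id wins ties
        for (int e = 0; e < n_experts; ++e) { idx[e] = e; }
        std::partial_sort(idx.begin(), idx.begin() + n_expert_used, idx.end(),
            [&](int32_t a, int32_t b) { return p[a] != p[b] ? p[a] > p[b] : a < b; });
        // 3. write weights + ids; optionally renormalize so the selected weights sum to 1
        float ksum = 0.0f;
        for (int k = 0; k < n_expert_used; ++k) { ksum += p[idx[k]]; }
        for (int k = 0; k < n_expert_used; ++k) {
            ids    [t * n_expert_used + k] = idx[k];
            weights[t * n_expert_used + k] = norm ? p[idx[k]] / ksum : p[idx[k]];
        }
    }
}
```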
b6582
model-conversion : add embedding prompt file support (#15871)

This commit adds support for passing a prompt file to the model conversion targets/scripts. It also updates logits.cpp to print out embedding information in the same format as when running the original embedding model.

The motivation for this is that it allows us to pass files of different sizes when running the converted models and validating the logits. This can be particularly important when testing the sliding-window functionality of models, where the sequence length needs to exceed a certain number of tokens to trigger the sliding-window logic.
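As an illustration of the comparison workflow described above, the hypothetical snippet below reads a prompt of arbitrary length from a file and prints embedding values in a fixed format so that converted-model output can be diffed against the original model's output. It is not the repository's logits.cpp; all names and the placeholder embedding are assumptions.

```cpp
// Hypothetical illustration (not the repository's logits.cpp): read a prompt from a file
// and print embedding values in a stable, fixed-precision format for diffing.
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

int main(int argc, char ** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s PROMPT_FILE\n", argv[0]);
        return 1;
    }
    std::ifstream f(argv[1]);
    std::stringstream ss;
    ss << f.rdbuf();
    const std::string prompt = ss.str();  // prompt text of any size (e.g. long enough to trigger SWA)

    // placeholder values; in a real run these would come from the converted model
    const std::vector<float> embd = { 0.012345f, -0.678901f, 0.234567f };

    std::printf("prompt file: %s (%zu bytes)\n", argv[1], prompt.size());
    std::printf("embd[0..%zu] =", embd.size() - 1);
    for (const float v : embd) {
        std::printf(" %9.6f", v);  // fixed width/precision keeps the two outputs directly comparable
    }
    std::printf("\n");
    return 0;
}
```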
b6580
ggml : fix loongarch lsx compilation error (#15864)
b6578
llama : add support for qwen3 reranker (#15824)
b6576
metal : relax reorder conditions (#16216)
b6575
metal : restore im2col perf (#16219)
b6574
rpc : use ggml logging facilities

Use the RPC_DEBUG environment variable to enable debug messages. Add a helper macro LOG_DBG() which does an early check of the env var before calling GGML_LOG_DEBUG(). Make sure we log a debug message for every server function.
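A minimal sketch of how such an early-checked macro can look, assuming a cached getenv lookup and using fprintf as a stand-in for GGML_LOG_DEBUG(); this is an illustration, not the actual RPC backend code.

```cpp
// Hypothetical sketch of an early-checked debug macro: RPC_DEBUG is read once and cached,
// and the log call is skipped entirely when debugging is off.
#include <cstdio>
#include <cstdlib>

static bool rpc_debug_enabled() {
    static const bool enabled = std::getenv("RPC_DEBUG") != nullptr;  // checked only once
    return enabled;
}

#define LOG_DBG(...)                                                              \
    do {                                                                          \
        if (rpc_debug_enabled()) {                                                \
            std::fprintf(stderr, __VA_ARGS__); /* stand-in for GGML_LOG_DEBUG */  \
        }                                                                         \
    } while (0)

int main() {
    LOG_DBG("rpc: handling %s request\n", "get_tensor");  // printed only when RPC_DEBUG is set
    return 0;
}
```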
b6572
ci: run the x64 and arm ci on the github machines instead (#16183)

* run the x64 ci on regular machines
* set up the same thing for arm
* fix test-quantize-perf just like #12306
* try to disable sve
* add another sve run