Releases: ggml-org/llama.cpp

b6593

26 Sep 11:45
b995a10
common : use cpp-httplib as a cURL alternative for downloads (#16185)

* vendor : update httplib

Signed-off-by: Adrien Gallouët <[email protected]>

* common : use cpp-httplib as a cURL alternative for downloads

The existing cURL implementation is intentionally left untouched to
prevent any regressions and to allow for safe, side-by-side testing by
toggling the `LLAMA_CURL` CMake option.
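As a quick way to exercise the new path, the `LLAMA_CURL` option mentioned above can be toggled at configure time. A sketch of such a side-by-side build (only the `LLAMA_CURL` flag comes from the commit message; the rest is llama.cpp's standard CMake invocation):

```shell
# Configure with the cURL downloader disabled so downloads go through
# the cpp-httplib implementation; flip to ON to compare side by side.
cmake -B build -DLLAMA_CURL=OFF
cmake --build build --config Release
```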

Signed-off-by: Adrien Gallouët <[email protected]>

* ggml : Bump to Windows 10

Signed-off-by: Adrien Gallouët <[email protected]>

---------

Signed-off-by: Adrien Gallouët <[email protected]>

b6591

26 Sep 11:03
9b26511
ggml-cpu: implement MXFP4 SIMD for s390x (#16193)

* ggml-cpu: impl mxfp4 s390x

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: missing s = sumf

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix incorrect kval_mxfp4 type

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: rework mxfp4

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: missing delta calc

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix typo

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix typo for vec_splats

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: expand to 2 blocks per loop

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: add unroll to boost perf

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: back to 1 block per loop to test perf

Signed-off-by: Aaron Teo <[email protected]>

* Revert "ggml-cpu: back to 1 block per loop to test perf"

This reverts commit 1fe55724e2dc295701101bf838bdd4a512237492.

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: rm unroll from single block

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
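For context on what these kernels compute: MXFP4 packs a block of 32 FP4 (E2M1) values behind a single shared power-of-two (E8M0) scale. A scalar sketch of dequantizing one block — the value table, nibble layout, and function name here are illustrative assumptions, not ggml's exact implementation:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical scalar dequantizer for one MXFP4 block: 32 FP4 (E2M1) codes
// sharing one E8M0 power-of-two scale. Assumed nibble layout: qs[j] holds
// element j in the low nibble and element j+16 in the high nibble.
static std::vector<float> dequant_mxfp4_block(uint8_t e8m0, const uint8_t qs[16]) {
    // E2M1 magnitudes for codes 0..7; bit 3 of each 4-bit code is the sign.
    static const float kval[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
    const float d = std::ldexp(1.0f, (int) e8m0 - 127);  // scale = 2^(e - 127)
    std::vector<float> out(32);
    for (int j = 0; j < 16; ++j) {
        const uint8_t lo = qs[j] & 0x0F;
        const uint8_t hi = qs[j] >> 4;
        out[j]      = d * ((lo & 8) ? -kval[lo & 7] : kval[lo & 7]);
        out[j + 16] = d * ((hi & 8) ? -kval[hi & 7] : kval[hi & 7]);
    }
    return out;
}
```

The SIMD work in the commits above amounts to vectorizing this lookup-and-scale loop (and the following dot product) with s390x vector intrinsics.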

b6587

26 Sep 01:21
0f7c696
musa: fix build warnings (#15611)

Signed-off-by: Xiaodong Ye <[email protected]>

b6586

25 Sep 18:46
835b2b9
model : add GroveMoE support (#15510)

* add GroveMoE support

* remove constexpr that fails on certain compilers

* revert crude scalar div implementation, use cast

* build_attn_inp_kv_unified -> build_attn_inp_kv

* fix build_attn

* re-apply ffn_exps regex changes

b6585

25 Sep 16:30
b05a9d6
vendors: update miniaudio version (#16212)

* vendor: update miniaudio.h

Signed-off-by: Aaron Teo <[email protected]>

* vendor: update miniaudio.h

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>

b6583

25 Sep 16:08
077c94d
CUDA: add a fused top-K MoE kernel (#16130)

* CUDA: add a fused top-K MoE kernel

This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory

It is intended as a fusion of the softmax->top-k->get_rows pipeline for MoE models
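In scalar form, the three steps listed above amount to the following per-token CPU reference (a hedged sketch with hypothetical names, not the CUDA kernel itself):

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Per-token reference for the fused top-K MoE routing:
// softmax over expert logits, select the top n_expert_used, optionally
// renormalize the selected weights to sum to 1.
static void topk_moe_ref(const std::vector<float> & logits, int n_expert_used,
                         std::vector<float> & weights, std::vector<int> & ids,
                         bool norm) {
    const int n_experts = (int) logits.size();

    // 1. softmax (max-subtracted for numerical stability)
    const float mx = *std::max_element(logits.begin(), logits.end());
    std::vector<float> p(n_experts);
    float sum = 0.0f;
    for (int i = 0; i < n_experts; ++i) { p[i] = std::exp(logits[i] - mx); sum += p[i]; }
    for (float & v : p) v /= sum;

    // 2. top-k: keep the n_expert_used highest-probability expert ids
    std::vector<int> idx(n_experts);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + n_expert_used, idx.end(),
                      [&](int a, int b) { return p[a] > p[b]; });
    ids.assign(idx.begin(), idx.begin() + n_expert_used);

    // 3. gather weights; optional renormalization over the selected experts
    weights.clear();
    float wsum = 0.0f;
    for (int id : ids) { weights.push_back(p[id]); wsum += p[id]; }
    if (norm) {
        for (float & w : weights) w /= wsum;
    }
}
```

The CUDA kernel performs these steps in one pass per token using warp reductions, which is what the coalescing, shared-memory, and tie-breaker refinements below are tuning.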

* Refactor into ggml_cuda_should_use_topk_moe

* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before

* Review: format + micro-optimizations

* Fix bug: fix tie breakers

* Add optional norm + clean-up code

* Use smem for final write

* Add bounds check

* Use better memory pattern for writeback

b6582

25 Sep 12:47
aa3ee0e
model-conversion : add embedding prompt file support (#15871)

This commit adds support for passing a prompt file to the model
conversion targets/scripts. It also updates logits.cpp to print
embedding information in the same format as when running the original
embedding model.

The motivation for this is that it allows us to pass files of different
sizes when running the converted models and validating the logits.

This can be particularly important when testing the sliding window
functionality of models where the sequence length needs to exceed a
certain number of tokens to trigger the sliding window logic.

b6580

25 Sep 12:34
aa719c2
ggml : fix loongarch lsx compilation error (#15864)

b6578

25 Sep 12:11
b5bd037
llama : add support for qwen3 reranker (#15824)

b6576

25 Sep 10:43
4ea0079
metal : relax reorder conditions (#16216)