Releases · ggml-org/llama.cpp
b6593
common : use cpp-httplib as a cURL alternative for downloads (#16185)

* vendor : update httplib
* common : use cpp-httplib as a cURL alternative for downloads

  The existing cURL implementation is intentionally left untouched to prevent any regressions and to allow for safe, side-by-side testing by toggling the `LLAMA_CURL` CMake option.

* ggml : bump to Windows 10

Signed-off-by: Adrien Gallouët
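For context (an editorial sketch, not part of the PR): the snippet below shows what a download over the vendored cpp-httplib can look like, streaming the response body straight to disk. The host, path, and output filename are placeholders, and this is not llama.cpp's actual downloader. Switching between the two back-ends remains a build-time choice via the `LLAMA_CURL` CMake option mentioned above.

```cpp
// Minimal sketch: fetch a file over HTTPS with cpp-httplib instead of cURL.
// Host, path and output filename are illustrative placeholders.
#define CPPHTTPLIB_OPENSSL_SUPPORT
#include "httplib.h"

#include <cstdio>
#include <fstream>

int main() {
    httplib::Client cli("https://example.com"); // placeholder host
    cli.set_follow_location(true);              // follow HTTP redirects

    std::ofstream out("model.gguf", std::ios::binary);

    // Stream the response body to disk instead of buffering it in memory.
    auto res = cli.Get("/path/to/model.gguf",
        [&](const char * data, size_t len) {
            out.write(data, (std::streamsize) len);
            return true; // keep receiving
        });

    if (!res || res->status != 200) {
        std::fprintf(stderr, "download failed\n");
        return 1;
    }
    return 0;
}
```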
b6591
ggml-cpu: implement MXFP4 SIMD for s390x (#16193)

* ggml-cpu: implement MXFP4 for s390x
* ggml-cpu: add missing s = sumf
* ggml-cpu: fix incorrect kval_mxfp4 type
* ggml-cpu: rework MXFP4
* ggml-cpu: add missing delta calculation
* ggml-cpu: fix typo
* ggml-cpu: fix typo for vec_splats
* ggml-cpu: expand to 2 blocks per loop
* ggml-cpu: add unroll to boost performance
* ggml-cpu: back to 1 block per loop to test performance
* Revert "ggml-cpu: back to 1 block per loop to test perf" (reverts commit 1fe55724e2dc295701101bf838bdd4a512237492)
* ggml-cpu: remove unroll from single block

Signed-off-by: Aaron Teo
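For context (not from the PR): MXFP4 stores 32 values per block behind a shared power-of-two (E8M0) scale, with each value encoded as a 4-bit E2M1 code. The scalar sketch below illustrates the dequantization that the s390x SIMD path vectorizes; the struct layout, table, and names follow the OCP microscaling (MX) spec and are illustrative rather than copied from ggml.

```cpp
// Hedged scalar illustration of MXFP4 dequantization (the operation the
// s390x SIMD kernel vectorizes). Layout and names are illustrative only.
#include <cmath>
#include <cstdint>

static constexpr int MXFP4_BLOCK = 32;

// E2M1 (FP4) code points: 0, 0.5, 1, 1.5, 2, 3, 4, 6 and their negatives.
static const float kvalues_fp4[16] = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f,
};

struct block_mxfp4_sketch {
    uint8_t e;      // shared E8M0 scale: 2^(e - 127)
    uint8_t qs[16]; // 32 x 4-bit codes, two per byte
};

static void dequant_mxfp4(const block_mxfp4_sketch & b, float * out) {
    const float d = std::ldexp(1.0f, (int) b.e - 127); // 2^(e - 127)
    for (int i = 0; i < 16; ++i) {
        out[i]      = d * kvalues_fp4[b.qs[i] & 0x0F]; // low nibble
        out[i + 16] = d * kvalues_fp4[b.qs[i] >> 4];   // high nibble
    }
}
```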
b6587
musa: fix build warnings (#15611)

Signed-off-by: Xiaodong Ye
b6586
model : add GroveMoE support (#15510)

* add GroveMoE support
* remove constexpr that fails on certain compilers
* revert crude scalar div implementation, use cast
* build_attn_inp_kv_unified -> build_attn_inp_kv
* fix build_attn
* re-apply ffn_exps regex changes
b6585
vendors: update miniaudio version (#16212)

* vendor: update miniaudio.h

Signed-off-by: Aaron Teo
b6583
CUDA: add a fused top-K MoE kernel (#16130)

This kernel does the following:

1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory

It is intended as a fusion of the softmax -> top-k -> get_rows pipeline for MoE models.

Follow-up commits:

* Refactor into ggml_cuda_should_use_topk_moe
* Review: use a better coalescing pattern, use WARP_SIZE, store logits into registers beforehand
* Review: formatting + micro-optimizations
* Fix tie-breaker bug
* Add optional norm + clean up code
* Use smem for the final write
* Add bounds check
* Use a better memory pattern for the writeback
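For context (not from the PR): a plain C++ reference of the per-token routing computation the fused kernel covers, i.e. softmax over the expert logits, selection of the `n_experts_used` largest, optional renormalization, and the weights/ids writeback. Function and variable names are illustrative, not the CUDA kernel's.

```cpp
// Reference sketch (per token) of the fused top-K MoE routing step.
// This is not the CUDA kernel; it only spells out the math being fused.
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

void topk_moe_ref(const float * logits, int n_experts, int n_experts_used,
                  bool norm, float * weights, int * ids) {
    // 1. softmax over the expert logits (subtract the max for stability)
    const float max_l = *std::max_element(logits, logits + n_experts);
    std::vector<float> p(n_experts);
    float sum = 0.0f;
    for (int e = 0; e < n_experts; ++e) {
        p[e] = std::exp(logits[e] - max_l);
        sum += p[e];
    }
    for (int e = 0; e < n_experts; ++e) {
        p[e] /= sum;
    }

    // 2. top-k: indices of the n_experts_used largest probabilities
    //    (lower index wins ties, a fixed tie-break rule)
    std::vector<int> idx(n_experts);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + n_experts_used, idx.end(),
        [&](int a, int b) { return p[a] > p[b] || (p[a] == p[b] && a < b); });

    // 3. write weights + ids; optionally renormalize the kept weights to sum to 1
    float kept = 0.0f;
    for (int k = 0; k < n_experts_used; ++k) {
        kept += p[idx[k]];
    }
    for (int k = 0; k < n_experts_used; ++k) {
        ids[k]     = idx[k];
        weights[k] = norm ? p[idx[k]] / kept : p[idx[k]];
    }
}
```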
b6582
model-conversion : add embedding prompt file support (#15871)

This commit adds support for passing a prompt file to the model conversion targets/scripts. It also updates logits.cpp to print embedding information in the same format as when running the original embedding model.

The motivation is that this allows passing files of different sizes when running the converted models and validating the logits. This is particularly important when testing the sliding-window functionality of models, where the sequence length needs to exceed a certain number of tokens to trigger the sliding-window logic.
b6580
ggml : fix loongarch lsx compilation error (#15864)
b6578
llama : add support for qwen3 reranker (#15824)
b6576
metal : relax reorder conditions (#16216)