Releases · ggml-org/llama.cpp
b6586
model : add GroveMoE support (#15510)

* add GroveMoE support
* remove constexpr that fails on certain compilers
* revert crude scalar div implementation, use cast
* build_attn_inp_kv_unified -> build_attn_inp_kv
* fix build_attn
* re-apply ffn_exps regex changes
b6585
vendors: update miniaudio version (#16212)

* vendor: update miniaudio.h
* vendor: update miniaudio.h

Signed-off-by: Aaron Teo <[email protected]>
b6583
CUDA: add a fused top-K MoE kernel (#16130)

* CUDA: add a fused top-K MoE kernel

  This kernel does the following:
  1. softmax over the logits per token [n_experts, n_tokens]
  2. argmax reduce over the top-k (n_experts_used) logits
  3. write weights + ids to global memory

  It is intended as a fusion of the softmax->top-k->get_rows pipeline for MoE models.

* Refactor into ggml_cuda_should_use_topk_moe
* Review: use better coalescing pattern, use WARP_SIZE, store logits into registers before
* Review: format + micro-optimizations
* Fix bug: fix tie breakers
* Add optional norm + clean-up code
* Use smem for final write
* Add bounds check
* Use better memory pattern for writeback
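For reference, here is a minimal scalar C++ sketch of the three fused steps described above (softmax, top-k selection, weight/id writeback with optional renormalization). The function name, the row-major [n_tokens, n_experts] logits layout, and the tie-break-by-lower-id rule are assumptions for illustration, not the actual CUDA kernel code.

```cpp
// Hypothetical scalar reference of the fused top-k MoE routing steps; names and layout
// are assumptions, not the kernel's implementation.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

void topk_moe_reference(const float * logits,   // [n_tokens * n_experts], row-major per token
                        float       * weights,  // [n_tokens * n_expert_used]
                        int32_t     * ids,      // [n_tokens * n_expert_used]
                        int n_tokens, int n_experts, int n_expert_used, bool norm) {
    std::vector<float>   p(n_experts);
    std::vector<int32_t> idx(n_experts);
    for (int t = 0; t < n_tokens; ++t) {
        const float * row = logits + (size_t) t * n_experts;
        // 1. softmax over this token's expert logits (max-subtracted for stability)
        const float mx = *std::max_element(row, row + n_experts);
        float sum = 0.0f;
        for (int e = 0; e < n_experts; ++e) { p[e] = std::exp(row[e] - mx); sum += p[e]; }
        for (int e = 0; e < n_experts; ++e) { p[e] /= sum; }
        // 2. top-k expert indices: higher probability first, lower expert id wins ties
        for (int e = 0; e < n_experts; ++e) { idx[e] = e; }
        std::partial_sort(idx.begin(), idx.begin() + n_expert_used, idx.end(),
            [&](int32_t a, int32_t b) { return p[a] != p[b] ? p[a] > p[b] : a < b; });
        // 3. write weights + ids; optionally renormalize so the selected weights sum to 1
        float ksum = 0.0f;
        for (int k = 0; k < n_expert_used; ++k) { ksum += p[idx[k]]; }
        for (int k = 0; k < n_expert_used; ++k) {
            ids    [t * n_expert_used + k] = idx[k];
            weights[t * n_expert_used + k] = norm ? p[idx[k]] / ksum : p[idx[k]];
        }
    }
}
```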
b6582
model-conversion : add embedding prompt file support (#15871)

This commit adds support for passing a prompt file to the model conversion targets/scripts. It also updates logits.cpp to print out embedding information in the same format as when running the original embedding model.

The motivation for this is that it allows us to pass files of different sizes when running the converted models and validating the logits. This can be particularly important when testing the sliding-window functionality of models, where the sequence length needs to exceed a certain number of tokens to trigger the sliding-window logic.
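As an illustration of the comparison workflow described above, the hypothetical snippet below reads a prompt of arbitrary length from a file and prints embedding values in a fixed format so that converted-model output can be diffed against the original model's output. It is not the repository's logits.cpp; all names and the placeholder embedding are assumptions.

```cpp
// Hypothetical illustration (not the repository's logits.cpp): read a prompt from a file
// and print embedding values in a stable, fixed-precision format for diffing.
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

int main(int argc, char ** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s PROMPT_FILE\n", argv[0]);
        return 1;
    }
    std::ifstream f(argv[1]);
    std::stringstream ss;
    ss << f.rdbuf();
    const std::string prompt = ss.str();  // prompt text of any size (e.g. long enough to trigger SWA)

    // placeholder values; in a real run these would come from the converted model
    const std::vector<float> embd = { 0.012345f, -0.678901f, 0.234567f };

    std::printf("prompt file: %s (%zu bytes)\n", argv[1], prompt.size());
    std::printf("embd[0..%zu] =", embd.size() - 1);
    for (const float v : embd) {
        std::printf(" %9.6f", v);  // fixed width/precision keeps the two outputs directly comparable
    }
    std::printf("\n");
    return 0;
}
```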
b6580
ggml : fix loongarch lsx compilation error (#15864)
b6578
llama : add support for qwen3 reranker (#15824)
b6576
metal : relax reorder conditions (#16216)
b6575
metal : restore im2col perf (#16219)
b6574
rpc : use ggml logging facilities

Use the RPC_DEBUG environment variable to enable debug messages. Add a helper macro LOG_DBG() which does an early check of the env var before calling GGML_LOG_DEBUG(). Make sure we log a debug message for every server function.
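A minimal sketch of how such an early-checked macro can look, assuming a cached getenv lookup and using fprintf as a stand-in for GGML_LOG_DEBUG(); this is an illustration, not the actual RPC backend code.

```cpp
// Hypothetical sketch of an early-checked debug macro: RPC_DEBUG is read once and cached,
// and the log call is skipped entirely when debugging is off.
#include <cstdio>
#include <cstdlib>

static bool rpc_debug_enabled() {
    static const bool enabled = std::getenv("RPC_DEBUG") != nullptr;  // checked only once
    return enabled;
}

#define LOG_DBG(...)                                                              \
    do {                                                                          \
        if (rpc_debug_enabled()) {                                                \
            std::fprintf(stderr, __VA_ARGS__); /* stand-in for GGML_LOG_DEBUG */  \
        }                                                                         \
    } while (0)

int main() {
    LOG_DBG("rpc: handling %s request\n", "get_tensor");  // printed only when RPC_DEBUG is set
    return 0;
}
```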
b6572
ci: run the x64 and arm ci on the github machines instead (#16183)

* run the x64 ci on regular machines
* set up the same thing for arm
* fix test-quantize-perf just like #12306
* try to disable sve
* add another sve run