Releases · ngxson/llama.cpp
b5190
Force FP32 compute in GLM4 FFN Down (#13101)
* Force FP32 compute in cuBLAS GEMM
* Revert "Force FP32 compute in cuBLAS GEMM"
  This reverts commit 6efd872732159ab88ee7b3c1d77ba5ebc83079bd.
* Force F32 compute in GLM4 ffn down
* Edit comment to clarify issue
Co-authored-by: Johannes Gäßler <[email protected]>
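For context, ggml provides a per-node precision override, ggml_mul_mat_set_prec(), which is the mechanism behind this fix. A minimal sketch of the pattern, assuming hypothetical builder-function and tensor names (only the ggml calls are real API):

```cpp
// Illustrative sketch, not the exact patch: instead of forcing F32 for
// every cuBLAS GEMM (the reverted first attempt), only the GLM4 FFN down
// projection opts in to F32 accumulation.
#include "ggml.h"

static struct ggml_tensor * build_glm4_ffn_down(struct ggml_context * ctx,
                                                struct ggml_tensor * ffn_down_w,
                                                struct ggml_tensor * cur) {
    cur = ggml_mul_mat(ctx, ffn_down_w, cur);
    // accumulate this GEMM in FP32; lower-precision accumulation
    // is what caused the issue on this layer
    ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
    return cur;
}
```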
b5189
clip : fix pixtral on some GPU backends (#13097)
* clip : fix pixtral on some GPU backends
* refactor inp_raw set
* rm outdated comment
* fix dynamic size
* add TODO
b5188
change the reorder tensor from init to execute OP (#13003)
b5187
rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)
RPC_CMD_SET_TENSOR always returns an empty response, and it is sent 4 times per token. Skipping the wait for this empty response improves token generation (TG) speed; the size of the gain depends on network latency.
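A minimal sketch of the fire-and-forget idea; the command framing, enum value, and function name below are assumptions for illustration, not the actual ggml-rpc wire format:

```cpp
// Sketch of the optimization: since the server's response to SET_TENSOR
// is always empty, the client can return right after writing the request
// instead of blocking on a recv() round trip.
#include <cstdint>
#include <cstddef>
#include <sys/types.h>
#include <sys/socket.h>

enum rpc_cmd : uint8_t { RPC_CMD_SET_TENSOR = 6 }; // illustrative value

static bool send_set_tensor(int sockfd, const void * payload, uint64_t size) {
    const uint8_t cmd = RPC_CMD_SET_TENSOR;
    if (send(sockfd, &cmd,  sizeof(cmd),  0) != (ssize_t) sizeof(cmd))  return false;
    if (send(sockfd, &size, sizeof(size), 0) != (ssize_t) sizeof(size)) return false;
    if (send(sockfd, payload, size, 0)        != (ssize_t) size)        return false;
    // no recv() here: the empty response is never awaited, saving one
    // network round trip on each of the 4 SET_TENSOR calls per token
    return true;
}
```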
b5186
clip : remove boi/eoi embeddings for GLM-edge model (#13081)
b5185
embeddings : fix batch sizes (#13076)
b5184
ggml : fix trailing whitespaces (#0)
b5181
CUDA: use switch statements in constexpr functions (#13095)
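For illustration, the pattern in question is a constexpr function written as a switch over an enum, so the result can fold at compile time and the compiler can warn about unhandled enumerators; the enum and values below are made up, not the actual CUDA code:

```cpp
// Generic illustration (hypothetical enum and values): a constexpr
// lookup expressed as a switch rather than a chain of if/else.
enum class ggml_type_ex { F32, F16, Q8_0 };

constexpr int block_size(ggml_type_ex t) {
    switch (t) {
        case ggml_type_ex::F32:  return 1;
        case ggml_type_ex::F16:  return 1;
        case ggml_type_ex::Q8_0: return 32;
    }
    return -1; // unreachable for valid enumerators
}

static_assert(block_size(ggml_type_ex::Q8_0) == 32, "folded at compile time");
```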
b5180
cmake : do not include ./src as public for libllama (#13062)
* cmake : do not include ./src as public for libllama
* cmake : rework tests
* llguidance : remove unicode include
* cmake : make c++17 private
b5178
arg : add --no-mmproj-offload (#13093)
* arg : add --no-mmproj-offload
* Update common/arg.cpp