vulkan: Use larger workgroups for mul_mat_vec when M is small #15355
Conversation
Results for the 9070xt (before / after) and a770 (before / after):
On Intel with FA, the performance is expectedly lower in both cases: 48.11 ± 0 (PR) and 48.32 ± 0 (master). |
This was with the change as is, right? Not enabling larger workgroups? |
Yes, I copied the benchmarking flags from your results. Should I run it any other way? |
This is a good data point, thanks. It means subgroupAdd is a speedup. The other thing to try would be to enable the code at https://github.com/ggml-org/llama.cpp/pull/15355/files#diff-35a5049d5eebe22eda1e0d661bd87639b31aafeba62deeaaaca9c13ec3e71d11R4433 unconditionally. |
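To make the subgroupAdd point concrete, here is a CPU-side model of the two reduction strategies. This is an illustration of the data flow only, assuming one partial dot product per invocation; it is not the PR's actual GLSL shader code:

```cpp
#include <cstdio>
#include <vector>

// Without subgroup instructions, every partial result goes through the
// shared-memory reduction, so its cost (and the barriers it needs) scales
// with the full workgroup size.
static float reduce_without_subgroup(const std::vector<float> & partials) {
    float total = 0.0f;
    for (float p : partials) {
        total += p;  // stands in for the shared-memory tree reduction over all threads
    }
    return total;
}

// With subgroupAdd, each subgroup first collapses its partials in registers,
// so only one value per subgroup reaches the shared-memory stage. For a
// 128-thread workgroup with 32-wide subgroups, that stage shrinks from 128
// values to 4.
static float reduce_with_subgroup(const std::vector<float> & partials, size_t subgroup_size) {
    float total = 0.0f;
    for (size_t base = 0; base < partials.size(); base += subgroup_size) {
        float sg_sum = 0.0f;  // stands in for subgroupAdd() within one subgroup
        for (size_t i = base; i < base + subgroup_size && i < partials.size(); ++i) {
            sg_sum += partials[i];
        }
        total += sg_sum;      // only these per-subgroup sums hit shared memory
    }
    return total;
}

int main() {
    std::vector<float> partials(128, 1.0f);  // e.g. one 128-thread workgroup
    std::printf("%f %f\n", reduce_without_subgroup(partials), reduce_with_subgroup(partials, 32));
    return 0;
}
```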
With this diff:

```diff
diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index e255f68c2..19459c737 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -2766,7 +2766,7 @@ static void ggml_vk_load_shaders(vk_device& device) {
uint32_t wg_size_subgroup16 = (w == DMMV_WG_SIZE_SUBGROUP) ? subgroup_size_16 : (subgroup_size_16 * 4);
uint32_t wg_size_subgroup = (w == DMMV_WG_SIZE_SUBGROUP) ? device->subgroup_size : (device->subgroup_size * 4);
- bool s = device->subgroup_add;
+ const bool s = device->subgroup_add && device->architecture != vk_device_architecture::AMD_GCN;
for (uint32_t i = 0; i < mul_mat_vec_max_cols; ++i) {
ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_vec_f32_f32[w][GGML_TYPE_F32 ][i], "mul_mat_vec_f32_f32_f32_"+std::to_string(w)+"_"+std::to_string(i+1), arr_dmmv_f32_f32_f32_len[s], arr_dmmv_f32_f32_f32_data[s], "main", 3, sizeof(vk_mat_vec_push_constants), {2, 1, 1}, {wg_size_subgroup, 2, i+1}, 1);
@@ -4406,7 +4406,7 @@ static vk_pipeline ggml_vk_get_dequantize_mul_mat_vec(ggml_backend_vk_context *
// heuristic to choose workgroup size
uint32_t dmmv_wg = DMMV_WG_SIZE_SUBGROUP;
- if (ctx->device->vendor_id == VK_VENDOR_ID_NVIDIA) {
+ if (ctx->device->vendor_id == VK_VENDOR_ID_NVIDIA || ctx->device->vendor_id == VK_VENDOR_ID_INTEL) {
// Prefer larger workgroups when M is small, to spread the work out more
// and keep more SMs busy.
// q6_k seems to prefer small workgroup size even for "medium" values of M.
```

Nvidia RTX 3090, Intel A770, AMD RX 6800 XT, AMD Radeon Pro VII: master vs. PR results.
|
Results for the 9070xt and a770 (PR and master). |
On my 470 there's barely any difference with subgroup adds turned on or off, but the Nvidia small-m fix makes my inference 10% slower when I forcibly enable it on my card. |
Thanks for all the testing. I've applied @0cc4m's diff. |
Can you rebase this? |
Also use subgroup instructions for (part of) the reduction when supported. Without this, the more expensive reductions would eat into the benefits of the larger workgroups.
Co-authored-by: 0cc4m <[email protected]>
Rebased. |
Hi everyone, not sure if others experience the same issue too, but this commit increases the time it takes me to compile ggml-vulkan.cpp from 3 minutes (at commit 19f4dec) to 14 minutes (at this commit). |
What is your build environment? It takes about 30s to build ggml-vulkan.cpp on MSVC (which I already consider too high) but I didn't see it get worse from this change. I've been wanting to split ggml_vk_load_shaders into its own file because heavy macro use tends to be pretty expensive to compile. But it always seems like a bad time (conflicts). |
Hi @jeffbolznv, I kinda figured it out, TLDR at the bottom. First, here's a simple way to repro this. Note that this just demonstrates exactly how to reproduce the issue; it's not a complete binary build. I am using

First, download and install the latest Vulkan SDK for Windows from https://sdk.lunarg.com/sdk/download/1.4.321.1/windows/vulkansdk-windows-X64-1.4.321.1.exe

Clone the llama.cpp project
Build the Vulkan shader generator binary (no issue here)
Build the Vulkan shaders, everything enabled (no issues here)
Verify we have vulkan precompiled shaders and now run the compile for
and...
Absolutely awful. Now we make just one simple change, removing
and we get
Wow, ten seconds, literally over 60 times faster build speed. I tried again with

Question: Are g++ compiler optimization flags really necessary for this target? I suppose I could just exclude them for normal builds.

TLDR: This happens when using any |
We need at least -O2. Can you try removing all the |
MUCH better. I replaced all
Result:
So we have gone from 12 minutes to 1 minute. It's still slightly slower than the previous commit, but this is a massive improvement. I think you have cracked the case.

Edit: I also did

So about 2 minutes is a fairer comparison. |
Cool, I'll make a change to simplify the strings. We used to require unique strings (names were used to look up pipelines) but now they're just informative. |
PR to shorten the strings: #15431 |
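For readers following along, a minimal sketch of the kind of simplification being described; this is illustrative only and not the actual change in #15431:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Costly pattern, as in the ggml_vk_create_pipeline calls in the diff above:
// each of the many pipeline names is assembled from std::string temporaries and
// std::to_string calls, which gives the optimizer a lot of inlined code to chew
// through at -O2/-O3.
static std::string pipeline_name_expensive(uint32_t w, uint32_t cols) {
    return "mul_mat_vec_f32_f32_f32_" + std::to_string(w) + "_" + std::to_string(cols);
}

// Cheaper direction: since the names are now only informative (no longer used to
// look up pipelines), a short fixed literal per pipeline family compiles quickly.
static const char * pipeline_name_cheap() {
    return "mul_mat_vec_f32_f32_f32";
}

int main() {
    std::printf("%s\n%s\n", pipeline_name_expensive(0, 1).c_str(), pipeline_name_cheap());
    return 0;
}
```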
Use larger workgroups for mul_mat_vec when m is small. Also use subgroup instructions for (part of) the reduction when supported. Without this, the more expensive reductions would eat into the benefits of the larger workgroups.

Many models have some matrices with small m that don't launch enough work to fill the GPU (particularly for larger GPUs). Using larger workgroups helps. I think ggml-cuda already (mostly?) uses 128 threads per CTA.

As currently written, non-NVIDIA GPUs continue to use the same workgroup size, but I'm happy to change it based on perf testing. This change does make it so other GPUs will use subgroupAdd if it's supported, so hopefully that works everywhere and is not a slowdown, but it's easy enough to change if not.

Here's a comparison from GGML_VK_PERF_LOGGER running gpt-oss-20b-mxfp4.gguf; the buckets with small m are significantly cheaper.
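As a rough, self-contained sketch of the workgroup-size choice described above; the threshold and the DMMV_WG_SIZE_LARGE name are assumptions for illustration, and the real heuristic (including the q6_k exception) lives in ggml_vk_get_dequantize_mul_mat_vec:

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative only: DMMV_WG_SIZE_LARGE is an assumed name for the larger
// variant, and the threshold below is made up; see the real heuristic in
// ggml_vk_get_dequantize_mul_mat_vec.
enum dmmv_wg_size { DMMV_WG_SIZE_SUBGROUP, DMMV_WG_SIZE_LARGE };

static uint32_t dmmv_workgroup_threads(uint32_t m, uint32_t subgroup_size, bool prefer_large_for_small_m) {
    // Small m launches too few workgroups to fill a large GPU, so spend more
    // threads per row; otherwise stay at one subgroup per workgroup.
    dmmv_wg_size wg = DMMV_WG_SIZE_SUBGROUP;
    if (prefer_large_for_small_m && m <= 8 /* illustrative threshold */) {
        wg = DMMV_WG_SIZE_LARGE;
    }
    // Mirrors the sizing visible in the diff earlier in the thread: the larger
    // variant is 4x the subgroup size (e.g. 128 threads for a 32-wide subgroup).
    return (wg == DMMV_WG_SIZE_SUBGROUP) ? subgroup_size : subgroup_size * 4;
}

int main() {
    std::printf("m=4:  %u threads\n", (unsigned) dmmv_workgroup_threads(4, 32, true));   // 128
    std::printf("m=64: %u threads\n", (unsigned) dmmv_workgroup_threads(64, 32, true));  // 32
    return 0;
}
```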