ggml-cuda: Vulkan direct conv 2D ported to CUDA #16088
Conversation
* Extra: reduces bank conflicts
@Green-Sky Can you check it?
sd.cpp relies exclusively on f16 kernels.
May I suggest using |
This would make things easy for me, yes. BTW, forgot to thank you @etasnadi for working on this :) ... even though we now have 3 competing PRs, more or less.
I'll close my PR. This one is way better :)
Same, closing mine too.
Maybe you can add a parallel PR based on this for f16?
I’m new to CUDA, but I’d love to give this a shot @Green-Sky @etasnadi. If the fp16 isn’t super urgent, I can take a crack at it in the next week or two.
Can you give me a list of what parts of the code you changed relative to the Vulkan version, if any? Some things like fastdiv and how to retrieve the SM count have equivalents in the CUDA backend. But if this is just a copy-paste of the Vulkan code I would preferably change as little as possible.
Can you give any reference for doing fastdiv/sm_count() the proper ggml-cuda way? I will refactor then. Only the necessary things are changed compared to Vulkan, but they are significant.
IMO it already changes as little as possible compared to Vulkan.
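For reference, a minimal, generic sketch of the magic-multiplier fastdiv trick being discussed, plus the plain-CUDA way to query the SM count. The names and exact scheme here are illustrative; ggml-cuda's own helpers (and the Vulkan shader's variant) may differ.
```cpp
#include <cstdint>
#include <cuda_runtime.h>

// Precomputed values for dividing by a fixed divisor d >= 1:
// L = ceil(log2(d)), mp = floor(2^32 * (2^L - d) / d) + 1.
struct fastdiv_t {
    uint32_t mp;
    uint32_t L;
};

// Host-side precomputation, done once per divisor.
static fastdiv_t make_fastdiv(uint32_t d) {
    uint32_t L = 0;
    while (L < 32 && (uint64_t(1) << L) < d) {
        L++;
    }
    const uint32_t mp = (uint32_t)(((uint64_t(1) << 32) * ((uint64_t(1) << L) - d)) / d + 1);
    return {mp, L};
}

// Device-side division: n / d == (umulhi(n, mp) + n) >> L.
// The addition is done in 64 bits so the carry is not lost for large n.
static __device__ __forceinline__ uint32_t fastdiv(uint32_t n, fastdiv_t fd) {
    const uint32_t hi = __umulhi(n, fd.mp);
    return (uint32_t)(((uint64_t) hi + n) >> fd.L);
}

// The SM count is normally queried once on the host, e.g.:
//     int nsm = 0;
//     cudaDeviceGetAttribute(&nsm, cudaDevAttrMultiProcessorCount, device_id);
```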
For a |
@bssrdf Do you want to contribute the ggml-cuda-conformant fastdiv as a patch to my branch, or in a separate PR, so everyone gets authorship for their conv2d effort?
@etasnadi, I can give it a try. Will do a patch on your branch.
In the f16 PR by @bssrdf, we found that sd.cpp crashes with this PR. I double-checked by forcing sd.cpp to use f32 for the kernel without @bssrdf's PR.
@Green-Sky, it may be due to my changes. I'll investigate.
I redid the test without your changes, and the issue was the same, as I state right there.
@Green-Sky, without my change, it will fall back to using the slow direct version. Did it even fail there?
I patched sd.cpp to cast the kernel to f32, so it would fall back:
- x = ggml_conv_2d_direct(ctx, w, x, s0, s1, p0, p1, d0, d1);
+ x = ggml_conv_2d_direct(ctx, ggml_cast(ctx, w, GGML_TYPE_F32), x, s0, s1, p0, p1, d0, d1);
I guess I should have made my report more wordy (:
Edit: fun side note, it seems like with the current naive fallback and f32 kernel, sd.cpp VAE decode is ever so slightly faster (~35s vs ~33s).
I am adding this because the current conv2d algorithm (#15635) seems to underutilize the GPU -- the Vulkan version (#14316, #14933) is 8-10 times faster on my device. Additionally, the Tensor Cores extension (#15813) of the previous algorithm also seems to be slower than this.
There is another CUDA conv2d proposal that could be related #15805.
Furthermore, this version introduces a bank-conflict reduction that has not been added to the Vulkan version yet. It seems to be effective on large problems, so I expect this version to be even more efficient than the Vulkan backend.
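As a generic illustration of the kind of bank-conflict avoidance meant here (padding a shared-memory tile so that column accesses land in different banks), a textbook sketch follows; the actual scheme used by this kernel may differ (e.g. swizzled indexing).
```cpp
#define TILE_DIM 32

// Textbook padded-tile transpose, launched with a 32x32 thread block.
// Without the "+ 1", threads of a warp reading down a tile column would all
// hit the same shared-memory bank and serialize.
__global__ void transpose_padded(float * __restrict__ dst, const float * __restrict__ src, int n) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < n && y < n) {
        tile[threadIdx.y][threadIdx.x] = src[y * n + x];
    }
    __syncthreads();

    // Write the transposed tile; reads go down a tile column, which the
    // padding spreads across different banks.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < n && y < n) {
        dst[y * n + x] = tile[threadIdx.x][threadIdx.y];
    }
}
```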
I do not support f16 yet; a future contribution might add that. Currently this algorithm is used for f32 inputs; otherwise it falls back to the previous implementation.
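A rough sketch of the dispatch described above; the function name and structure are illustrative rather than the PR's actual code, and it assumes GGML_CUDA_USE_LEGACY_CONV is read as an environment variable, which may not match how the PR wires it.
```cpp
#include <cstdlib>
#include "ggml.h"

// Illustrative only: decide whether the new direct conv2d kernel can be used.
static bool conv2d_use_direct_kernel(const ggml_tensor * kernel, const ggml_tensor * input) {
    // Hypothetical opt-out switch; the real PR may implement this differently.
    if (std::getenv("GGML_CUDA_USE_LEGACY_CONV") != nullptr) {
        return false;
    }
    // The new kernel currently handles f32 only; anything else falls back
    // to the previous implementation.
    return kernel->type == GGML_TYPE_F32 && input->type == GGML_TYPE_F32;
}
```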
GGML_CUDA_USE_LEGACY_CONV forces the use of the previous (probably slower) implementation.
Perf of previous on RTX 2060:
Perf of proposed: