# Thoughts on OpenGL/CUDA interop #16

*Replies: 1 comment*

---
So, I did implement this, and the results really speak for themselves. I'll post a new Twitter video soon (after I get instant previews working), but the "stickiness" of the panning-around interaction during previews in Blender is instantly remedied. Check out these excerpts:

- The new `CPURenderBuffer` makes use of page-locked memory via …
- This copies data from GPU to CPU on a background thread, and prepares it for use by OpenGL: …
- Now, in OpenGL we can call a …

Unfortunately, we are doing a round trip from GPU -> CPU -> GPU, and there are bandwidth limitations to this process. However, the user experience is much better than with … This could be further improved by only copying data for regions that have changed since the last frame. And beyond this, hopefully Vulkan will let us just write data into a texture directly with CUDA...
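Since the code excerpts above did not survive as links, here is a rough sketch of the shape of the approach — the member names and the final GL call are stand-ins, not the actual code in the repo:

```cpp
// Rough sketch only: member names and the GL upload call are assumptions,
// not the actual CPURenderBuffer code.
#include <cuda_runtime.h>

struct CPURenderBuffer {
    float* rgba = nullptr; // page-locked ("pinned") host memory
    size_t n_bytes = 0;

    void allocate(size_t bytes) {
        n_bytes = bytes;
        // cudaMallocHost returns page-locked memory, which is what allows
        // cudaMemcpyAsync to run as a true asynchronous DMA transfer.
        cudaMallocHost(reinterpret_cast<void**>(&rgba), bytes);
    }

    // Runs on a background thread with its own stream, so the wait here
    // never blocks Blender's UI thread.
    void copy_from_device(const float* device_rgba, cudaStream_t stream) {
        cudaMemcpyAsync(rgba, device_rgba, n_bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
    }

    ~CPURenderBuffer() { cudaFreeHost(rgba); }
};

// Later, on the OpenGL/UI thread, the pinned pointer can be uploaded with
// something like (assumed -- the actual call in the excerpt was elided):
//   glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_RGBA, GL_FLOAT, buf.rgba);
```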
---
Currently, in the `open_for_cuda_access` method of `OpenGLRenderSurface`, we can see the CUDA/OpenGL interop happening. Notably, these lines run once every time `draw()` is called by the custom Blender Render Engine:

https://github.com/JamesPerlman/NeRFRenderCore/blob/050ddf33f5577d6a25e9c23c2a994699b0dd080b/src/render-targets/opengl-render-surface.cuh#L102-L107
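For anyone unfamiliar, the per-frame pattern at that permalink is presumably close to the standard one; a sketch with illustrative names (not the exact code at the link):

```cpp
#include <cuda_runtime.h>

// The texture is registered once at startup, e.g. via
// cudaGraphicsGLRegisterImage(&gl_resource, tex_id, GL_TEXTURE_2D,
//                             cudaGraphicsRegisterFlagsWriteDiscard);
cudaGraphicsResource_t gl_resource;

void copy_render_to_gl_texture(const float* device_rgba,
                               int width, int height, cudaStream_t stream) {
    // This is the call that blocks: CUDA waits for the GL driver to
    // release the resource before the map completes.
    cudaGraphicsMapResources(1, &gl_resource, stream);

    cudaArray_t array;
    cudaGraphicsSubResourceGetMappedArray(&array, gl_resource, 0, 0);

    const size_t row_bytes = width * 4 * sizeof(float); // RGBA32F rows
    cudaMemcpy2DToArrayAsync(array, 0, 0, device_rgba, row_bytes,
                             row_bytes, height,
                             cudaMemcpyDeviceToDevice, stream);

    // Unmapping hands the resource back to OpenGL (and synchronizes again).
    cudaGraphicsUnmapResources(1, &gl_resource, stream);
}
```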
I have noticed that these specific calls to `cudaGraphicsMapResources` and `cudaGraphicsUnmapResources` cause noticeable lag, which is partially alleviated by calling `glFlush()` and `glFinish()` beforehand, but it is still a rather obnoxious thing that I'd like to get rid of. This is consistent with the CUDA docs, which note that mapping synchronizes with outstanding graphics work: any GL calls issued before `cudaGraphicsMapResources()` must complete before subsequent CUDA work in the stream begins.

Perhaps this is just a limitation of CUDA/OpenGL interoperability, perhaps I've set up my OpenGL buffers wrong, or perhaps CUDA is simply waiting on Blender to finish some OpenGL work before the buffer can be mapped. There are alternative approaches I have tried, like setting up a surface to write to the texture directly (still laggy), and approaches I have not tried yet, like rendering into an offscreen renderbuffer. I have also observed that the smaller the buffer, the lower the lag, though I'm not sure what conclusions to draw from that.
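For reference, the surface-object variant looks roughly like this — a sketch with assumed names, not my exact code. It assumes the texture is a surface-compatible format like `GL_RGBA32F`, registered with `cudaGraphicsRegisterFlagsSurfaceLoadStore`, and it still pays the same map/unmap cost every frame:

```cpp
#include <cuda_runtime.h>

// Kernel that writes straight into the mapped GL texture. A real renderer
// would compute the NeRF output here instead of a placeholder color.
__global__ void write_pixels_kernel(cudaSurfaceObject_t surf, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    float4 rgba = make_float4(1.0f, 0.0f, 1.0f, 1.0f); // placeholder
    surf2Dwrite(rgba, surf, x * (int)sizeof(float4), y); // x offset is in bytes
}

// Assumes the texture's cudaArray_t was obtained between the map/unmap
// calls, exactly as in the snippet above.
void render_into_mapped_texture(cudaArray_t mapped_array,
                                int width, int height, cudaStream_t stream) {
    cudaResourceDesc desc = {};
    desc.resType = cudaResourceTypeArray;
    desc.res.array.array = mapped_array;

    cudaSurfaceObject_t surf;
    cudaCreateSurfaceObject(&surf, &desc);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    write_pixels_kernel<<<grid, block, 0, stream>>>(surf, width, height);

    cudaDestroySurfaceObject(surf);
}
```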
Anyway, I have some ideas and I'm going to use this thread as a journal. Please feel free to pitch in if you have any ideas or experience with this.
My intention in using CUDA/OpenGL interop was to avoid doing a round trip from the GPU to the CPU and back to the GPU. I assumed that CUDA could simply blit the data buffer into another buffer managed by OpenGL, keeping all the data on the GPU. Perhaps it can, through some obscure and undocumented mechanism, but with Blender deprecating OpenGL soon, I might not need to worry about that for much longer.
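One candidate for keeping everything on the GPU is the pixel-buffer-object route: register a GL PBO with CUDA, blit device-to-device into it, and let GL do the PBO-to-texture transfer. The calls below are standard CUDA interop API, but I haven't verified that this avoids the stall, since it still goes through the same map/unmap pair. A sketch:

```cpp
#include <cuda_gl_interop.h> // cudaGraphicsGLRegisterBuffer lives here

// Registered once at startup from an existing GL pixel buffer object:
//   cudaGraphicsGLRegisterBuffer(&pbo_resource, pbo_id,
//                                cudaGraphicsRegisterFlagsWriteDiscard);
cudaGraphicsResource_t pbo_resource;

void blit_into_pbo(const float* device_rgba, size_t n_bytes, cudaStream_t stream) {
    cudaGraphicsMapResources(1, &pbo_resource, stream);

    void* mapped_ptr = nullptr;
    size_t mapped_size = 0;
    cudaGraphicsResourceGetMappedPointer(&mapped_ptr, &mapped_size, pbo_resource);

    // Device-to-device: the pixels never leave the GPU.
    cudaMemcpyAsync(mapped_ptr, device_rgba, n_bytes,
                    cudaMemcpyDeviceToDevice, stream);

    cudaGraphicsUnmapResources(1, &pbo_resource, stream);

    // OpenGL then copies PBO -> texture entirely on the GPU:
    //   glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo_id);
    //   glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_RGBA, GL_FLOAT, nullptr);
}
```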
For now, since we can't write to an OpenGL buffer from a background thread, there is a potential approach that may work: copy the data from CUDA to the host on a background thread, and when the data is ready, call `tag_redraw()` to trigger OpenGL to draw and reupload the image to the GPU. There should be no lag from CUDA, and OpenGL should take care of the copying without noticeably locking the UI like it currently does.

The downside is that this is inefficient from an architectural standpoint: it does a round trip from GPU to CPU and back to GPU, which seems like it should be entirely avoidable. But I just don't know how much effort I want to spend learning the finer points of OpenGL/CUDA interop when OpenGL is being deprecated faster than it takes to train a NeRF. The upside is that it will probably work, and it's a stopgap measure to reduce visual lag in Blender previews until Vulkan becomes available to use instead.
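A minimal sketch of that handoff, with hypothetical names (the `tag_redraw()` call itself happens on the Blender Python side, so it appears only as a comment here):

```cpp
#include <atomic>
#include <cuda_runtime.h>
#include <GL/gl.h>

std::atomic<bool> frame_ready{false};
float* pinned_rgba;  // page-locked host memory from cudaMallocHost
size_t n_bytes;

// Background thread: copy device -> pinned host, then ask for a redraw.
void background_copy(const float* device_rgba, cudaStream_t stream) {
    cudaMemcpyAsync(pinned_rgba, device_rgba, n_bytes,
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream); // blocks this thread only, not the UI
    frame_ready.store(true);
    // ...then notify Blender, which calls tag_redraw() on the Python side.
}

// UI thread, inside draw(): upload only when a fresh frame is waiting.
void maybe_upload(GLuint texture, int width, int height) {
    if (!frame_ready.exchange(false)) return;
    glBindTexture(GL_TEXTURE_2D, texture);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_FLOAT, pinned_rgba);
}
```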