CPU fallback: do color-conversion on GPU. #992
Conversation
As can be seen above, we are now using swscale in BetaCudaInterface.cpp to do the YUV -> NV12 conversion. So I had to extract all the swscale logic out of CpuDeviceInterface.cpp and put it into FFmpegCommon.cpp so it can be reused across the two interfaces. Almost everything below this comment can be ignored and treated as copy/pasting code around. Just pay attention to the test at the bottom.
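For illustration, the shared swscale logic looks roughly like the sketch below. This is a hedged sketch of the general approach, not the actual helper in FFmpegCommon.cpp; the function name convertToNV12 and the error handling are made up for this example.

```cpp
extern "C" {
#include <libavutil/frame.h>
#include <libavutil/pixfmt.h>
#include <libswscale/swscale.h>
}

// Sketch only: convert a decoded YUV frame into an NV12 frame with swscale,
// the same kind of conversion the CPU fallback now does before handing the
// frame to the GPU. Names here are hypothetical.
AVFrame* convertToNV12(const AVFrame* src) {
  AVFrame* dst = av_frame_alloc();
  if (dst == nullptr) {
    return nullptr;
  }
  dst->format = AV_PIX_FMT_NV12;
  dst->width = src->width;
  dst->height = src->height;
  if (av_frame_get_buffer(dst, /*align=*/0) < 0) {
    av_frame_free(&dst);
    return nullptr;
  }

  SwsContext* swsCtx = sws_getContext(
      src->width, src->height, static_cast<AVPixelFormat>(src->format),
      dst->width, dst->height, AV_PIX_FMT_NV12,
      SWS_BILINEAR, nullptr, nullptr, nullptr);
  if (swsCtx == nullptr) {
    av_frame_free(&dst);
    return nullptr;
  }
  sws_scale(
      swsCtx, src->data, src->linesize, 0, src->height,
      dst->data, dst->linesize);
  sws_freeContext(swsCtx);
  return dst;
}
```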
test/test_decoders.py (Outdated)
beta_frame = beta_dec.get_frame_at(0)

torch.testing.assert_close(ffmpeg.data, beta.data, rtol=0, atol=0)
assert psnr(ref_frames.data.cpu(), beta_frame.data.cpu()) > 25
Does comparing frames on GPU vs CPU change floating point precision, or is there some other reason to move the frames here?
There will be a small difference due to floating point precision, but that wasn't the reason.
In fact... there was no good reason to call .cpu(); it was probably left over from my previous write_png debugging. I removed it, thanks for catching it!
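For context (my own addition, not from the PR), the psnr helper in that assertion presumably corresponds to the standard peak signal-to-noise ratio:

$$\mathrm{PSNR}(x, y) = 10 \log_{10}\!\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}(x, y)}\right), \qquad \mathrm{MSE}(x, y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - y_i)^2$$

with $\mathrm{MAX}_I = 255$ for uint8 frames. Asserting a PSNR above 25 dB tolerates the small numerical differences between the CPU and GPU conversion paths while still requiring the frames to be close.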
UniqueAVFrame nv12CpuFrame(av_frame_alloc());
TORCH_CHECK(nv12CpuFrame != nullptr, "Failed to allocate NV12 CPU frame");

nv12CpuFrame->format = AV_PIX_FMT_NV12;
Is it accurate to say that NV12 is very similar to AV_PIX_FMT_YUV420P (it uses YUV and has 4:2:0 chroma subsampling), but we use NV12 here because that is the format the NPP library requires? As explained in this comment.
Yes that's exactly right. NV12 would contain the exact same values as AV_PIX_FMT_YUV420P, just ordered a bit differently.
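To make the layout difference concrete, here's a rough sketch (not code from this PR; the function name and signature are made up). YUV420P keeps U and V in two separate half-resolution planes, while NV12 keeps the same Y plane and interleaves the U and V samples into a single chroma plane.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch only: build the NV12 interleaved chroma plane from YUV420P's
// separate U and V planes. Both formats carry the same sample values; only
// the memory layout differs (U0 V0 U1 V1 ... instead of two planes).
// Real code would also respect each plane's linesize/padding.
void interleaveUV(
    const uint8_t* uPlane, // YUV420P U plane, (width/2) * (height/2) samples
    const uint8_t* vPlane, // YUV420P V plane, same size
    uint8_t* uvPlane,      // NV12 chroma plane, twice that size
    size_t numChromaSamples) {
  for (size_t i = 0; i < numChromaSamples; ++i) {
    uvPlane[2 * i] = uPlane[i];     // U sample first
    uvPlane[2 * i + 1] = vPlane[i]; // then the matching V sample
  }
}
```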
Towards #943
This PR speeds up the BETA CUDA interface when we need to fall back on the CPU. The idea is simple: we do the color conversion on the GPU instead of doing it on the CPU. This has two benefits:
Before: [diagram of the old fallback pipeline]
Now: [diagram of the new fallback pipeline]
This is for the BETA interface only. I'll handle the FFmpeg interface as a follow-up.
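For illustration, the GPU-side NV12 -> RGB conversion with NPP looks roughly like the sketch below. This is a hedged sketch: the wrapper name and parameters are made up, and the actual code in BetaCudaInterface.cpp may use a different NPP variant (e.g. a BT.709 color-space one) and proper error handling.

```cpp
#include <nppi.h>

// Sketch only: convert an NV12 frame that has been uploaded to the GPU into
// packed RGB with NPP. pSrc holds device pointers to the Y plane and the
// interleaved UV plane; both planes share the same pitch in NV12.
void nv12ToRgbOnGpu(
    const Npp8u* yPlane,  // device pointer to the Y plane
    const Npp8u* uvPlane, // device pointer to the interleaved UV plane
    int srcPitch,         // pitch (in bytes) of both source planes
    Npp8u* rgbOut,        // device pointer to the packed RGB output
    int rgbPitch,         // pitch (in bytes) of the output
    int width,
    int height) {
  const Npp8u* pSrc[2] = {yPlane, uvPlane};
  NppiSize roi = {width, height};
  NppStatus status =
      nppiNV12ToRGB_8u_P2C3R(pSrc, srcPitch, rgbOut, rgbPitch, roi);
  (void)status; // real code would check for NPP_SUCCESS
}
```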
Benchmarks
Benchmarks show a 1.6X speed-up on 1080p frames. We should expect larger speed-ups for larger resolutions. I used this snippet.
It's impossible to benchmark the new and old strategies in a single run, since switching between them requires recompilation. I also had to modify our code to force the CPU fallback to be activated on videos that would otherwise not fall back (basically, I just changed if (!nativeNVDECSupport(codecContext))... to if (true)).