Skip to content

cuda.bindings: Add test_nvrtcGetLoweredName_failure #499

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Mar 7, 2025

Conversation

rwgk
Copy link
Collaborator

@rwgk rwgk commented Mar 7, 2025

This test was first added while working on PR #497, but moved to this separate PR, so that we could focus on the Segmentation Fault in isolation.

It turns out that only CTK 12.0 does not handle invalid prog arguments gracefully, newer CTKs do. The approach taken here is to simply skip the new test for CTK < 12.1. This will be useful as a stress-test for future CTK releases.

See #499 (comment) for more details.

Copy link
Contributor

copy-pr-bot bot commented Mar 7, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk
Copy link
Collaborator Author

rwgk commented Mar 7, 2025

/ok to test

This comment has been minimized.

@rwgk
Copy link
Collaborator Author

rwgk commented Mar 7, 2025

/ok to test

@rwgk
Copy link
Collaborator Author

rwgk commented Mar 7, 2025

All jobs with Segmentation faults are running with CUDA 12.0.1:

After unpacking logs_35364236774.zip (https://github.com/NVIDIA/cuda-python/actions/runs/13725931628/)

$ grep 'Segmentation fault' *.txt | cut -d: -f1 | uniq
23_Test__linux-64__Python_3.10__CUDA_12.0.1__Runner_default__local_CTK____test.txt
27_Test__linux-64__Python_3.11__CUDA_12.0.1__Runner_default__local_CTK____test.txt
32_Test__linux-64__Python_3.12__CUDA_12.0.1__Runner_default__local_CTK____test.txt
38_Test__linux-64__Python_3.13__CUDA_12.0.1__Runner_default__local_CTK____test.txt
44_Test__linux-64__Python_3.9__CUDA_12.0.1__Runner_default__local_CTK____test.txt
47_Test__linux-aarch64__Python_3.10__CUDA_12.0.1__Runner_default__local_CTK____test.txt
53_Test__linux-aarch64__Python_3.11__CUDA_12.0.1__Runner_default__local_CTK____test.txt
58_Test__linux-aarch64__Python_3.12__CUDA_12.0.1__Runner_default__local_CTK____test.txt
63_Test__linux-aarch64__Python_3.13__CUDA_12.0.1__Runner_default__local_CTK____test.txt
67_Test__linux-aarch64__Python_3.9__CUDA_12.0.1__Runner_default__local_CTK____test.txt

EDIT: Observed under #502: Segmentation faults with CUDA 11.8.0

@rwgk
Copy link
Collaborator Author

rwgk commented Mar 7, 2025

All 10 failing jobs segfault here (test_nvrtc.py):

    err, name = nvrtc.nvrtcGetLoweredName(None, b"I'm an elevated name!")       
2025-03-07T17:45:10.4215079Z tests/test_nvrtc.py::test_nvrtcGetLoweredName_failure
2025-03-07T17:45:10.4215919Z Fatal Python error: Segmentation fault             
...
2025-03-07T17:45:10.4242031Z LOOOK nvrtcGetLoweredName ENTRY                    
2025-03-07T17:45:10.4242791Z LOOOK nvrtcGetLoweredName prog is None             
2025-03-07T17:45:10.4243909Z LOOOK nvrtcGetLoweredName BEFORE cynvrtc.nvrtcGetLoweredName(cyprog, name_expression, &lowered_name)
+++ b/cuda_bindings/cuda/bindings/nvrtc.pyx.in
@@ -859,17 +859,25 @@ def nvrtcGetLoweredName(prog, char* name_expression):
     --------
     nvrtcAddNameExpression
     """
+    print("\nLOOOK nvrtcGetLoweredName ENTRY", flush=True)
     cdef cynvrtc.nvrtcProgram cyprog
     if prog is None:
+        print("LOOOK nvrtcGetLoweredName prog is None", flush=True)
         cyprog = <cynvrtc.nvrtcProgram><void_ptr>0
     elif isinstance(prog, (nvrtcProgram,)):
+        print("LOOOK nvrtcGetLoweredName isinstance(prog, (nvrtcProgram,))", flush=True)
         pprog = int(prog)
+        print("LOOOK nvrtcGetLoweredName AFTER int(prog)", flush=True)
         cyprog = <cynvrtc.nvrtcProgram><void_ptr>pprog
     else:
+        print("LOOOK nvrtcGetLoweredName else", flush=True)
         pprog = int(nvrtcProgram(prog))
+        print("LOOOK nvrtcGetLoweredName AFTER int(nvrtcProgram(prog))", flush=True)
         cyprog = <cynvrtc.nvrtcProgram><void_ptr>pprog
     cdef const char* lowered_name = NULL
+    print("LOOOK nvrtcGetLoweredName BEFORE cynvrtc.nvrtcGetLoweredName(cyprog, name_expression, &lowered_name)", flush=True)
     err = cynvrtc.nvrtcGetLoweredName(cyprog, name_expression, &lowered_name)
+    print("LOOOK nvrtcGetLoweredName RETURN\n", flush=True)
     return (nvrtcResult(err), <bytes>lowered_name if lowered_name != NULL else None)

@rwgk rwgk force-pushed the fix_nvrtc_segfault branch from feb0a4a to 73d8f4a Compare March 7, 2025 20:22
@rwgk
Copy link
Collaborator Author

rwgk commented Mar 7, 2025

/ok to test

@vzhurba01
Copy link
Collaborator

While debugging this there are two things that stood out:

  1. The segfault happens after we've called into the Cython layer. In that layer all the data is already packaged and the only action taken is the call into the NVRTC library.
  2. When looking at the header differences between between 12.8 (passing) and 12.0 (failing) the API signature and types are identical... but the documentation has minor differences.

https://github.com/NVIDIA/cuda-python/blob/v12.0.0/cuda/nvrtc.pyx.in#L784
https://github.com/NVIDIA/cuda-python/blob/v12.8.0/cuda_bindings/cuda/bindings/nvrtc.pyx.in#L808

In 12.8.0 the API nvrtcAddNameExpression() states that it can return NVRTC_ERROR_INVALID_PROGRAM, but this line is missing in 12.0.0. This difference reflects what we're seeing with nvrtc.nvrtcGetLoweredName() where the error check has been fixed but documentation does not reflect it.

From offline discussion, having us skip this test on 12.0 would allow us to continue stress testing new releases.

@rwgk rwgk changed the title WIP/debugging: nvrtc.nvrtcGetLoweredName() segfault cuda.bindings: Add test_nvrtcGetLoweredName_failure Mar 7, 2025
@rwgk rwgk marked this pull request as ready for review March 7, 2025 20:33
Copy link
Contributor

copy-pr-bot bot commented Mar 7, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk rwgk merged commit 96d3ba0 into NVIDIA:main Mar 7, 2025
73 checks passed
Copy link

github-actions bot commented Mar 7, 2025

Doc Preview CI
Preview removed because the pull request was closed or merged.

rwgk added a commit to rwgk/cuda-python that referenced this pull request Mar 8, 2025
@rwgk rwgk mentioned this pull request Mar 8, 2025
@leofang leofang added the bug Something isn't working label Mar 8, 2025
@leofang leofang added P0 High priority - Must do! test Improvements or additions to tests cuda.bindings Everything related to the cuda.bindings module labels Mar 8, 2025
@leofang leofang added this to the cuda-python 12-next, 11-next milestone Mar 8, 2025
rwgk added a commit that referenced this pull request Mar 8, 2025
* Backport #497

* Backport #499

* Remove @pytest.mark.skipif(nvrtcVersionLessThan(12, 1), ...)

* Revert "Remove @pytest.mark.skipif(nvrtcVersionLessThan(12, 1), ...)"

This reverts commit 41160a8.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuda.bindings Everything related to the cuda.bindings module P0 High priority - Must do! test Improvements or additions to tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants