Skip to content

cuda.bindings: Fix segfault when converting char* NULL to bytes #497

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Mar 7, 2025

Conversation

rwgk
Copy link
Collaborator

@rwgk rwgk commented Mar 6, 2025

The segfault was discovered while working on PR #458.

Full test coverage for changed code.

Copy link
Contributor

copy-pr-bot bot commented Mar 6, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk
Copy link
Collaborator Author

rwgk commented Mar 6, 2025

/ok to test

This comment has been minimized.

@leofang leofang requested a review from vzhurba01 March 6, 2025 22:02
@rwgk
Copy link
Collaborator Author

rwgk commented Mar 6, 2025

@leofang The two test failures are these flakes:

>           assert delay_seconds * 1000 <= elapsed_time_ms < delay_seconds * 1000 + 2  # tolerance 2 ms
E           assert 503.1598205566406 < ((0.5 * 1000) + 2)
>           assert delay_seconds * 1000 <= elapsed_time_ms < delay_seconds * 1000 + 2  # tolerance 2 ms
E           assert 505.0583190917969 < ((0.5 * 1000) + 2)

@vzhurba01
Copy link
Collaborator

Changes LGTM at the moment but I see that this is still a draft

@rwgk
Copy link
Collaborator Author

rwgk commented Mar 6, 2025

Changes LGTM at the moment

Thanks!

but I see that this is still a draft

I'm looking into adding more tests, to ideally cover all changes.

And we need to do something about the flakes, even though they are unrelated.

Observed failures:

```
>           assert delay_seconds * 1000 <= elapsed_time_ms < delay_seconds * 1000 + 2  # tolerance 2 ms
E           assert 503.1598205566406 < ((0.5 * 1000) + 2)
```

```
>           assert delay_seconds * 1000 <= elapsed_time_ms < delay_seconds * 1000 + 2  # tolerance 2 ms
E           assert 505.0583190917969 < ((0.5 * 1000) + 2)
```
@rwgk
Copy link
Collaborator Author

rwgk commented Mar 6, 2025

@leofang I piggy-backed commit 9dd2630 here to resolve the problem with the flaky tests.

@rwgk
Copy link
Collaborator Author

rwgk commented Mar 6, 2025

/ok to test

@rwgk rwgk marked this pull request as ready for review March 6, 2025 23:46
Copy link
Contributor

copy-pr-bot bot commented Mar 6, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk rwgk requested a review from leofang March 6, 2025 23:47
@leofang leofang added bug Something isn't working P0 High priority - Must do! cuda.bindings Everything related to the cuda.bindings module to-be-backported Trigger the bot to raise a backport PR upon merge labels Mar 6, 2025
@leofang leofang added this to the cuda-python 12-next, 11-next milestone Mar 6, 2025
@rwgk
Copy link
Collaborator Author

rwgk commented Mar 7, 2025

Woah, looks like I'm good at generating segfaults ...

Fatal Python error: Segmentation fault

Current thread 0x0000ffff9f213020 (most recent call first):
  File "/__w/cuda-python/cuda-python/cuda_bindings/tests/test_nvrtc.py", line 35 in test_nvrtcGetLoweredName_failure
...
  File "/opt/hostedtoolcache/Python/3.9.21/arm64/lib/python3.9/site-packages/_pytest/config/__init__.py", line 201 in console_main
  File "/opt/hostedtoolcache/Python/3.9.21/arm64/bin/pytest", line 8 in <module>
/__w/_temp/6c5a43e8-eaeb-4c41-8067-c06dcfabc91e.sh: line 12:  4545 Segmentation fault      (core dumped) pytest -rxXs -v tests/
tests/test_nvrtc.py::test_nvrtcGetLoweredName_failure 

@leofang should I just remove that new test and leave that for another PR? I don't think I'll get to the bottom of the newly discovered segfault today. Tomorrow pretty sure.

@leofang
Copy link
Member

leofang commented Mar 7, 2025

Just isolate out the cuda.core changes from those for bindings. Let's merge core-related things first; bindings can wait.

@rwgk
Copy link
Collaborator Author

rwgk commented Mar 7, 2025

/ok to test

@rwgk
Copy link
Collaborator Author

rwgk commented Mar 7, 2025

The cuda.core change is to resolve the flakes. (Without it I'd (maybe) have to hit rerun a few times to deflake.)

I just backed out the test that triggers the segfaults: commit a39720b

@leofang
Copy link
Member

leofang commented Mar 7, 2025

The cuda.core change is to resolve the flakes

Ah, I meant we can separate it out and merge it first. Mainly if we want to auto-backport the fix, the cuda-11 branch does not have any cuda.core code, so the apto-backport would fail. I am curious if it could work without any cuda-core changes involved.

rwgk added a commit to rwgk/cuda-python that referenced this pull request Mar 7, 2025
rwgk added a commit to rwgk/cuda-python that referenced this pull request Mar 7, 2025
kkraus14
kkraus14 previously approved these changes Mar 7, 2025
Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rwgk rwgk dismissed stale reviews from kkraus14 and leofang via 7814469 March 7, 2025 16:24
@rwgk
Copy link
Collaborator Author

rwgk commented Mar 7, 2025

/ok to test

@leofang
Copy link
Member

leofang commented Mar 7, 2025

@rwgk could you update the commit message and avoid @someone? Last time you did that I ended up receiving a ton of notifications whenever someone rebased a branch that contains this commit lol

(I noticed this because of a notification 😂)

@rwgk rwgk force-pushed the fix_segfault_char_ptr_to_bytes branch from 7814469 to ab6db22 Compare March 7, 2025 18:05
@rwgk
Copy link
Collaborator Author

rwgk commented Mar 7, 2025

@rwgk could you update the commit message and avoid @someone? Last time you did that I ended up receiving a ton of notifications whenever someone rebased a branch that contains this commit lol

Oh ... sorry, done.

(I did this a lot in the pybind11 repo, wanting to give proper credit. I didn't realize this can lead to spammy notifications.)

@leofang
Copy link
Member

leofang commented Mar 7, 2025

/ok to test

@leofang leofang enabled auto-merge March 7, 2025 18:09
@rwgk
Copy link
Collaborator Author

rwgk commented Mar 7, 2025

@leofang The tests passed previously with the exact same code. The only change was to the commit message. I.e. admin merge would be ideal.

@leofang
Copy link
Member

leofang commented Mar 7, 2025

ok

@leofang leofang disabled auto-merge March 7, 2025 18:18
@leofang leofang merged commit 82df864 into NVIDIA:main Mar 7, 2025
16 checks passed
Copy link

github-actions bot commented Mar 7, 2025

Backport failed because this pull request contains merge commits. You can either backport this pull request manually, or configure the action to skip merge commits.

@rwgk
Copy link
Collaborator Author

rwgk commented Mar 7, 2025

Backport failed because this pull request contains merge commits. You can either backport this pull request manually, or configure the action to skip merge commits.

Should I take care of that? (Happy to.)

@rwgk rwgk deleted the fix_segfault_char_ptr_to_bytes branch March 7, 2025 18:29
Copy link

github-actions bot commented Mar 7, 2025

Doc Preview CI
Preview removed because the pull request was closed or merged.

@leofang
Copy link
Member

leofang commented Mar 7, 2025

Should I take care of that? (Happy to.)

Yes, please. My bad in updating your branch, sorry...

@rwgk
Copy link
Collaborator Author

rwgk commented Mar 7, 2025

I'll backport this together with #499, after that is merged.

rwgk added a commit that referenced this pull request Mar 7, 2025
* PR #497 squash-merged

* Bring back test_nvrtcGetLoweredName_failure() (it was originally under PR #497).

* Add code for debugging

* test_all_CUresult_codes(): max_code = int(max(cuda.CUresult)) as suggested by at-leofang

* Change pytest options, mostly to disable output capturing (of both stdout and stderr)

* Undo debugging changes in nvrtc.pyx.in

* Revert "Change pytest options, mostly to disable output capturing (of both stdout and stderr)"

This reverts commit b0464e7.

* Skip new test if nvrtc version < 12.1
rwgk added a commit to rwgk/cuda-python that referenced this pull request Mar 8, 2025
@rwgk rwgk mentioned this pull request Mar 8, 2025
rwgk added a commit that referenced this pull request Mar 8, 2025
* Backport #497

* Backport #499

* Remove @pytest.mark.skipif(nvrtcVersionLessThan(12, 1), ...)

* Revert "Remove @pytest.mark.skipif(nvrtcVersionLessThan(12, 1), ...)"

This reverts commit 41160a8.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuda.bindings Everything related to the cuda.bindings module P0 High priority - Must do! to-be-backported Trigger the bot to raise a backport PR upon merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants