[TIR] Add `T.thread_return()` for early thread exit in CUDA kernels #18134

Hzfengsy · 2025-07-11T11:41:23Z

This commit adds `T.thread_return()`, which lets a thread exit a CUDA kernel early. It is useful when some threads should skip the rest of the kernel body, for example based on their thread index or another runtime condition.

Key changes:

- Add `thread_return` builtin in TIR
- Implement CUDA codegen for `thread_return`
- Add Python bindings for `T.thread_return()`
- Update the TIR IR builder to support `thread_return`
- Add tests demonstrating `thread_return` usage

Example usage:

```python
from tvm.script import tir as T


@T.prim_func
def main(A: T.Buffer((16, 16), "float32"), B: T.Buffer((16, 16), "float32")):
    for i in T.thread_binding(16, thread="blockIdx.x"):
        for j in T.thread_binding(32, thread="threadIdx.x"):
            if j >= 16:
                T.thread_return()  # Early exit for threads with j >= 16
            B[i, j] = A[i, j]
```

The generated CUDA code is:

```cuda
extern "C" __global__ void __launch_bounds__(32) main_kernel(float* __restrict__ A, float* __restrict__ B) {
  if (16 <= ((int)threadIdx.x)) {
    return;
  }
  B[((((int)blockIdx.x) * 16) + ((int)threadIdx.x))] = A[((((int)blockIdx.x) * 16) + ((int)threadIdx.x))];
}
```
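To make the control flow of the generated kernel concrete, here is a pure-Python emulation of `main_kernel` (not TVM code and not part of this PR). It assumes a launch of 16 blocks of 32 threads, matching the `T.thread_binding` extents in the example, and runs the threads one after another instead of concurrently. The helper names and the `early_exit` flag are hypothetical, added only for illustration. Setting `early_exit=False` shows what the unguarded body would do: threads with `threadIdx.x >= 16` compute the flat index `blockIdx.x * 16 + threadIdx.x`, which lands in the next row of `B`, and in the last block runs past the end of the 256-element buffer.

```python
# Pure-Python emulation of the generated main_kernel; it needs no GPU or TVM.
# Assumed launch shape: 16 blocks x 32 threads, as in the TVMScript example.
GRID_DIM = 16   # extent of blockIdx.x
BLOCK_DIM = 32  # extent of threadIdx.x (matches __launch_bounds__(32))
ROW = 16        # columns per row of the 16x16 buffers A and B


def kernel_thread(block_idx, thread_idx, A, B, early_exit=True):
    """Run one thread's body and report what it did (hypothetical helper)."""
    if early_exit and 16 <= thread_idx:
        return "exited"  # mirrors the `return;` emitted for T.thread_return()
    flat = block_idx * ROW + thread_idx  # B[blockIdx.x * 16 + threadIdx.x]
    if flat >= len(B):
        return "out_of_bounds"  # a real GPU would write past the allocation
    B[flat] = A[flat]
    return "stored"


def launch(A, early_exit=True):
    """Run every (block, thread) pair sequentially and tally the outcomes."""
    B = [0.0] * len(A)
    outcomes = {"exited": 0, "stored": 0, "out_of_bounds": 0}
    for block_idx in range(GRID_DIM):
        for thread_idx in range(BLOCK_DIM):
            outcomes[kernel_thread(block_idx, thread_idx, A, B, early_exit)] += 1
    return B, outcomes


A = [float(k) for k in range(16 * 16)]
B, guarded = launch(A)
_, unguarded = launch(A, early_exit=False)
print(B == A, guarded)  # True {'exited': 256, 'stored': 256, 'out_of_bounds': 0}
print(unguarded)        # {'exited': 0, 'stored': 496, 'out_of_bounds': 16}
```

With the early exit, 256 threads return immediately and the other 256 each copy exactly one element, so `B` ends up equal to `A`. Without it, 240 extra stores overwrite the following row (harmless here only because a copy writes the same value) and 16 stores would fall outside the buffer.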


Hzfengsy · 2025-07-11T11:41:37Z

cc @LeiWang1999


tqchen approved these changes Jul 14, 2025


tqchen merged commit ea4369c into apache:main Jul 14, 2025
13 checks passed

Hzfengsy deleted the thread_return branch September 16, 2025 14:09

ysh329 mentioned this pull request Oct 24, 2025 in [Release] v0.22.0 Release Candidate Notes #18391 (Closed)
