Conversation

@AlexMaclean Member

Our current support for barrier intrinsics is confusing and incomplete: multiple intrinsics map to the same instruction, and the intrinsic names do not clearly convey their semantics. Further, we lack support for some variants. This change unifies the IR representation into a single, consistently named set of intrinsics.

  • llvm.nvvm.barrier.cta.sync.aligned.all(i32)
  • llvm.nvvm.barrier.cta.sync.aligned(i32, i32)
  • llvm.nvvm.barrier.cta.arrive.aligned(i32, i32)
  • llvm.nvvm.barrier.cta.sync.all(i32)
  • llvm.nvvm.barrier.cta.sync(i32, i32)
  • llvm.nvvm.barrier.cta.arrive(i32, i32)

The following Auto-Upgrade rules are used to maintain compatibility with IR using the legacy intrinsics:

  • llvm.nvvm.barrier0 --> llvm.nvvm.barrier.cta.sync.aligned.all(0)
  • llvm.nvvm.barrier.n --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
  • llvm.nvvm.bar.sync --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
  • llvm.nvvm.barrier --> llvm.nvvm.barrier.cta.sync.aligned(x, y)
  • llvm.nvvm.barrier.sync --> llvm.nvvm.barrier.cta.sync.all(x)
  • llvm.nvvm.barrier.sync.cnt --> llvm.nvvm.barrier.cta.sync(x, y)
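
For example, the auto-upgrader rewrites calls to the legacy intrinsics in place. A minimal before/after sketch in LLVM IR (the function name is illustrative):

  ; Before auto-upgrade: a module using the legacy intrinsic.
  declare void @llvm.nvvm.barrier0()

  define void @kernel_body() {
    call void @llvm.nvvm.barrier0()
    ret void
  }

  ; After auto-upgrade: the call is replaced per the first rule above.
  declare void @llvm.nvvm.barrier.cta.sync.aligned.all(i32)

  define void @kernel_body() {
    call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 0)
    ret void
  }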

@llvmbot Member

llvmbot commented May 19, 2025

@llvm/pr-subscribers-mlir
@llvm/pr-subscribers-mlir-llvm
@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-clang
@llvm/pr-subscribers-llvm-ir
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-nvptx

Author: Alex MacLean (AlexMaclean)

Patch is 27.12 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/140615.diff

10 Files Affected:

  • (modified) llvm/docs/NVPTXUsage.rst (+41-4)
  • (modified) llvm/include/llvm/IR/IntrinsicsNVVM.td (+18-19)
  • (modified) llvm/lib/IR/AutoUpgrade.cpp (+18)
  • (modified) llvm/lib/Target/NVPTX/NVPTXInstrInfo.td (+28)
  • (modified) llvm/lib/Target/NVPTX/NVPTXIntrinsics.td (+37-34)
  • (modified) llvm/lib/Transforms/IPO/AttributorAttributes.cpp (+2-1)
  • (modified) llvm/test/Assembler/auto_upgrade_nvvm_intrinsics.ll (+22)
  • (modified) llvm/test/CodeGen/NVPTX/barrier.ll (+128-25)
  • (modified) llvm/test/CodeGen/NVPTX/named-barriers.ll (+19-17)
  • (modified) llvm/test/CodeGen/NVPTX/noduplicate-syncthreads.ll (+3-3)
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index 51bbfd0a5c88d..7e703f31775a1 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -199,7 +199,7 @@ map in the following way to CUDA builtins:
 Barriers
 --------
 
-'``llvm.nvvm.barrier0``'
+'``llvm.nvvm.barrier.cta.*``'
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Syntax:
@@ -207,13 +207,50 @@ Syntax:
 
 .. code-block:: llvm
 
-  declare void @llvm.nvvm.barrier0()
+  declare void @llvm.nvvm.barrier.cta.sync(i32 %id, i32 %n)
+  declare void @llvm.nvvm.barrier.cta.sync.all(i32 %id)
+  declare void @llvm.nvvm.barrier.cta.arrive(i32 %id, i32 %n)
+
+  declare void @llvm.nvvm.barrier.cta.sync.aligned(i32 %id, i32 %n)
+  declare void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 %id)
+  declare void @llvm.nvvm.barrier.cta.arrive.aligned(i32 %id, i32 %n)
 
 Overview:
 """""""""
 
-The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
-instruction, equivalent to the ``__syncthreads()`` call in CUDA.
+The '``@llvm.nvvm.barrier.cta.*``' family of intrinsics performs barrier
+synchronization and communication among the threads within a CTA.
+
+Semantics:
+""""""""""
+
+Operand %id specifies a logical barrier resource and must fall within the range
+0 through 15. When present, operand %n specifies the number of threads
+participating in the barrier. When specifying a thread count, the value must be
+a multiple of the warp size. With the '``@llvm.nvvm.barrier.cta.sync.*``'
+variants, the '``.all``' suffix indicates that all threads in the CTA should
+participate in the barrier and the %n operand is not present.
+
+All forms of the '``@llvm.nvvm.barrier.cta.*``' intrinsic cause the executing
+thread to wait for all non-exited threads from its warp and then mark the
+warp's arrival at the barrier. In addition to signaling its arrival at the
+barrier, the '``@llvm.nvvm.barrier.cta.sync.*``' intrinsics cause the executing
+thread to wait for non-exited threads of all other warps participating in the
+barrier to arrive. On the other hand, the '``@llvm.nvvm.barrier.cta.arrive.*``'
+intrinsic does not cause the executing thread to wait for threads of other
+participating warps.
+
+When a barrier completes, the waiting threads are restarted without delay,
+and the barrier is reinitialized so that it can be immediately reused.
+
+The '``@llvm.nvvm.barrier.cta.*``' intrinsic has an optional '``.aligned``'
+modifier to indicate textual alignment of the barrier. When specified, it
+indicates that all threads in the CTA will execute the same
+'``@llvm.nvvm.barrier.cta.*``' instruction. In conditionally executed code, an
+aligned '``@llvm.nvvm.barrier.cta.*``' instruction should only be used if it is
+known that all threads in the CTA evaluate the condition identically;
+otherwise, the behavior is undefined.
 
 Electing a thread
 -----------------
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index a95c739f1331d..f648815b06ab8 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -128,6 +128,12 @@
 //   * llvm.nvvm.swap.lo.hi.b64      --> llvm.fshl(x, x, 32)
 //   * llvm.nvvm.atomic.load.inc.32  --> atomicrmw uinc_wrap
 //   * llvm.nvvm.atomic.load.dec.32  --> atomicrmw udec_wrap
+// * llvm.nvvm.barrier0              --> llvm.nvvm.barrier.cta.sync.aligned.all(0)
+// * llvm.nvvm.barrier.n             --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
+// * llvm.nvvm.bar.sync              --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
+// * llvm.nvvm.barrier               --> llvm.nvvm.barrier.cta.sync.aligned(x, y)
+// * llvm.nvvm.barrier.sync          --> llvm.nvvm.barrier.cta.sync.all(x)
+// * llvm.nvvm.barrier.sync.cnt      --> llvm.nvvm.barrier.cta.sync(x, y)
 
 def llvm_global_ptr_ty  : LLVMQualPointerType<1>;         // (global)ptr
 def llvm_shared_ptr_ty  : LLVMQualPointerType<3>;         // (shared)ptr
@@ -1263,18 +1269,6 @@ let TargetPrefix = "nvvm" in {
   defm int_nvvm_atomic_cas_gen_i  : PTXAtomicWithScope3<llvm_anyint_ty>;
 
 // Bar.Sync
-
-  // The builtin for "bar.sync 0" is called __syncthreads.  Unlike most of the
-  // intrinsics in this file, this one is a user-facing API.
-  def int_nvvm_barrier0 : ClangBuiltin<"__syncthreads">,
-      Intrinsic<[], [], [IntrConvergent, IntrNoCallback]>;
-  // Synchronize all threads in the CTA at barrier 'n'.
-  def int_nvvm_barrier_n : ClangBuiltin<"__nvvm_bar_n">,
-      Intrinsic<[], [llvm_i32_ty], [IntrConvergent, IntrNoCallback]>;
-  // Synchronize 'm', a multiple of warp size, (arg 2) threads in
-  // the CTA at barrier 'n' (arg 1).
-  def int_nvvm_barrier : ClangBuiltin<"__nvvm_bar">,
-      Intrinsic<[], [llvm_i32_ty, llvm_i32_ty], [IntrConvergent, IntrNoCallback]>;
   def int_nvvm_barrier0_popc : ClangBuiltin<"__nvvm_bar0_popc">,
       Intrinsic<[llvm_i32_ty], [llvm_i32_ty], [IntrConvergent, IntrNoCallback]>;
   def int_nvvm_barrier0_and : ClangBuiltin<"__nvvm_bar0_and">,
@@ -1282,16 +1276,21 @@ let TargetPrefix = "nvvm" in {
   def int_nvvm_barrier0_or : ClangBuiltin<"__nvvm_bar0_or">,
       Intrinsic<[llvm_i32_ty], [llvm_i32_ty], [IntrConvergent, IntrNoCallback]>;
 
-  def int_nvvm_bar_sync : NVVMBuiltin,
-      Intrinsic<[], [llvm_i32_ty], [IntrConvergent, IntrNoCallback]>;
   def int_nvvm_bar_warp_sync : NVVMBuiltin,
       Intrinsic<[], [llvm_i32_ty], [IntrConvergent, IntrNoCallback]>;
 
-  // barrier.sync id[, cnt]
-  def int_nvvm_barrier_sync : NVVMBuiltin,
-      Intrinsic<[], [llvm_i32_ty], [IntrConvergent, IntrNoCallback]>;
-  def int_nvvm_barrier_sync_cnt : NVVMBuiltin,
-      Intrinsic<[], [llvm_i32_ty, llvm_i32_ty], [IntrConvergent, IntrNoCallback]>;
+  // barrier{.cta}.sync{.aligned}      a{, b};
+  // barrier{.cta}.arrive{.aligned}    a, b;
+  let IntrProperties = [IntrConvergent, IntrNoCallback] in {
+    foreach align = ["", "_aligned"] in {
+      def int_nvvm_barrier_cta_sync # align # _all :
+          Intrinsic<[], [llvm_i32_ty]>;
+      def int_nvvm_barrier_cta_sync # align :
+          Intrinsic<[], [llvm_i32_ty, llvm_i32_ty]>;
+      def int_nvvm_barrier_cta_arrive # align :
+          Intrinsic<[], [llvm_i32_ty, llvm_i32_ty]>;
+    }
+  }
 
   // barrier.cluster.[wait, arrive, arrive.relaxed]
   def int_nvvm_barrier_cluster_arrive :
diff --git a/llvm/lib/IR/AutoUpgrade.cpp b/llvm/lib/IR/AutoUpgrade.cpp
index 9091e7585f9d9..18f6f2bf9ed11 100644
--- a/llvm/lib/IR/AutoUpgrade.cpp
+++ b/llvm/lib/IR/AutoUpgrade.cpp
@@ -1349,6 +1349,10 @@ static bool upgradeIntrinsicFunction1(Function *F, Function *&NewFn,
       else if (Name == "clz.ll" || Name == "popc.ll" || Name == "h2f" ||
                Name == "swap.lo.hi.b64")
         Expand = true;
+      else if (Name == "barrier0" || Name == "barrier.n" ||
+               Name == "bar.sync" || Name == "barrier" ||
+               Name == "barrier.sync" || Name == "barrier.sync.cnt")
+        Expand = true;
       else if (Name.consume_front("max.") || Name.consume_front("min."))
         // nvvm.{min,max}.{i,ii,ui,ull}
         Expand = Name == "s" || Name == "i" || Name == "ll" || Name == "us" ||
@@ -2478,6 +2482,20 @@ static Value *upgradeNVVMIntrinsicCall(StringRef Name, CallBase *CI,
     MDNode *MD = MDNode::get(Builder.getContext(), {});
     LD->setMetadata(LLVMContext::MD_invariant_load, MD);
     return LD;
+  } else if (Name == "barrier0" || Name == "barrier.n" || Name == "bar.sync") {
+    Value *Arg =
+        Name.ends_with('0') ? Builder.getInt32(0) : CI->getArgOperand(0);
+    Rep = Builder.CreateIntrinsic(Intrinsic::nvvm_barrier_cta_sync_aligned_all,
+                                  {}, {Arg});
+  } else if (Name == "barrier") {
+    Rep = Builder.CreateIntrinsic(Intrinsic::nvvm_barrier_cta_sync_aligned, {},
+                                  {CI->getArgOperand(0), CI->getArgOperand(1)});
+  } else if (Name == "barrier.sync") {
+    Rep = Builder.CreateIntrinsic(Intrinsic::nvvm_barrier_cta_sync_all, {},
+                                  {CI->getArgOperand(0)});
+  } else if (Name == "barrier.sync.cnt") {
+    Rep = Builder.CreateIntrinsic(Intrinsic::nvvm_barrier_cta_sync, {},
+                                  {CI->getArgOperand(0), CI->getArgOperand(1)});
   } else {
     Intrinsic::ID IID = shouldUpgradeNVPTXBF16Intrinsic(Name);
     if (IID != Intrinsic::not_intrinsic &&
diff --git a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
index 2c65ee6d484d5..405f43af67d1d 100644
--- a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
+++ b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
@@ -240,6 +240,34 @@ def BF16RT   : RegTyInfo<bf16, Int16Regs, bf16imm, fpimm, supports_imm = 0>;
 def F16X2RT  : RegTyInfo<v2f16, Int32Regs, ?, ?, supports_imm = 0>;
 def BF16X2RT : RegTyInfo<v2bf16, Int32Regs, ?, ?, supports_imm = 0>;
 
+// This class provides a basic wrapper around an NVPTXInst that abstracts the
+// specific syntax of most PTX instructions. It automatically handles the
+// construction of the asm string based on the provided dag arguments.
+class BasicFlagsNVPTXInst<dag outs_dag, dag ins_dag, dag flags_dag, string asmstr,
+                          list<dag> pattern = []>
+  : NVPTXInst<
+      outs_dag,
+      !con(ins_dag, flags_dag),
+      !strconcat(
+        asmstr,
+        !if(!and(!empty(ins_dag), !empty(outs_dag)), "",
+          !strconcat(
+            " \t",
+            !interleave(
+              !foreach(i, !range(!size(outs_dag)),
+                "$" # !getdagname(outs_dag, i)),
+              "|"),
+            !if(!or(!empty(ins_dag), !empty(outs_dag)), "", ", "),
+            !interleave(
+              !foreach(i, !range(!size(ins_dag)),
+                "$" # !getdagname(ins_dag, i)),
+              ", "))),
+        ";"),
+      pattern>;
+
+class BasicNVPTXInst<dag outs, dag insv, string asmstr, list<dag> pattern = []>
+  : BasicFlagsNVPTXInst<outs, insv, (ins), asmstr, pattern>;
+
 
 multiclass I3Inst<string op_str, SDPatternOperator op_node, RegTyInfo t,
                   bit commutative, list<Predicate> requires = []> {
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index d3cfce76c666e..9ab0cbbb33681 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -67,15 +67,6 @@ class THREADMASK_INFO<bit sync> {
 // Synchronization and shuffle functions
 //-----------------------------------
 let isConvergent = true in {
-def INT_BARRIER0 : NVPTXInst<(outs), (ins),
-                  "bar.sync \t0;",
-      [(int_nvvm_barrier0)]>;
-def INT_BARRIERN : NVPTXInst<(outs), (ins Int32Regs:$src1),
-                  "bar.sync \t$src1;",
-      [(int_nvvm_barrier_n i32:$src1)]>;
-def INT_BARRIER : NVPTXInst<(outs), (ins Int32Regs:$src1, Int32Regs:$src2),
-                  "bar.sync \t$src1, $src2;",
-      [(int_nvvm_barrier i32:$src1, i32:$src2)]>;
 def INT_BARRIER0_POPC : NVPTXInst<(outs Int32Regs:$dst), (ins Int32Regs:$pred),
   !strconcat("{{ \n\t",
              ".reg .pred \t%p1; \n\t",
@@ -102,9 +93,6 @@ def INT_BARRIER0_OR : NVPTXInst<(outs Int32Regs:$dst), (ins Int32Regs:$pred),
              "}}"),
       [(set i32:$dst, (int_nvvm_barrier0_or i32:$pred))]>;
 
-def INT_BAR_SYNC : NVPTXInst<(outs), (ins i32imm:$i), "bar.sync \t$i;",
-                             [(int_nvvm_bar_sync imm:$i)]>;
-
 def INT_BAR_WARP_SYNC_I : NVPTXInst<(outs), (ins i32imm:$i), "bar.warp.sync \t$i;",
                              [(int_nvvm_bar_warp_sync imm:$i)]>,
         Requires<[hasPTX<60>, hasSM<30>]>;
@@ -112,29 +100,44 @@ def INT_BAR_WARP_SYNC_R : NVPTXInst<(outs), (ins Int32Regs:$i), "bar.warp.sync \
                              [(int_nvvm_bar_warp_sync i32:$i)]>,
         Requires<[hasPTX<60>, hasSM<30>]>;
 
-def INT_BARRIER_SYNC_I : NVPTXInst<(outs), (ins i32imm:$i), "barrier.sync \t$i;",
-                                   [(int_nvvm_barrier_sync imm:$i)]>,
-        Requires<[hasPTX<60>, hasSM<30>]>;
-def INT_BARRIER_SYNC_R : NVPTXInst<(outs), (ins Int32Regs:$i), "barrier.sync \t$i;",
-                                   [(int_nvvm_barrier_sync i32:$i)]>,
-        Requires<[hasPTX<60>, hasSM<30>]>;
+multiclass BARRIER1<string asmstr, Intrinsic intrinsic, list<Predicate> requires = []> {
+  def _i : BasicNVPTXInst<(outs), (ins i32imm:$i), asmstr,
+                          [(intrinsic imm:$i)]>,
+           Requires<requires>;
 
-def INT_BARRIER_SYNC_CNT_RR : NVPTXInst<(outs), (ins Int32Regs:$id, Int32Regs:$cnt),
-                 "barrier.sync \t$id, $cnt;",
-                 [(int_nvvm_barrier_sync_cnt i32:$id, i32:$cnt)]>,
-        Requires<[hasPTX<60>, hasSM<30>]>;
-def INT_BARRIER_SYNC_CNT_RI : NVPTXInst<(outs), (ins Int32Regs:$id, i32imm:$cnt),
-                 "barrier.sync \t$id, $cnt;",
-                 [(int_nvvm_barrier_sync_cnt i32:$id, imm:$cnt)]>,
-        Requires<[hasPTX<60>, hasSM<30>]>;
-def INT_BARRIER_SYNC_CNT_IR : NVPTXInst<(outs), (ins i32imm:$id, Int32Regs:$cnt),
-                 "barrier.sync \t$id, $cnt;",
-                 [(int_nvvm_barrier_sync_cnt imm:$id, i32:$cnt)]>,
-        Requires<[hasPTX<60>, hasSM<30>]>;
-def INT_BARRIER_SYNC_CNT_II : NVPTXInst<(outs), (ins i32imm:$id, i32imm:$cnt),
-                 "barrier.sync \t$id, $cnt;",
-                 [(int_nvvm_barrier_sync_cnt imm:$id, imm:$cnt)]>,
-        Requires<[hasPTX<60>, hasSM<30>]>;
+  def _r : BasicNVPTXInst<(outs), (ins Int32Regs:$i), asmstr,
+                          [(intrinsic i32:$i)]>,
+           Requires<requires>;
+}
+
+multiclass BARRIER2<string asmstr, Intrinsic intrinsic, list<Predicate> requires = []> {
+  def _rr : BasicNVPTXInst<(outs), (ins Int32Regs:$i, Int32Regs:$j), asmstr,
+                          [(intrinsic i32:$i, i32:$j)]>,
+            Requires<requires>;
+
+  def _ri : BasicNVPTXInst<(outs), (ins Int32Regs:$i, i32imm:$j), asmstr,
+                          [(intrinsic i32:$i, imm:$j)]>,
+            Requires<requires>;
+
+  def _ir : BasicNVPTXInst<(outs), (ins i32imm:$i, Int32Regs:$j), asmstr,
+                          [(intrinsic imm:$i, i32:$j)]>,
+            Requires<requires>;
+
+  def _ii : BasicNVPTXInst<(outs), (ins i32imm:$i, i32imm:$j), asmstr,
+                          [(intrinsic imm:$i, imm:$j)]>,
+            Requires<requires>;
+}
+
+// Note that the "bar.sync" variants could be renamed to the corresponding
+// "barrier.*.aligned" variants. We use the older syntax for compatibility with
+// older versions of the PTX ISA.
+defm BARRIER_CTA_SYNC_ALIGNED_ALL : BARRIER1<"bar.sync", int_nvvm_barrier_cta_sync_aligned_all>;
+defm BARRIER_CTA_SYNC_ALIGNED : BARRIER2<"bar.sync", int_nvvm_barrier_cta_sync_aligned>;
+defm BARRIER_CTA_ARRIVE_ALIGNED : BARRIER2<"bar.arrive", int_nvvm_barrier_cta_arrive_aligned>;
+
+defm BARRIER_CTA_SYNC_ALL : BARRIER1<"barrier.sync", int_nvvm_barrier_cta_sync_all, [hasPTX<60>]>;
+defm BARRIER_CTA_SYNC : BARRIER2<"barrier.sync", int_nvvm_barrier_cta_sync, [hasPTX<60>]>;
+defm BARRIER_CTA_ARRIVE : BARRIER2<"barrier.arrive", int_nvvm_barrier_cta_arrive, [hasPTX<60>]>;
 
 class INT_BARRIER_CLUSTER<string variant, Intrinsic Intr,
                           list<Predicate> Preds = [hasPTX<78>, hasSM<90>]>:
diff --git a/llvm/lib/Transforms/IPO/AttributorAttributes.cpp b/llvm/lib/Transforms/IPO/AttributorAttributes.cpp
index 8b843634600be..79d9b3da054b5 100644
--- a/llvm/lib/Transforms/IPO/AttributorAttributes.cpp
+++ b/llvm/lib/Transforms/IPO/AttributorAttributes.cpp
@@ -2150,7 +2150,8 @@ struct AANoUnwindCallSite final
 
 bool AANoSync::isAlignedBarrier(const CallBase &CB, bool ExecutedAligned) {
   switch (CB.getIntrinsicID()) {
-  case Intrinsic::nvvm_barrier0:
+  case Intrinsic::nvvm_barrier_cta_sync_aligned_all:
+  case Intrinsic::nvvm_barrier_cta_sync_aligned:
   case Intrinsic::nvvm_barrier0_and:
   case Intrinsic::nvvm_barrier0_or:
   case Intrinsic::nvvm_barrier0_popc:
diff --git a/llvm/test/Assembler/auto_upgrade_nvvm_intrinsics.ll b/llvm/test/Assembler/auto_upgrade_nvvm_intrinsics.ll
index 2bfa1c2dfba7a..e362ad88a8c0d 100644
--- a/llvm/test/Assembler/auto_upgrade_nvvm_intrinsics.ll
+++ b/llvm/test/Assembler/auto_upgrade_nvvm_intrinsics.ll
@@ -78,6 +78,13 @@ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.3d(ptr addrspace(3) %d,
 declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.4d(ptr addrspace(3) %d, ptr addrspace(3) %bar, ptr %tm, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i16 %im2col0, i16 %im2col1, i16 %mc, i64 %ch, i1 %f1, i1 %f2);
 declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.5d(ptr addrspace(3) %d, ptr addrspace(3) %bar, ptr %tm, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i16 %im2col0, i16 %im2col1, i16 %im2col2, i16 %mc, i64 %ch, i1 %f1, i1 %f2);
 
+declare void @llvm.nvvm.barrier0()
+declare void @llvm.nvvm.barrier.n(i32)
+declare void @llvm.nvvm.bar.sync(i32)
+declare void @llvm.nvvm.barrier(i32, i32)
+declare void @llvm.nvvm.barrier.sync(i32)
+declare void @llvm.nvvm.barrier.sync.cnt(i32, i32)
+
 ; CHECK-LABEL: @simple_upgrade
 define void @simple_upgrade(i32 %a, i64 %b, i16 %c) {
 ; CHECK: call i32 @llvm.bitreverse.i32(i32 %a)
@@ -324,3 +331,18 @@ define void @nvvm_cp_async_bulk_tensor_g2s_tile(ptr addrspace(3) %d, ptr addrspa
   ret void
 }
 
+define void @cta_barriers(i32 %x, i32 %y) {
+; CHECK: call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 0)
+; CHECK: call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 %x)
+; CHECK: call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 %x)
+; CHECK: call void @llvm.nvvm.barrier.cta.sync.aligned(i32 %x, i32 %y)
+; CHECK: call void @llvm.nvvm.barrier.cta.sync.all(i32 %x)
+; CHECK: call void @llvm.nvvm.barrier.cta.sync(i32 %x, i32 %y)
+  call void @llvm.nvvm.barrier0()
+  call void @llvm.nvvm.barrier.n(i32 %x)
+  call void @llvm.nvvm.bar.sync(i32 %x)
+  call void @llvm.nvvm.barrier(i32 %x, i32 %y)
+  call void @llvm.nvvm.barrier.sync(i32 %x)
+  call void @llvm.nvvm.barrier.sync.cnt(i32 %x, i32 %y)
+  ret void
+}
diff --git a/llvm/test/CodeGen/NVPTX/barrier.ll b/llvm/test/CodeGen/NVPTX/barrier.ll
index 05bdc9087f572..75db99b7f49dd 100644
--- a/llvm/test/CodeGen/NVPTX/barrier.ll
+++ b/llvm/test/CodeGen/NVPTX/barrier.ll
@@ -1,33 +1,136 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
 ; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_30 -mattr=+ptx60 | FileCheck %s
 ; RUN: %if ptxas %{ llc < %s -mtriple=nvptx64 -mcpu=sm_30 -mattr=+ptx60 | %ptxas-verify %}
 
 declare void @llvm.nvvm.bar.warp.sync(i32)
-declare void @llvm.nvvm.barrier.sync(i32)
-declare void @llvm.nvvm.barrier.sync.cnt(i32, i32)
-
-; CHECK-LABEL: .func{{.*}}barrier_sync
-define void @barrier_sync(i32 %id, i32 %cnt) {
-  ; CHECK: ld.param.b32 	[[ID:%r[0-9]+]], [barrier_sync_param_0];
-  ; CHECK: ld.param.b32 	[[CNT:%r[0-9]+]], [barrier_sync_param_1];
-
-  ; CHECK:  barrier.sync [[ID]], [[CNT]];
-  call void @llvm.nvvm.barrier.sync.cnt(i32 %id, i32 %cnt)
-  ; CHECK:  barrier.sync [[ID]], 32;
-  call void @llvm.nvvm.barrier.sync.cnt(i32 %id, i32 32)
-  ; CHECK:  barrier.sync 3, [[CNT]];
-  call void @llvm.nvvm.barrier.sync.cnt(i32 3, i32 %cnt)
-  ; CHECK:  barrier.sync 4, 64;
-  call void @llvm.nvvm.barrier.sync.cnt(i32 4, i32 64)
-
-  ; CHECK: barrier.sync [[ID]];
-  call void @llvm.nvvm.barrier.sync(i32 %id)
-  ; CHECK: barrier.sync 1;
-  call void @llvm.nvvm.barrier.sync(i32 1)
-
-  ; CHECK: bar.warp.sync [[ID]];
+declare void @llvm.nvvm.barrier.cta.sync.aligned.all(i32)
+declare void @llvm.nvvm.barrier.cta.sync.aligned(i32, i32)
+declare void @llvm.nvvm.barrier.cta.sync.all(i32)
+declare void @llvm.nvvm.barrier.cta.sync(i32, i32)
+declare void @llvm.nvvm.barrier.cta.arrive(i32, i32)
+declare void @llvm.nvvm.barrier.cta.arrive.aligned(i32, i32)
+
+define void @barrier_warp_sync(i32 %id) {
+; CHECK-LABEL: barrier_warp_sync(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b32 %r<2>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.b32 %r1, [barrier_warp_sync_param_0];
+; CHECK-NEXT:    bar.warp.sync %r1;
+; CHECK-NEXT:    bar.warp.sync 6;
+; CHECK-NEXT:    ret;
   call void @llvm.nvvm.bar.warp.sync(i32 %id)
-  ; CHECK: bar.warp.sync 6;
   call void @llvm.nvvm.bar.warp.sync(i32 6)
-  ret void;
+  ret void
+}
+
+define void @barrier_cta_sync_aligned_all(i32 %id) {
+; CHECK-LABEL: barrier_cta_sync_aligned_all(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b32 %r<2>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.b32 %r1, ...
[truncated]

@Artem-B Member left a comment

Nice. I like the new naming scheme.

LGTM w/ a few nits.

@AlexMaclean AlexMaclean requested a review from grypp as a code owner May 19, 2025 23:03
@llvmbot llvmbot added the clang, clang:codegen, mlir:llvm, mlir, llvm:analysis, and clang:openmp labels May 19, 2025
// (ins Int32Regs:$a, Int32Regs:$b, Hexu32imm:$c),
// (ins PrmtMode:$mode),
// "prmt.b32${mode}">;
// ---> "prmt.b32${mode} \t$dst, $a, $b, $c;"
Contributor:
I think you meant $d here, for the output value

Member Author:
Fixed!

'``@llvm.nvvm.barrier.cta.*``' instruction. In conditionally executed code, an
aligned '``@llvm.nvvm.barrier.cta.*``' instruction should only be used if it is
known that all threads in the CTA evaluate the condition identically, otherwise
behavior is undefined.
Contributor:
Shall we add a link to the PTX ISA here?

Member Author:
I think the PTX instruction used to lower the intrinsic is an implementation detail. It should not be necessary to link to it if the semantics of the intrinsic are fully defined.
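
For illustration, a minimal LLVM IR sketch of the aligned-barrier constraint quoted above (the function and condition names are hypothetical):

  declare void @llvm.nvvm.barrier.cta.sync.aligned.all(i32)

  ; Safe only when %cond is known to evaluate identically across every
  ; thread in the CTA; with a divergent %cond, the behavior is undefined.
  define void @guarded_sync(i1 %cond) {
  entry:
    br i1 %cond, label %do.sync, label %exit
  do.sync:
    call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 0)
    br label %exit
  exit:
    ret void
  }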

Comment on lines +465 to +471
def NVVM_Barrier0Op : NVVM_Op<"barrier0"> {
  let assemblyFormat = "attr-dict";
  string llvmBuilder = [{
    createIntrinsicCall(
        builder, llvm::Intrinsic::nvvm_barrier_cta_sync_aligned_all,
        {builder.getInt32(0)});
  }];
Member:
you can remove this op completely actually.

Contributor:
+1.

Member Author:
I agree this op can be completely removed but doing so is not trivial. I'd prefer to leave this for a subsequent change. CC @durga4github

Contributor:
Sure, Alex. I will take care of this in a separate change.

//===----------------------------------------------------------------------===//

-def NVVM_Barrier0Op : NVVM_IntrOp<"barrier0"> {
+def NVVM_Barrier0Op : NVVM_Op<"barrier0"> {
Member:
do you need to change NVVM_IntrOp->NVVM_Op?

Member Author:
Yep. Otherwise, references to the no-longer-present Intrinsic::barrier0 get generated by this record, such as in NVVMConvertibleLLVMIRIntrinsics.inc.

Comment on lines 74 to 80
; CHECK-LABEL: @llvm_nvvm_barrier0()
define void @llvm_nvvm_barrier0() {
; CHECK: nvvm.barrier0
call void @llvm.nvvm.barrier0()
ret void
}

Member:
Test removal here. Is it an accident?

Member Author:
I've added this test back.

  } else {
    createIntrinsicCall(builder, llvm::Intrinsic::nvvm_barrier0);
  }
  auto id = $barrierId ? $barrierId : builder.getInt32(0);
@grypp Member commented May 20, 2025:
LLVM code doesn't use auto when the type isn't obvious. Let's use the type

Member Author:
Fixed

@AlexMaclean AlexMaclean merged commit 735209c into llvm:main May 21, 2025
12 checks passed
clementval added a commit that referenced this pull request May 21, 2025
The simple form of `Barrier0Op` is available in the NVVM dialect. It needs to
be used instead of the string version since
#140615
AlexMaclean added a commit that referenced this pull request May 22, 2025
@mcourteaux

This merge broke our builds on Halide.

Unhandled exception: Error: Could not find PTX barrier intrinsic (llvm.nvvm.barrier0)

We have an .ll file declaring these intrinsics:

declare void @llvm.nvvm.barrier0()

So, it seems the auto-upgrade mechanism doesn't work. Any ideas?

@AlexMaclean Member Author

> This merge broke our builds on Halide.
>
> Unhandled exception: Error: Could not find PTX barrier intrinsic (llvm.nvvm.barrier0)
>
> We have an .ll file declaring these intrinsics:
>
> declare void @llvm.nvvm.barrier0()
>
> So, it seems the auto-upgrade mechanism doesn't work. Any ideas?

@mcourteaux it looks to me like the error you're seeing is coming from here:

https://github.com/halide/Halide/blob/85a3b07fab4ce07a9747d645dd1274c3f1c29b44/src/CodeGen_PTX_Dev.cpp#L265-L267

I reverted this change, but I plan to reland soon. Once I do, you'll need to update this code to reference the new intrinsic.

@mcourteaux commented May 22, 2025

Hi @AlexMaclean! Thanks for the quick reply. I didn't realize we tried to obtain it through this API (I suspected this went through the LLVM .ll files). The auto-upgrade system doesn't seem to work for the llvm::Module::getFunction() call. No need to revert: our goal is to keep up to date with LLVM, and Halide pulls LLVM once per day. Thanks for pointing out where our problem was. You can undo the revert. Thanks a lot!

Update: Halide is patched: https://github.com/halide/Halide/pull/8631/files Feel free to recommit this PR! 😄

AlexMaclean added a commit to AlexMaclean/llvm-project that referenced this pull request May 22, 2025
Our current intrinsic support for barrier intrinsics is confusing and
incomplete, with multiple intrinsics mapping to the same instruction and
intrinsic names not clearly conveying intrinsic semantics. Further, we
lack support for some variants. This change unifies the IR
representation to a single consistently named set of intrinsics.

- llvm.nvvm.barrier.cta.sync.aligned.all(i32)
- llvm.nvvm.barrier.cta.sync.aligned(i32, i32)
- llvm.nvvm.barrier.cta.arrive.aligned(i32, i32)
- llvm.nvvm.barrier.cta.sync.all(i32)
- llvm.nvvm.barrier.cta.sync(i32, i32)
- llvm.nvvm.barrier.cta.arrive(i32, i32)

The following Auto-Upgrade rules are used to maintain compatibility with
IR using the legacy intrinsics:

* llvm.nvvm.barrier0 --> llvm.nvvm.barrier.cta.sync.aligned.all(0)
* llvm.nvvm.barrier.n --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
* llvm.nvvm.bar.sync --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
* llvm.nvvm.barrier --> llvm.nvvm.barrier.cta.sync.aligned(x, y)
* llvm.nvvm.barrier.sync --> llvm.nvvm.barrier.cta.sync.all(x)
* llvm.nvvm.barrier.sync.cnt --> llvm.nvvm.barrier.cta.sync(x, y)
AlexMaclean added a commit that referenced this pull request May 23, 2025
…41143)

Note: This relands #140615 adding a ".count" suffix to the non-".all"
variants.

Our current intrinsic support for barrier intrinsics is confusing and
incomplete, with multiple intrinsics mapping to the same instruction and
intrinsic names not clearly conveying intrinsic semantics. Further, we
lack support for some variants. This change unifies the IR
representation to a single consistently named set of intrinsics.

- llvm.nvvm.barrier.cta.sync.aligned.all(i32)
- llvm.nvvm.barrier.cta.sync.aligned.count(i32, i32)
- llvm.nvvm.barrier.cta.arrive.aligned.count(i32, i32)
- llvm.nvvm.barrier.cta.sync.all(i32)
- llvm.nvvm.barrier.cta.sync.count(i32, i32)
- llvm.nvvm.barrier.cta.arrive.count(i32, i32)

The following Auto-Upgrade rules are used to maintain compatibility with
IR using the legacy intrinsics:

* llvm.nvvm.barrier0 --> llvm.nvvm.barrier.cta.sync.aligned.all(0)
* llvm.nvvm.barrier.n --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
* llvm.nvvm.bar.sync --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
* llvm.nvvm.barrier --> llvm.nvvm.barrier.cta.sync.aligned.count(x, y)
* llvm.nvvm.barrier.sync --> llvm.nvvm.barrier.cta.sync.all(x)
* llvm.nvvm.barrier.sync.cnt --> llvm.nvvm.barrier.cta.sync.count(x, y)
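
A minimal LLVM IR sketch using the relanded names (the function is illustrative; per the documented semantics, the count must be a multiple of the warp size):

  declare void @llvm.nvvm.barrier.cta.sync.all(i32)
  declare void @llvm.nvvm.barrier.cta.sync.count(i32, i32)

  define void @sync_example(i32 %cnt) {
    ; All threads in the CTA synchronize at barrier 0.
    call void @llvm.nvvm.barrier.cta.sync.all(i32 0)
    ; %cnt threads (a multiple of the warp size) synchronize at barrier 1.
    call void @llvm.nvvm.barrier.cta.sync.count(i32 1, i32 %cnt)
    ret void
  }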
