
Conversation

@valassi (Member) commented Oct 6, 2025

Hi @oliviermattelaer, as discussed recently and as per my presentation at the MG5AMC meeting last Friday.

This is the PR for my kernel splitting changes that I recommend merging:

  • ihel1: helicity streams
  • ihel2: color sum kernels
  • ihel3: color sum in BLAS

I have prepared a paper, which will shortly be on arXiv, with all the details.

Until yesterday, it would have been possible to merge this automatically, as I had merged the latest upstream into my developments. Yesterday some new changes were merged (for tREX, I think), so this will need some massaging. I can do that later on, or let me know how you want to proceed.

Thanks, Andrea

Revert "[hack_ihel] first results for hack_ihel - quite confusing, some failures, some faster, some slower"
This reverts commit 717157c.
…functionally ok, faster for ggttg* ! (slower for ggtt and eemumu)
… code except gg_tt.sa, and also tput and CODEGEN, from last ancestor

This should limit the conflicts to fix to gg_tt.sa
I will then need to recreate CODEGEN from scratch, but I want to change gg_tt.sa first to make it closer to upstream/master
(in particular, split again sigmakin_getgoodhel into two separate implementations)

Note that all developments in hack_ihel in Nov2022 were made on gg_tt.sa (by AV) and on CODEGEN (by OM)

base=$(git merge-base upstream/master HEAD)                          # last common ancestor with upstream/master
dirs=$(git ls-tree --name-only HEAD *.mad *.sa | grep -v gg_tt.sa)   # all generated process directories except gg_tt.sa
\rm -rf $dirs
git checkout $base $dirs                                             # reset those directories to the merge base
git add $dirs
git checkout $base tput                                              # also reset the tput (throughput test) directory
git checkout $base CODEGEN                                           # and the CODEGEN directory

Note that the only differences to $base are in gg_tt.sa:

git diff $base --name-only
epochX/cudacpp/gg_tt.sa/CODEGEN_cudacpp_gg_tt_log.txt
epochX/cudacpp/gg_tt.sa/SubProcesses/MatrixElementKernels.cc
epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc
epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.h
epochX/cudacpp/gg_tt.sa/src/mgOnGpuConfig.h
…goodhel CUDA/C++ implementations to ease the merging of upstream/master
…tgoodhel CUDA/C++ implementations to ease the merging of upstream/master
Fix conflicts:
	epochX/cudacpp/gg_tt.sa/CODEGEN_cudacpp_gg_tt_log.txt
	epochX/cudacpp/gg_tt.sa/SubProcesses/MatrixElementKernels.cc
	epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc
	epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.h
	epochX/cudacpp/gg_tt.sa/src/mgOnGpuConfig.h

NB the code does not build yet for CUDA - there are more fundamental issues to be fixed, e.g. in random color/helicity selection
The C++ builds, on the other hand, are ok and the tests are successful (FPTYPE=d,f,m)

--

Long description of changes added during the Nov 2022 hackathon
(differences between upstream/hack_ihel and the merge base with upstream/master)
I will also add something similar to CODEGEN

CPPProcess.h
- About sigmaKin_getGoodHel.
  This was a kernel (__global__), it is now a function inside which a kernel is launched.
  This had different signatures for C++ and CUDA (with/without nevt), it now has the same signature (with nevt).
- About sigmaKin.
  This was a kernel (__global__), it is now a function inside which a kernel is launched.
  The signature is still different for C++ (with nevt) and CUDA (but here ngpublocks and ngputhreads have been added).
- About normaliseoutput.
  This is a new kernel function (__global__), normalising allMEs from numerators and denominators.
  Note: for the moment this is only defined for CUDA.
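
As an illustration of the header-level pattern, here is a hedged sketch with simplified argument lists and an illustrative fptype alias (not the actual generated declarations, which carry many more buffers):

// Before: sigmaKin_getGoodHel and sigmaKin were __global__ kernels, with different
// signatures for C++ (extra nevt) and CUDA (no nevt).
// After: both are plain host functions that launch kernels internally; sigmaKin_getGoodHel
// now takes nevt in both back-ends, and the CUDA sigmaKin takes the grid size explicitly.
using fptype = double; // stand-in for mgOnGpu::fptype
#ifdef __CUDACC__
void sigmaKin_getGoodHel( const fptype* allmomenta, fptype* allMEs, bool* isGoodHel, const int nevt );
void sigmaKin( const fptype* allmomenta, fptype* allMEs, const int ngpublocks, const int ngputhreads );
__global__ void normaliseoutput( fptype* allMEs, const fptype* allNumerators, const fptype* allDenominators ); // CUDA only
#else
void sigmaKin_getGoodHel( const fptype* allmomenta, fptype* allMEs, bool* isGoodHel, const int nevt );
void sigmaKin( const fptype* allmomenta, fptype* allMEs, const int nevt );
#endif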

CPPProcess.cc
- About cGoodHel and cNGoodHel.
  For CUDA these are no longer in GPU constant memory (replaced by static variables as in C++).
  This is because the loop over helicities is now done by a C++ function calling kernels, not within a kernel.
- About calculate_wavefunctions.
  This was a __device__ function, it is now a kernel (__global__).
- About sigmaKin_getGoodHel.
  This was a kernel (__global__), it is now a function inside which a kernel is launched.
  This had different signatures for C++ and CUDA (with/without nevt), it now has the same signature (with nevt).
  This had two completely separate implementations for C++ and CUDA; now there is a single shared implementation with ifdefs inside.
  The CUDA implementation now calls calculate_wavefunctions as a kernel, no longer as a device function.
  The CUDA implementation no longer copies the list of good helicities to the GPU constant memory.
- About sigmaKin.
  This was a kernel (__global__), it is now a function inside which a kernel is launched.
  The signature is still different for C++ (with nevt) and CUDA (but here ngpublocks and ngputhreads have been added).
  The CUDA implementation now calls calculate_wavefunctions as a kernel, no longer as a device function.
  The CUDA implementation now normalises allMEs from numerators and denominators through the normaliseoutput kernel.
  Note: for the moment the C++ implementation still normalises allMEs directly inside calculate_wavefunctions.
- About normaliseoutput.
  This is a new kernel function (__global__), normalising allMEs from numerators and denominators.
  Note: for the moment this is only defined for CUDA.
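
A minimal sketch of this new control flow (illustrative only, not the actual generated code: the real sigmaKin passes many more buffers and handles helicity/color selection and multi-channel bookkeeping):

// The helicity loop is now on the host: calculate_wavefunctions is a __global__ kernel
// launched once per good helicity, and the final normalisation is a separate kernel.
using fptype = double; // stand-in for mgOnGpu::fptype

static int cNGoodHel = 0;   // good-helicity bookkeeping is now in plain static host
static int cGoodHel[32];    // variables, no longer in __constant__ device memory

__global__ void calculate_wavefunctions( int ihel, const fptype* allmomenta, fptype* allMEs )
{
  // ... compute the contribution of helicity ihel to this event's ME and accumulate it in allMEs ...
}

__global__ void normaliseoutput( fptype* allMEs, const fptype* allNumerators, const fptype* allDenominators )
{
  // ... normalise the accumulated allMEs from the numerators and denominators ...
}

void sigmaKin( const fptype* allmomenta, fptype* allMEs,
               const fptype* allNumerators, const fptype* allDenominators,
               const int ngpublocks, const int ngputhreads )
{
  for( int ighel = 0; ighel < cNGoodHel; ighel++ )          // helicity loop on the host
  {
    const int ihel = cGoodHel[ighel];
    calculate_wavefunctions<<<ngpublocks, ngputhreads>>>( ihel, allmomenta, allMEs );
  }
  normaliseoutput<<<ngpublocks, ngputhreads>>>( allMEs, allNumerators, allDenominators );
}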

MatrixElementKernels.cc
- About MatrixElementKernelDevice::computeGoodHelicities.
  The devIsGoodHel data buffer is no longer needed (as helicities remain on the host).
  Here sigmaKin_getGoodHel is called as a function and no longer as a kernel.
- About MatrixElementKernelDevice::computeMatrixElements.
  Here sigmaKin is called as a function and no longer as a kernel.
  Note that MGONGPU_NSIGHT_DEBUG is no longer supported (well, it should now be handled inside sigmaKin).
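
For the caller side, a hedged sketch of what this change amounts to (member names are illustrative, not the actual MatrixElementKernels API; fptype and the sigmaKin signature are reused from the sketch above):

using fptype = double; // stand-in for mgOnGpu::fptype
void sigmaKin( const fptype* allmomenta, fptype* allMEs,
               const fptype* allNumerators, const fptype* allDenominators,
               const int ngpublocks, const int ngputhreads ); // as sketched above

struct MatrixElementKernelDeviceSketch
{
  const fptype* momenta;        // device buffers owned elsewhere
  fptype* matrixElements;
  const fptype* numerators;
  const fptype* denominators;
  int gpublocks, gputhreads;
  void computeMatrixElements()
  {
    // before: sigmaKin<<<gpublocks, gputhreads>>>( momenta, matrixElements, ... );
    // after: a plain host function call; the kernels are launched inside sigmaKin
    sigmaKin( momenta, matrixElements, numerators, denominators, gpublocks, gputhreads );
  }
};
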
… (use the GPU abstraction layer)

As in the previous commit:
- This succeeds: "CUDA_HOME=none make; ./runTest.exe"
- This fails: "make"
…h, check_sa.cc), WIP towards fixing cuda builds

These changes essentially change the ME buffer from a simple array to an AOSOA (from 1 to ncomb+1 elements per event)

However this now breaks C++ builds too - now this fails: "CUDA_HOME=none make -j"
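
A hedged sketch of the AOSOA indexing idea (illustrative names and layout assumptions, not the actual MemoryAccessMatrixElements.h code):

constexpr int ncomb = 16; // number of helicity combinations (example value for gg_tt)
constexpr int neppM = 4;  // events per AOSOA "page" (illustrative)

// One ME slot per helicity plus one extra slot per event (assumed here to hold the sum),
// with the events of a page kept contiguous in memory for coalesced access.
inline int meIndex( int ievt, int islot ) // islot in [0, ncomb]
{
  const int ipagM = ievt / neppM; // page index
  const int ieppM = ievt % neppM; // event index within the page
  return ( ipagM * ( ncomb + 1 ) + islot ) * neppM + ieppM;
}
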
…as "ihel" for consistency with the rest of the code
… files), fix "CUDA_HOME=none make" builds

This completes the C++ build when moving ME buffers from 1 per event to ncomb+1 per event

However runTest.exe for C++ is now failing (output ME is 0)
It is not 100% clear to me why in Feb 2024 I went to one ME per helicity: functionality-wise, can I start without that? Try to revert...
…ers that include only the sum over helicities

This essentially goes back to commit cabcb0d (26 Feb 2024):
- This succeeds: "CUDA_HOME=none make; ./runTest.exe"
- This fails: "make"

Revert "[hack_ihel] more WIP in gg_tt.sa CPPProcess.cc - C++ runTest still fails"
This reverts commit 97ea53c.

Revert "[hack_ihel] more WIP in gg_tt.sa CPPProcess.cc - C++ runTest still fails, but at least now ME is not 0..."
This reverts commit 98e3f58.

Revert "[hack_ihel] in gg_tt.sa (MemoryAccessMatrixElements.h and various .cc files), fix "CUDA_HOME=none make" builds"
This reverts commit fb4aac1.

Revert "[hack_ihel] in gg_tt.sa MemoryAccessMatrixElements.h, rename "icomb" as "ihel" for consistency with the rest of the code"
This reverts commit c65f089.

Revert "[hack_ihel] in gg_tt.sa (MemoryAccessMatrixElements.h, MemoryBuffers.h, check_sa.cc), WIP towards fixing cuda builds"
This reverts commit 982f0e9.
…pMatrixElementsAux), however runTest and gcheck both fail

This is for runTest:
[----------] 1 test from SIGMA_SM_GG_TTX_GPU/MadgraphTest
[ RUN      ] SIGMA_SM_GG_TTX_GPU/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
ERROR! assertGpu: 'an illegal memory access was encountered' (700) in MemoryBuffers.h:530
runTest.exe: GpuRuntime.h:26: void assertGpu(cudaError_t, const char*, int, bool): Assertion `code == gpuSuccess' failed.
Aborted (core dumped)

This is for gcheck (including gdb info):
cuda-gdb --args ./gcheck.exe -p 1 8 1
Thread 1 "gcheck.exe" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
mg5amcGpu::calculate_wavefunctions<<<(1,1,1),(8,1,1)>>> () at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:288
288               jamp2_sv[ncolor * iParity + icolC] += cxabs2( jamp_sv[icolC] );

(Note: "CUDA_HOME=none make -j; ./runTest.exe" succeeds instead)
…r of CPUTest and GPUTest

This ensures that the auxiliary memory buffers for hel/col selection created in computeGoodHelicities are kept across iterations

This fixes the segfault previously observed in runTest.exe: now both C++ and CUDA builds and tests are all ok in gg_tt.sa!
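
A minimal sketch of the principle behind this fix (the class and member names are hypothetical, not the actual MatrixElementKernels.h members): the auxiliary device buffers are owned by the kernel object and allocated once, so they survive from computeGoodHelicities to the later computeMatrixElements iterations instead of being destroyed after the first call.

#include <cuda_runtime.h>

class MekAuxBuffersSketch // hypothetical helper, for illustration only
{
public:
  void computeGoodHelicities()
  {
    if( !m_devAux ) cudaMalloc( &m_devAux, m_nbytes ); // allocated once, then reused
    // ... launch sigmaKin_getGoodHel using m_devAux ...
  }
  void computeMatrixElements()
  {
    // ... m_devAux is still alive here: it is NOT freed at the end of computeGoodHelicities ...
  }
  ~MekAuxBuffersSketch() { if( m_devAux ) cudaFree( m_devAux ); }
private:
  double* m_devAux = nullptr;                // persistent auxiliary device buffer for hel/col selection
  size_t m_nbytes = 1024 * sizeof( double ); // illustrative size
};
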
Fix conflicts:
- epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/runTest.cc
- epochX/cudacpp/gg_tt.sa/SubProcesses/MatrixElementKernels.cc
- epochX/cudacpp/gg_tt.sa/SubProcesses/MatrixElementKernels.h
- epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc
- epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.h
- epochX/cudacpp/gg_tt.sa/SubProcesses/runTest.cc
- epochX/cudacpp/gg_tt.sa/src/mgOnGpuConfig.h

Note: there are some build warnings in gg_tt.sa for cuda after the merge
(maybe this is simply because in upstream/master warnings are enabled in the compiler?)

ccache /usr/local/cuda-12.0/bin/nvcc  -I. -I../../src  -Xcompiler -O3 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -lineinfo -use_fast_math -Xcompiler -Wunused-parameter -I/usr/local/cuda-12.0/include/ -DUSE_NVTX  -std=c++17  -ccbin /usr/lib64/ccache/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -Xcompiler -fPIC -DMGONGPU_CHANNELID_DEBUG -c -x cu CPPProcess.cc -o CPPProcess_cuda.o
CPPProcess.cc(999): warning #177-D: variable "ievt" was declared but never referenced

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

CPPProcess.cc: In function ‘void mg5amcGpu::sigmaKin(const fptype*, const fptype*, const fptype*, const fptype*, mgOnGpu::fptype*, int*, int*, int, int, mgOnGpu::fptype*, mgOnGpu::fptype*)’:
CPPProcess.cc:963:1: warning: unused parameter ‘allrndcol’ [-Wunused-parameter]
  962 |             const fptype* allrndhel,            // input: random numbers[nevt] for helicity selection
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  963 |             const fptype* allrndcol,            // input: random numbers[nevt] for color selection
      | ^
CPPProcess.cc:971:1: warning: unused parameter ‘allselcol’ [-Wunused-parameter]
  970 |             int* allselhel,                     // output: helicity selection[nevt]
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  971 |             int* allselcol,                     // output: helicity selection[nevt]
      | ^
… a build warning

(NB: a more comprehensive fix also covering allrndcol and allselcol must be implemented in .mad)
…PPProcess.cc) using one kernel per helicity

This is based on work from Olivier and myself at the Lugano hackathon in Sep 2022, plus my later work in Feb and Oct 2024
… using one kernel per helicity

This is based on work from Olivier and myself at the Lugano hackathon in Sep 2022, plus my later work in Feb and Oct 2024
…rnel per helicity

The C++ builds and tests are ok
The CUDA builds, however, fail
… selection in CUDA when multichannel is enabled

Now the builds succeed and check.exe returns the same average ME as C++.

However runTest fails in the color selection:
  ME  8.466265863460649e-02
r.ME  8.466265911883228e-02
  ChanId       1
r.ChanId       1
  SelHel       8
r.SelHel       8
  SelCol       1
r.SelCol       2
… selection in CUDA - runTest still fails

  ME  7.095177923073470e-02
r.ME  7.095177937109977e-02
  ChanId       1
r.ChanId       1
  SelHel      12
r.SelHel      12
  SelCol       1
r.SelCol       2
…fter helicity loop and before select_col - but runTest still fails
…esses thanks to the fix by Olivier

(But only in processes that I normally do not test like pp_tt012j and nobm_pp_ttW)

@valassi (Member Author) commented Oct 13, 2025

Hi @oliviermattelaer again, voila, I included my latest changes. Ready for review again.

I think that the only pending point is the color matrix codegen vs mg5amc. As discussed, I would merge now and then do an upgrade to a newer mg5amc in another MR. Up to you Olivier.

Thanks!
Andrea

@valassi (Member Author) commented Oct 13, 2025

There are some failures in the CI. I am debugging this.

@valassi force-pushed the hack_ihel3_sep25_pr branch from 85b63e8 to 7568aa0 on October 13, 2025 15:35

@Qubitol (Collaborator) commented Oct 13, 2025

@valassi for the issues in the CI, have a look at #1051: @roiser and I are looking into that, since it affects other branches independently.
It seems there is an AVX512 instruction in the cppnone build, coming from Googletest.

…ll processes - bug fix for cuda builds in no-multichannel mode
@valassi force-pushed the hack_ihel3_sep25_pr branch from 7568aa0 to e0d061b on October 13, 2025 15:54

@valassi (Member Author) commented Oct 13, 2025

> @valassi for the issues in the CI, have a look at #1051: @roiser and I are looking into that, since it affects other branches independently. It seems there is an AVX512 instruction in the cppnone build, coming from Googletest.

Thanks Daniele! Indeed I have the impression that the failure has nothing to do with the cudacpp code itself.

(Anyway it is actually good that I tested manually - there was a bug not caught by the CI, as no .sa processes are tested with CUDA)

About the CI issue: if you suspect a problem in the SIMD build of googletest, maybe build it with no SIMD, as in cudacpp "none"? That is "-mno-sse". Just an idea... I can try it here if you want (or you can try it here too!)

Andrea

@Qubitol (Collaborator) commented Oct 13, 2025

Thanks Andrea, indeed the issue was in googletest being built for a non-compatible ISA.
We have now forced avx2.

@Qubitol (Collaborator) commented Oct 14, 2025

Hi @valassi, the problem has been fixed in #1052. We will soon merge and then you can merge it back in your branch.

@oliviermattelaer (Member) left a comment

OK, I have checked with the latest version and (slightly surprisingly to me) there is no issue with the fact that I changed the standard format for the color matrix (but this is also good news).
So I'm happy to approve.
Thanks for the additional options; I'm looking forward to seeing the updated plots with those.

@Qubitol, I guess we can merge this directly and then do another PR for the CI?
(in that case you can click on the merge button)
Or do you prefer another order?

Thanks,

Olivier

@valassi (Member Author) commented Oct 15, 2025

Hi @oliviermattelaer thanks for approving!

The new plots for the color sums are here

  • dashed line ihel3p1 uses kernels like the previous ihel3, but it is better because it uses a triangular matrix rather than a square matrix
  • dashed line ihel3p1b uses BLAS like the previous ihel3b, but it is better because it uses a single multiplication with nhel*nevt events (all helicities together) rather than nhel multiplications with nevt events (one helicity at a time)

The bottom line is that BLAS is now as good as kernels for ggttggg, and even better in d mode. But in simpler processes BLAS is still worse than kernels, so I would leave BLAS disabled by default.
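
To make the ihel3p1b idea concrete, here is a hedged sketch of the single-GEMM color sum (illustrative names, real-valued jamps for simplicity; the actual code handles complex amplitudes and many more details):

#include <cublas_v2.h>

void colorSumAllHelicities( cublasHandle_t handle,
                            const double* d_colorMatrix, // ncolor x ncolor color matrix (device)
                            const double* d_J,           // ncolor x (nhel*nevt) jamp matrix (device)
                            double* d_T,                 // ncolor x (nhel*nevt) workspace (device)
                            int ncolor, int nhel, int nevt )
{
  const double one = 1.0, zero = 0.0;
  const int ncols = nhel * nevt; // all helicities folded into one GEMM,
                                 // instead of nhel GEMMs of nevt columns each
  cublasDgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N,
               ncolor, ncols, ncolor,
               &one, d_colorMatrix, ncolor, d_J, ncolor,
               &zero, d_T, ncolor );
  // ... followed by a small per-column dot-product kernel, ME(j) += sum_a J(a,j) * T(a,j),
  // accumulating the color-summed MEs over helicities (not shown) ...
}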

(As for the results with cuda graphs, they are better than the ihel4 line, but still worse than all other options. I will add them to the other WIP MR on splitting Feynman diagrams).

[attached plot comparing the color sum options described above]

@Qubitol (Collaborator) commented Oct 15, 2025

@oliviermattelaer as you wish, #1052 would solve the issues with the few tests that do not pass, so if we merge this now, then I will merge the new master into the branch for #1052, check the tests again and we are good to also merge #1052 after this.
I think the order doesn't matter much at this point.

@valassi (Member Author) commented Oct 15, 2025

@oliviermattelaer about the order: both options are fine for me.

Just one (general) point: I would suggest regenerating the processes at every MR. Note in particular that the CI only runs GPU tests on generated code, so if you do not regenerate, you run the CI tests only for the C++ code. So, specifically, if you merge Daniele's nopatch MR first, then I would also regenerate the code here.

@oliviermattelaer force-pushed the hack_ihel3_sep25_pr branch 2 times, most recently from 11c4d0e to 83c728d on October 21, 2025 09:35

@Qubitol (Collaborator) commented Oct 22, 2025

@oliviermattelaer, @valassi, everything seems fine now. If you agree, I'd suggest we merge this and we release.

@oliviermattelaer (Member) commented
Let's do it!!!
Thanks a lot @valassi !! (and @Qubitol for having regenerated the processes, which it looks like I'm not able to do correctly)

@oliviermattelaer merged commit 586b428 into madgraph5:master on Oct 22, 2025
167 checks passed