Kernel splitting ihel1/2/3: helicity streams, color sum kernel, color sum BLAS #1049
Conversation
Revert "[hack_ihel] first results for hack_ihel - quite confusing, some failures, some faster, some slower" This reverts commit 717157c.
…functionally ok, faster for ggttg* ! (slower for ggtt and eemumu)
… code except gg_tt.sa, and also tput and CODEGEN, from last ancestor
This should limit the conflicts to fix to gg_tt.sa. I will then need to recreate CODEGEN from scratch, but I want to change gg_tt.sa first to make it closer to upstream/master (in particular, split again sigmakin_getgoodhel into two separate implementations). Note that all developments in hack_ihel in Nov2022 were made on gg_tt.sa (by AV) and on CODEGEN (by OM).
base=$(git merge-base upstream/master HEAD)
dirs=$(git ls-tree --name-only HEAD *.mad *.sa | grep -v gg_tt.sa)
\rm -rf $dirs
git checkout $base $dirs
git add $dirs
git checkout $base tput
git checkout $base CODEGEN
Note that the only differences to $base are in gg_tt.sa:
git diff $base --name-only
epochX/cudacpp/gg_tt.sa/CODEGEN_cudacpp_gg_tt_log.txt
epochX/cudacpp/gg_tt.sa/SubProcesses/MatrixElementKernels.cc
epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc
epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.h
epochX/cudacpp/gg_tt.sa/src/mgOnGpuConfig.h
…goodhel CUDA/C++ implementations to ease the merging of upstream/master
…tgoodhel CUDA/C++ implementations to ease the merging of upstream/master
Fix conflicts:
- epochX/cudacpp/gg_tt.sa/CODEGEN_cudacpp_gg_tt_log.txt
- epochX/cudacpp/gg_tt.sa/SubProcesses/MatrixElementKernels.cc
- epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc
- epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.h
- epochX/cudacpp/gg_tt.sa/src/mgOnGpuConfig.h
NB: the code does not build yet for CUDA - there are more fundamental issues to be fixed, e.g. in random color/helicity selection. Instead it builds ok and the tests are successful for C++ (FPTYPE=d,f,m).
--
Long description of changes added during the Nov 2022 hackathon (differences between upstream/hack_ihel and the merge base with upstream/master). I will also add something similar to CODEGEN.
CPPProcess.h
- About sigmaKin_getGoodHel. This was a kernel (__global__); it is now a function inside which a kernel is launched. It had different signatures for C++ and CUDA (with/without nevt); it now has the same signature (with nevt).
- About sigmaKin. This was a kernel (__global__); it is now a function inside which a kernel is launched. The signature is still different for C++ (with nevt) and CUDA (but here ngpublocks and ngputhreads have been added).
- About normaliseoutput. This is a new kernel function (__global__), normalising allMEs from numerators and denominators. Note: for the moment this is only defined for CUDA.
CPPProcess.cc
- About cGoodHel and cNGoodHel. For CUDA these are no longer in GPU constant memory (replaced by static variables as in C++). This is because the loop over helicities is now done by a C++ function calling kernels, not within a kernel.
- About calculate_wavefunctions. This was a __device__ function; it is now a kernel (__global__).
- About sigmaKin_getGoodHel. This was a kernel (__global__); it is now a function inside which a kernel is launched. It had different signatures for C++ and CUDA (with/without nevt); it now has the same signature (with nevt). It had two completely separate implementations for C++ and CUDA; now there is a single implementation with ifdefs inside. The CUDA implementation now calls calculate_wavefunctions as a kernel, no longer as a device function. The CUDA implementation no longer copies the list of good helicities to the GPU constant memory.
- About sigmaKin. This was a kernel (__global__); it is now a function inside which a kernel is launched. The signature is still different for C++ (with nevt) and CUDA (but here ngpublocks and ngputhreads have been added). The CUDA implementation now calls calculate_wavefunctions as a kernel, no longer as a device function. The CUDA implementation now normalises allMEs from numerators and denominators through the normaliseoutput kernel. Note: for the moment the C++ implementation still normalises allMEs directly inside calculate_wavefunctions.
- About normaliseoutput. This is a new kernel function (__global__), normalising allMEs from numerators and denominators. Note: for the moment this is only defined for CUDA.
MatrixElementKernels.cc
- About MatrixElementKernelDevice::computeGoodHelicities. The devIsGoodHel data buffer is no longer needed (as helicities remain on the host). Here sigmaKin_getGoodHel is called as a function and no longer as a kernel.
- About MatrixElementKernelDevice::computeMatrixElements. Here sigmaKin is called as a function and no longer as a kernel. Note that MGONGPU_NSIGHT_DEBUG is no longer supported (well, it should be inside sigmaKin).
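To make the kernel-vs-host-function distinction described above concrete, here is a minimal sketch (hypothetical names and signatures, not the actual CPPProcess code): a __global__ normalisation kernel in the spirit of normaliseoutput, rescaling each event's ME by its numerator/denominator ratio, launched from an ordinary host function that receives the grid dimensions as arguments instead of being a kernel itself.

```cpp
#include <cuda_runtime.h>

// Hypothetical sketch of a normalisation kernel: rescale each event's ME
// by its multichannel numerator/denominator ratio (one thread per event).
__global__ void normaliseOutputSketch( double* allMEs, const double* allNumerators,
                                       const double* allDenominators, int nevt )
{
  const int ievt = blockIdx.x * blockDim.x + threadIdx.x;
  if( ievt >= nevt ) return;
  if( allDenominators[ievt] != 0. ) allMEs[ievt] *= allNumerators[ievt] / allDenominators[ievt];
}

// Hypothetical host-side wrapper: no longer a __global__ kernel itself, it simply
// launches the kernel with the block/thread counts passed in as arguments.
void normaliseSketch( double* allMEs, const double* allNumerators, const double* allDenominators,
                      int ngpublocks, int ngputhreads )
{
  const int nevt = ngpublocks * ngputhreads;
  normaliseOutputSketch<<<ngpublocks, ngputhreads>>>( allMEs, allNumerators, allDenominators, nevt );
  cudaDeviceSynchronize(); // wait for the kernel in this simple sketch
}
```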
… (use the GPU abstraction layer) As in the previous commit:
- This succeeds: "CUDA_HOME=none make; ./runTest.exe"
- This fails: "make"
…h, check_sa.cc), WIP towards fixing cuda builds
These changes essentially change the ME buffer from a simple array to an AOSOA (from 1 to ncomb+1 elements per event).
However this now breaks the C++ builds too - now this fails: "CUDA_HOME=none make -j"
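As an illustration of the buffer change mentioned in this commit, here is a sketch of what an AOSOA indexing with ncomb+1 ME slots per event might look like (hypothetical constants and names, not the actual MemoryAccessMatrixElements.h code; note that the later commits below revert this layout):

```cpp
#include <cstddef>

// Hypothetical sketch of an AOSOA layout storing ncomb+1 matrix elements per event:
// one slot per helicity combination plus one slot for the sum over helicities.
constexpr int neppM = 4;  // events per "page" (assumed SIMD/coalescing width)
constexpr int ncomb = 16; // number of helicity combinations (e.g. for gg->ttbar)

// Index into a buffer of size nevt*(ncomb+1), laid out as buffer[ipage][islot][ieppM],
// with islot in [0, ncomb] and islot==ncomb holding the sum over helicities.
inline std::size_t meIndex( int ievt, int islot )
{
  const int ipage = ievt / neppM; // page index
  const int ieppM = ievt % neppM; // event index within the page
  return ( (std::size_t)ipage * ( ncomb + 1 ) + islot ) * neppM + ieppM;
}
```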
…as "ihel" for consistency with the rest of the code
… files), fix "CUDA_HOME=none make" builds
This completes the C++ build when moving ME buffers from 1 per event to ncomb+1 per event.
However runTest.exe for C++ is now failing (output ME is 0).
…ils, but at least now ME is not 0...
Not 100% clear to me why in Feb 2024 I went to one ME per helicity: functionality-wise I can start without that? Try to revert...
…ers that include only the sum over helicities
This essentially goes back to commit cabcb0d (26 Feb 2024):
- This succeeds: "CUDA_HOME=none make; ./runTest.exe"
- This fails: "make"
Revert "[hack_ihel] more WIP in gg_tt.sa CPPProcess.cc - C++ runTest still fails" This reverts commit 97ea53c.
Revert "[hack_ihel] more WIP in gg_tt.sa CPPProcess.cc - C++ runTest still fails, but at least now ME is not 0..." This reverts commit 98e3f58.
Revert "[hack_ihel] in gg_tt.sa (MemoryAccessMatrixElements.h and various .cc files), fix "CUDA_HOME=none make" builds" This reverts commit fb4aac1.
Revert "[hack_ihel] in gg_tt.sa MemoryAccessMatrixElements.h, rename "icomb" as "ihel" for consistency with the rest of the code" This reverts commit c65f089.
Revert "[hack_ihel] in gg_tt.sa (MemoryAccessMatrixElements.h, MemoryBuffers.h, check_sa.cc), WIP towards fixing cuda builds" This reverts commit 982f0e9.
…pMatrixElementsAux), however runTest and gcheck both fail
This is for runTest:
[----------] 1 test from SIGMA_SM_GG_TTX_GPU/MadgraphTest
[ RUN ] SIGMA_SM_GG_TTX_GPU/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
ERROR! assertGpu: 'an illegal memory access was encountered' (700) in MemoryBuffers.h:530
runTest.exe: GpuRuntime.h:26: void assertGpu(cudaError_t, const char*, int, bool): Assertion `code == gpuSuccess' failed.
Aborted (core dumped)
This is for gcheck (including gdb info):
cuda-gdb --args ./gcheck.exe -p 1 8 1
Thread 1 "gcheck.exe" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
mg5amcGpu::calculate_wavefunctions<<<(1,1,1),(8,1,1)>>> () at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:288
288 jamp2_sv[ncolor * iParity + icolC] += cxabs2( jamp_sv[icolC] );
(Note: "CUDA_HOME=none make -j; ./runTest.exe" succeeds instead)
…ils with a segfault
…r of CPUTest and GPUTest
This ensures that the auxiliary memory buffers for hel/col selection created in computeGoodHelicities are kept across iterations.
This fixes the segfault previously observed in runTest.exe: now both C++ and CUDA builds and tests are all ok in gg_tt.sa!
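As an illustration of the buffer-lifetime issue this commit fixes, here is a minimal sketch (with hypothetical names, not the actual MatrixElementKernels code) of keeping an auxiliary device buffer as a class member so that it is allocated once and reused across iterations rather than being recreated (or left dangling) on every call:

```cpp
#include <cuda_runtime.h>

// Hypothetical sketch: a kernel-launcher class owning its auxiliary device buffer,
// so that the buffer allocated on first use is kept (not freed) across iterations.
class SelectionBuffers
{
public:
  explicit SelectionBuffers( int nevt ) : m_nevt( nevt ), m_devSelHel( nullptr ) {}
  ~SelectionBuffers() { if( m_devSelHel ) cudaFree( m_devSelHel ); }
  int* devSelHel() // lazily allocated once, then reused for the lifetime of the object
  {
    if( !m_devSelHel ) cudaMalloc( &m_devSelHel, m_nevt * sizeof( int ) );
    return m_devSelHel;
  }
private:
  int m_nevt;
  int* m_devSelHel;
};
```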
…for compatibility with HIP
…d keep it only in CPPProcess.cc
Fix conflicts:
- epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/runTest.cc
- epochX/cudacpp/gg_tt.sa/SubProcesses/MatrixElementKernels.cc
- epochX/cudacpp/gg_tt.sa/SubProcesses/MatrixElementKernels.h
- epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc
- epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.h
- epochX/cudacpp/gg_tt.sa/SubProcesses/runTest.cc
- epochX/cudacpp/gg_tt.sa/src/mgOnGpuConfig.h
Note: there are some build warnings in gg_tt.sa for cuda after the merge (maybe this is simply because in upstream/master warnings are enabled in the compiler?)
ccache /usr/local/cuda-12.0/bin/nvcc -I. -I../../src -Xcompiler -O3 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -lineinfo -use_fast_math -Xcompiler -Wunused-parameter -I/usr/local/cuda-12.0/include/ -DUSE_NVTX -std=c++17 -ccbin /usr/lib64/ccache/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -Xcompiler -fPIC -DMGONGPU_CHANNELID_DEBUG -c -x cu CPPProcess.cc -o CPPProcess_cuda.o
CPPProcess.cc(999): warning #177-D: variable "ievt" was declared but never referenced
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
CPPProcess.cc: In function ‘void mg5amcGpu::sigmaKin(const fptype*, const fptype*, const fptype*, const fptype*, mgOnGpu::fptype*, int*, int*, int, int, mgOnGpu::fptype*, mgOnGpu::fptype*)’:
CPPProcess.cc:963:1: warning: unused parameter ‘allrndcol’ [-Wunused-parameter]
  962 |   const fptype* allrndhel, // input: random numbers[nevt] for helicity selection
      |   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  963 |   const fptype* allrndcol, // input: random numbers[nevt] for color selection
      |   ^
CPPProcess.cc:971:1: warning: unused parameter ‘allselcol’ [-Wunused-parameter]
  970 |   int* allselhel, // output: helicity selection[nevt]
      |   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  971 |   int* allselcol, // output: helicity selection[nevt]
      |   ^
… a build warning (NB: a more comprehensive fix also covering allrndcol and allselcol must be implemented in .mad)
…ming code generation
…PPProcess.cc) using one kernel per helicity
This is based on work from Olivier and myself at the Lugano hackathon in Sep 2022, plus my later work in Feb and Oct 2024
… using one kernel per helicity
This is based on work from Olivier and myself at the Lugano hackathon in Sep 2022, plus my later work in Feb and Oct 2024
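For readers new to the thread, a minimal sketch of the "one kernel per helicity" idea behind these commits (hypothetical names and signatures, not the actual CPPProcess.cc code): each good helicity gets its own kernel launch, optionally on its own CUDA stream, writing into its own slice of a per-helicity ME buffer, and a final kernel sums the per-helicity contributions per event.

```cpp
#include <cuda_runtime.h>
#include <vector>

// Hypothetical per-helicity kernel stub: compute |M|^2 for helicity ihel and event ievt
// and store it in this helicity's slice of the ME buffer (one thread per event).
__global__ void calcWavefunctionsForHelicity( int ihel, const double* momenta, double* meForHel, int nevt )
{
  const int ievt = blockIdx.x * blockDim.x + threadIdx.x;
  if( ievt >= nevt ) return;
  (void)ihel; (void)momenta;  // the real amplitude calculation would use these
  meForHel[ievt] = 0.;        // placeholder for the per-helicity |M|^2 of this event
}

// Sum the per-helicity contributions into the final ME of each event.
__global__ void sumOverHelicities( const double* meAllHel, double* allMEs, int ngoodhel, int nevt )
{
  const int ievt = blockIdx.x * blockDim.x + threadIdx.x;
  if( ievt >= nevt ) return;
  double sum = 0.;
  for( int i = 0; i < ngoodhel; i++ ) sum += meAllHel[i * nevt + ievt];
  allMEs[ievt] = sum;
}

// Hypothetical driver: one kernel launch per good helicity, each on its own stream
// and each writing its own slice of meAllHel, so independent helicities may overlap.
void sigmaKinSketch( const double* momenta, double* meAllHel, double* allMEs,
                     const std::vector<int>& goodHel, int gpublocks, int gputhreads )
{
  const int nevt = gpublocks * gputhreads;
  std::vector<cudaStream_t> streams( goodHel.size() );
  for( auto& s : streams ) cudaStreamCreate( &s );
  for( size_t i = 0; i < goodHel.size(); i++ )
    calcWavefunctionsForHelicity<<<gpublocks, gputhreads, 0, streams[i]>>>( goodHel[i], momenta, meAllHel + i * nevt, nevt );
  for( auto& s : streams ) { cudaStreamSynchronize( s ); cudaStreamDestroy( s ); }
  sumOverHelicities<<<gpublocks, gputhreads>>>( meAllHel, allMEs, (int)goodHel.size(), nevt );
}
```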
…rnel per helicity
The C++ builds and tests are ok. The CUDA builds instead fail.
… selection in CUDA when multichannel is enabled
Now the builds succeed and check.exe returns the same average ME as C++.
However runTest fails in the color selection:
ME 8.466265863460649e-02 r.ME 8.466265911883228e-02
ChanId 1 r.ChanId 1
SelHel 8 r.SelHel 8
SelCol 1 r.SelCol 2
…for C++ and allJamp2s for CUDA
… selection in CUDA - runTest still fails
ME 7.095177923073470e-02 r.ME 7.095177937109977e-02
ChanId 1 r.ChanId 1
SelHel 12 r.SelHel 12
SelCol 1 r.SelCol 2
…fter helicity loop and before select_col - but runTest still fails
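For context on what the color selection step discussed in the commits above does, here is a sketch with hypothetical names (not the actual select_col code): the color index is typically drawn by comparing a random number against the cumulative distribution built from the per-color weights jamp2 accumulated over the helicity loop, which is why the jamp2 values must be finalised before the selection.

```cpp
#include <cassert>

// Hypothetical sketch of random color selection for one event:
// pick the color whose cumulative weight first exceeds rndcol * total.
int selectColorSketch( const double* jamp2, int ncolor, double rndcol /* in [0,1) */ )
{
  double total = 0.;
  for( int icol = 0; icol < ncolor; icol++ ) total += jamp2[icol];
  assert( total > 0. );
  double cumulative = 0.;
  for( int icol = 0; icol < ncolor; icol++ )
  {
    cumulative += jamp2[icol];
    if( rndcol * total < cumulative ) return icol + 1; // color indices are usually reported 1-based
  }
  return ncolor; // fallback for rounding edge cases
}
```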
…r sum) into hack_ihel3_sep25_pr
…esses thanks to the fix by Olivier (But only in processes that I normally do not test like pp_tt012j and nobm_pp_ttW)
Hi @oliviermattelaer again, voila, I included my latest changes. Ready for review again. I think that the only pending point is the color matrix codegen vs mg5amc. As discussed, I would merge now and then do an upgrade to a newer mg5amc in another MR. Up to you Olivier. Thanks!
…or cuda builds in no-multichannel mode
There are some failures in the CI. I am debugging this.
Force-pushed from 85b63e8 to 7568aa0
…ll processes - bug fix for cuda builds in no-multichannel mode
Force-pushed from 7568aa0 to e0d061b
Thanks Daniele! Indeed I have the impression that the failure has nothing to do with the cudacpp code itself. (Anyway it is actually good that I tested manually - there was a bug not caught by the CI, as no .sa processes are tested with cuda.)
About the CI issue: if you suspect a problem in SIMD builds of googletest, maybe use no-SIMD as cudacpp/none? This is "-mno-sse". Just an idea... I can try it here if you want (or you can try it here too!)
Andrea
Thanks Andrea, indeed the issue was in googletest being built for a non-compatible ISA.
Ok, I have checked with the latest version and (slightly surprisingly to me) there is no issue with the fact that I change the standard format for the color matrix (but this is also good news).
So I'm happy to approve.
Thanks for the additional options. I'm looking forward to seeing the updated plot with those.
@Qubitol, I guess we can merge this directly and then do another PR for the CI?
(in case then you can click on the merge button)
Or you prefer another order?
Thanks,
Olivier
Hi @oliviermattelaer thanks for approving! The new plots for the color sums are here.
The bottom line is that now BLAS is as good as kernels for ggttggg, and even better in d mode. But in simpler processes BLAS is still worse than kernels. So I would leave BLAS disabled by default. (As for the results with cuda graphs, they are better than the ihel4 line, but still worse than all other options. I will add them to the other WIP MR on splitting Feynman diagrams.)
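For readers comparing the "color sum kernel" and "color sum BLAS" options mentioned above, here is a sketch of the quantity both compute, with hypothetical names (not the actual cudacpp or BLAS-backed implementation): the per-event color sum ME = sum_{i,j} conj(jamp[i]) * C[i][j] * jamp[j] with a real symmetric color matrix C; batched over many events, the same contraction can be cast as matrix-matrix products, which is what a BLAS-based path would hand to gemm-like routines, and which pays off mainly when ncolor is large (e.g. ggttggg).

```cpp
#include <complex>
#include <vector>

// Hypothetical per-event color sum: ME = sum_{i,j} conj(jamp[i]) * C[i][j] * jamp[j].
double colorSumSketch( const std::vector<std::complex<double>>& jamp,        // ncolor jamps for one event
                       const std::vector<std::vector<double>>& colorMatrix ) // ncolor x ncolor, real symmetric
{
  const size_t ncolor = jamp.size();
  double me = 0.;
  for( size_t icol = 0; icol < ncolor; icol++ )
  {
    std::complex<double> ztemp = 0.;
    for( size_t jcol = 0; jcol < ncolor; jcol++ ) ztemp += colorMatrix[icol][jcol] * jamp[jcol];
    me += std::real( std::conj( jamp[icol] ) * ztemp );
  }
  return me;
}
```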
@oliviermattelaer as you wish, #1052 would solve the issues with the few tests that do not pass, so if we merge this now, then I will merge the new master into the branch for #1052, check the tests again, and we are good to also merge #1052 after this.
@oliviermattelaer about the order: both options are fine for me. Just one (general) point: I would suggest to regenerate processes at every MR. Note in particular that the CI only runs GPU tests on generated code. So if you do not regenerate, you run the CI tests only for the C++ code. So specifically, if you merge Daniele's nopatch MR first, then here I would also regenerate code.
Force-pushed from 11c4d0e to 83c728d
Force-pushed from 83c728d to 7c6e9ba
@oliviermattelaer, @valassi, everything seems fine now. If you agree, I'd suggest we merge this and we release.
Hi @oliviermattelaer, as discussed recently and as per my presentation at the MG5AMC meeting last Friday.
This is the PR for my kernel splitting changes that I recommend merging:
I have prepared a paper that will shortly be on arXiv with all the details.
Until yesterday, it would have been possible to merge this automatically, as I had merged the latest upstream in my developments. Yesterday there were some new changes merged (for tREX I think), so this will need some massaging. I can do that later on, or let me know how you want to proceed.
Thanks, Andrea