Conversation


@valassi valassi commented Sep 19, 2024

Disable hipcc optimizations i.e. use -O0 instead of -O3 (work around for gq_ttq crash #806)

… (commented out) for the memory corruption madgraph5#806

This shows a valgrind warning about an uninitialised value deep inside the HIP runtime, triggered while getting random numbers from hiprand

[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > valgrind ./check_hip.exe -p 1 8 1
==105499== Memcheck, a memory error detector
==105499== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==105499== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==105499== Command: ./check_hip.exe -p 1 8 1
==105499==
==105499== Warning: set address range perms: large range [0x59c90000, 0x159e91000) (noaccess)
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
Get random numbers from Hiprand
==105499== Conditional jump or move depends on uninitialised value(s)
==105499==    at 0x1253777C: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12537F40: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12540782: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x125629DD: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x4B825EB: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B88342: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B822FF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B55120: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B2B590: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D84AF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D87C4: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4A00FA2: hipMemcpy (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==
==105499== Conditional jump or move depends on uninitialised value(s)
==105499==    at 0x12537B82: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12537F40: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12540782: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x125629DD: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x4B825EB: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B88342: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B822FF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B55120: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B2B590: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D84AF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D87C4: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4A00FA2: hipMemcpy (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==
Got random numbers from Hiprand
==105499== Invalid read of size 8
==105499==    at 0x21F741: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x21D0D1: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x215CBB: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==  Address 0x1c00000043 is not stack'd, malloc'd or (recently) free'd
==105499==
==105499==
==105499== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==105499==  Access not within mapped region at address 0x1C00000043
==105499==    at 0x21F741: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x21D0D1: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x215CBB: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==  If you believe this happened as a result of a stack
==105499==  overflow in your program's main thread (unlikely but
==105499==  possible), you can try to increase the size of the
==105499==  main thread stack using the --main-stacksize= flag.
==105499==  The main thread stack size used in this run was 16777216.
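The invalid read at the nonsense address 0x1c00000043 occurs inside std::map<std::string,float>::operator[] called from TimerMap::start. A wild address like that usually means the map is a victim rather than the culprit: its heap-allocated tree nodes were trampled by an earlier out-of-bounds write elsewhere (here suspected in the GPU code path), so the next lookup chases a garbage pointer. A minimal sketch of the access pattern that ends up crashing (illustrative names, not the actual madgraph4gpu code):

```cpp
#include <cstddef>
#include <map>
#include <string>

// TimerMap::start ultimately does a lookup-insert like this. std::map keeps
// its entries in heap-allocated red-black tree nodes linked by pointers, and
// operator[] dereferences those pointers while searching for the key. If a
// stray out-of-bounds write elsewhere overwrites a node, a later operator[]
// call follows a corrupted pointer (e.g. 0x1c00000043) and valgrind reports
// "Invalid read of size 8" followed by SIGSEGV.
inline std::size_t demoTimerMapLookup()
{
  std::map<std::string, float> partitionTimers; // same key/value types as in the trace
  partitionTimers["00 GpuInit"] += 0.f;         // lookup-insert via operator[]
  return partitionTimers.size();
}
```

In a healthy heap this simply returns 1; the point is that the crash site (operator[]) is typically far away from the code that actually corrupted memory.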

Unfortunately, however, the --common mode also crashes (and reports the same uninitialised-value problem, whether related or not)
…ad of HIP pinned host malloc to debug madgraph5#806 - still crashes, will revert

This makes the valgrind 'conditional jump on uninitialised value' warnings disappear, but the crash from the invalid memory read still remains

[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > valgrind --track-origins=yes ./check_hip.exe --common -p 1 8 1
==10800== Memcheck, a memory error detector
==10800== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==10800== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==10800== Command: ./check_hip.exe --common -p 1 8 1
==10800==
==10800== Warning: set address range perms: large range [0x59c90000, 0x159e91000) (noaccess)
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
==10800== Invalid read of size 8
==10800==    at 0x21EF01: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x21CA21: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x2158A5: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==  Address 0x140000003b is not stack'd, malloc'd or (recently) free'd
==10800==
==10800==
==10800== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==10800==  Access not within mapped region at address 0x140000003B
==10800==    at 0x21EF01: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x21CA21: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x2158A5: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==  If you believe this happened as a result of a stack
==10800==  overflow in your program's main thread (unlikely but
==10800==  possible), you can try to increase the size of the
==10800==  main thread stack using the --main-stacksize= flag.
==10800==  The main thread stack size used in this run was 16777216.
==10800==
==10800== HEAP SUMMARY:
==10800==     in use at exit: 4,784,824 bytes in 17,735 blocks
==10800==   total heap usage: 306,364 allocs, 288,629 frees, 180,986,538 bytes allocated
==10800==
==10800== LEAK SUMMARY:
==10800==    definitely lost: 256 bytes in 5 blocks
==10800==    indirectly lost: 3,522 bytes in 64 blocks
==10800==      possibly lost: 9,544 bytes in 80 blocks
==10800==    still reachable: 4,771,502 bytes in 17,586 blocks
==10800==                       of which reachable via heuristic:
==10800==                         multipleinheritance: 384 bytes in 4 blocks
==10800==         suppressed: 0 bytes in 0 blocks
==10800== Rerun with --leak-check=full to see details of leaked memory
==10800==
==10800== For lists of detected and suppressed errors, rerun with: -s
==10800== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault
…madgraph5#806 - now valgrind gives no invalid read, but there is a 'Memory access fault'
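The 'DEBUG: TimerMap::stop() enter/exit' lines in the log below come from printouts added to timermap.h. A sketch of that kind of instrumentation (hypothetical code, not the actual commit), which brackets the map accesses so the crash can be localised relative to them:

```cpp
#include <iostream>
#include <map>
#include <string>

// Sketch of a TimerMap::stop() with bracketing debug printouts: if the
// process dies between "enter" and "exit", the corruption hit during the
// map retrieval; if the printouts complete, the fault is elsewhere.
struct TimerMapSketch
{
  std::map<std::string, float> partitionTimers;
  std::string activeKey;
  void stop()
  {
    std::cout << "DEBUG: TimerMap::stop() enter" << std::endl;
    if( !activeKey.empty() )
    {
      std::cout << "DEBUG: TimerMap::stop() retrieve '" << activeKey << "'" << std::endl;
      partitionTimers[activeKey] += 0.f; // real code would add the elapsed time here
      activeKey.clear();
    }
    std::cout << "DEBUG: TimerMap::stop() exit" << std::endl;
  }
};
```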

Using valgrind
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > valgrind --track-origins=yes ./check_hip.exe --common -p 1 8 1
==80385== Memcheck, a memory error detector
==80385== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==80385== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==80385== Command: ./check_hip.exe --common -p 1 8 1
==80385==
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() exit
==80385== Warning: set address range perms: large range [0x59c90000, 0x159e91000) (noaccess)
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '00 GpuInit'
DEBUG: TimerMap::stop() exit
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
Memory access fault by GPU node-4 (Agent handle: 0x1417d4a0) on address 0xfffd862e5000. Reason: Unknown.
==80385==
==80385== Process terminating with default action of signal 6 (SIGABRT): dumping core
==80385==    at 0x63D3D2B: raise (in /lib64/libc-2.31.so)
==80385==    by 0x63D53E4: abort (in /lib64/libc-2.31.so)
==80385==    by 0x12580D1B: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==80385==    by 0x1257ABC8: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==80385==    by 0x1252C9E6: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==80385==    by 0x127C66E9: start_thread (in /lib64/libpthread-2.31.so)
==80385==    by 0x64A150E: clone (in /lib64/libc-2.31.so)
==80385==
==80385== HEAP SUMMARY:
==80385==     in use at exit: 4,790,652 bytes in 17,774 blocks
==80385==   total heap usage: 306,424 allocs, 288,650 frees, 180,987,695 bytes allocated
==80385==
==80385== LEAK SUMMARY:
==80385==    definitely lost: 184 bytes in 4 blocks
==80385==    indirectly lost: 2,658 bytes in 52 blocks
==80385==      possibly lost: 10,768 bytes in 86 blocks
==80385==    still reachable: 4,777,042 bytes in 17,632 blocks
==80385==                       of which reachable via heuristic:
==80385==                         multipleinheritance: 496 bytes in 5 blocks
==80385==         suppressed: 0 bytes in 0 blocks
==80385== Rerun with --leak-check=full to see details of leaked memory
==80385==
==80385== For lists of detected and suppressed errors, rerun with: -s
==80385== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Aborted

Using rocgdb
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > rocgdb --args ./check_hip.exe  -p 1 8 1
GNU gdb (rocm-rel-6.0-131) 13.2
...
(gdb) run
Starting program: /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe -p 1 8 1
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
New Thread 0x1554445ff700 (LWP 94651)
New Thread 0x1555470b7700 (LWP 94652)
Thread 0x1554445ff700 (LWP 94651) exited
Warning: precise memory violation signal reporting is not enabled, reported
location may not be accurate.  See "show amdgpu precise-memory".

Thread 6 "check_hip.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x0000155547130598 in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) ()
   from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
(gdb) where
#0  0x0000155547130598 in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) ()
   from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
(gdb) l
1       ../sysdeps/x86_64/crtn.S: No such file or directory.
...
(gdb) set amdgpu precise-memory
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe -p 1 8 1
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
New Thread 0x1554445ff700 (LWP 99032)
New Thread 0x1555470b7700 (LWP 99033)
Thread 0x1554445ff700 (LWP 99032) exited
Thread 6 "check_hip.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x000015554713050c in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) ()
   from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
...
(gdb) info threads
  Id   Target Id                                         Frame
  1    Thread 0x1555471dda80 (LWP 98983) "check_hip.exe" 0x0000155547603d57 in ?? ()
   from /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1
  2    Thread 0x1555469ff700 (LWP 99017) "check_hip.exe" 0x00001555538f64a7 in ioctl () from /lib64/libc.so.6
  5    Thread 0x1555470b7700 (LWP 99033) "check_hip.exe" 0x000015554759fd04 in sem_post@@GLIBC_2.2.5 ()
   from /lib64/libpthread.so.0
* 6    AMDGPU Wave 1:2:1:1 (0,0,0)/0 "check_hip.exe"     0x000015554713050c in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) ()
   from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
… in vxxxxx (which may explain why this only appears in gqttq?)

[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > rocgdb --args ./check_hip.exe  -p 1 8 1
GNU gdb (rocm-rel-6.0-131) 13.2
...
(gdb) set amdgpu precise-memory
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe -p 1 8 1
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
New Thread 0x1554445ff700 (LWP 1669)
New Thread 0x155547087700 (LWP 1670)
Thread 0x1554445ff700 (LWP 1669) exited
Thread 6 "check_hip.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
mg5amcGpu::calculate_wavefunctions (ihel=<optimized out>, allmomenta=<optimized out>, allcouplings=<optimized out>,
    allMEs=<optimized out>, channelId=<optimized out>, allNumerators=<optimized out>, allDenominators=<optimized out>,
    jamp2_sv=<optimized out>) at CPPProcess.cc:328
328           vxxxxx<M_ACCESS, W_ACCESS>( momenta, 0., cHel[ihel][0], -1, w_fp[0], 0 );
(gdb) where
#0  mg5amcGpu::calculate_wavefunctions (ihel=<optimized out>, allmomenta=<optimized out>, allcouplings=<optimized out>,
    allMEs=<optimized out>, channelId=<optimized out>, allNumerators=<optimized out>, allDenominators=<optimized out>,
    jamp2_sv=<optimized out>) at CPPProcess.cc:328
#1  mg5amcGpu::sigmaKin (allmomenta=<optimized out>, allcouplings=<optimized out>, allrndhel=<optimized out>,
    allrndcol=<optimized out>, allMEs=<optimized out>, allChannelIds=<optimized out>, allNumerators=<optimized out>,
    allDenominators=<optimized out>, allselhel=<optimized out>, allselcol=<optimized out>) at CPPProcess.cc:1043
(gdb) info threads
  Id   Target Id                                        Frame
  1    Thread 0x1555471aea80 (LWP 1645) "check_hip.exe" 0x00001555475d5d57 in ?? ()
   from /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1
  2    Thread 0x1555469ff700 (LWP 1655) "check_hip.exe" 0x00001555538c84a7 in ioctl () from /lib64/libc.so.6
  5    Thread 0x155547087700 (LWP 1670) "check_hip.exe" 0x00001555538c84a7 in ioctl () from /lib64/libc.so.6
* 6    AMDGPU Wave 1:2:1:1 (0,0,0)/0 "check_hip.exe"    mg5amcGpu::calculate_wavefunctions (ihel=<optimized out>,
    allmomenta=<optimized out>, allcouplings=<optimized out>, allMEs=<optimized out>, channelId=<optimized out>,
    allNumerators=<optimized out>, allDenominators=<optimized out>, jamp2_sv=<optimized out>) at CPPProcess.cc:328
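With precise-memory enabled, the backtrace points at the vxxxxx call filling w_fp[0] at CPPProcess.cc:328, i.e. the first wavefunction computed from cHel[ihel][0]. Purely as an illustration of the failure mode (hypothetical sizes and names, not a diagnosis): if miscompiled index arithmetic lets a helicity or particle index run past its fixed array bound, the kernel touches unmapped device memory and the GPU reports a "Memory access fault" on an address like 0xfffd862e5000:

```cpp
// Illustrative bounds that an out-of-range index would violate; the real
// cHel table in CPPProcess.cc has one row per helicity combination and one
// column per external particle.
constexpr int nhelSketch = 32; // hypothetical number of helicity combinations
constexpr int nparSketch = 5;  // hypothetical number of external particles

inline bool helIndexInBounds( int ihel, int ipar )
{
  // the check that an out-of-bounds cHel[ihel][ipar] access would violate
  return ihel >= 0 && ihel < nhelSketch && ipar >= 0 && ipar < nparSketch;
}
```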
…d for debugging the crash madgraph5#806 in hipcc

Revert "[amd] in gq_ttq.mad cudacpp.mk, enable -ggdb... the issue seems to be in vxxxxx (which may explain why this only appears in gqttq?)"
This reverts commit 5cc62a6.

Revert "[amd] in gq_ttq.mad timermap.h, add some debug printouts for the crash madgraph5#806 - now valgrind gives no invalid read, but there is a 'Memory access fault'"
This reverts commit 5b8d92f.

Revert "[amd] in gq_ttq.mad MemoryBuffers.h, temporarely use c++ malloc instead of HIP pinned host malloc to debug madgraph5#806 - still crashes, will revert"
This reverts commit 007173a.

Revert "[amd] in gq_ttq.mad HiprandRandomNumberKernel.cc, add debug printouts (commented out) for the memory corruption madgraph5#806"
This reverts commit c7b3dc0.
@valassi valassi self-assigned this Sep 19, 2024
…adgraph5#806 for HIPCC by disabling hipcc optimizations (use -O0 instead of -O3)

The test now succeeds!
./check_hip.exe  -p 1 8 1
…adgraph5#806 for HIPCC by disabling hipcc -O3, but keep -O2 (better than -O0)

The test now still succeeds!
./check_hip.exe  -p 1 8 1
@valassi valassi changed the title Disable hipcc optimizations i.e. use -O0 instead of -O3 (work around for gq_ttq crash 806) Disable hipcc optimizations i.e. use -O2 instead of -O3 (work around for gq_ttq crash 806) Sep 19, 2024

valassi commented Sep 19, 2024

I checked that -O2 is also ok, so I moved to that instead of -O0; this will be better for performance.

But -O3 gives crashes, to be investigated in #806.


valassi commented Sep 19, 2024

Hi @oliviermattelaer can you please review?

The ONLY change is that OPTFLAGS=-O2 is used instead of -O3 and ONLY for hip on AMD GPUs.
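A sketch of what such a HIP-only optimization-level override could look like in the makefile (hypothetical fragment, not the actual cudacpp.mk diff; the variable names are assumptions):

```makefile
# Hypothetical sketch: lower the GPU optimization level only when the GPU
# compiler is hipcc (AMD), keeping -O3 for all other builds.
ifeq ($(findstring hipcc,$(GPUCC)),hipcc)
  GPUOPTFLAGS = -O2   # -O3 triggers the gq_ttq crash madgraph5#806 on AMD GPUs
else
  GPUOPTFLAGS = -O3
endif
```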

Plus all processes are regenerated, and my usual tests will appear tomorrow after I run them.

Thanks!

@valassi valassi changed the title Disable hipcc optimizations i.e. use -O2 instead of -O3 (work around for gq_ttq crash 806) Disable hipcc optimizations i.e. use -O2 instead of -O3 (work around for gq_ttq crash 806 on AMD GPUs at LUMI) Sep 19, 2024
…) - now they all succeed! gqttq crash madgraph5#806 has disappeared

(Note: performance on HIP does not seem to be significantly degraded with -O2 with respect to -O3, e.g. on ggttgg)

STARTED  AT Thu 19 Sep 2024 06:24:53 PM EEST
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean  -nocuda
ENDED(1) AT Thu 19 Sep 2024 07:15:36 PM EEST [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean  -nocuda
ENDED(2) AT Thu 19 Sep 2024 07:32:30 PM EEST [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean  -nocuda
ENDED(3) AT Thu 19 Sep 2024 07:41:44 PM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst  -nocuda
ENDED(4) AT Thu 19 Sep 2024 07:43:46 PM EEST [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common  -nocuda'
ENDED(5) AT Thu 19 Sep 2024 07:43:46 PM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common  -nocuda
ENDED(6) AT Thu 19 Sep 2024 07:45:46 PM EEST [Status=0]
./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean  -nocuda
ENDED(7) AT Thu 19 Sep 2024 08:17:24 PM EEST [Status=0]

No errors found in logs
…ds (madgraph5#806 fixed), all as expected (heft fail madgraph5#833, skip ggttggg madgraph5#933)

(Note: performance on HIP does not seem to be significantly degraded with -O2 with respect to -O3, e.g. on ggttgg)

STARTED  AT Thu 19 Sep 2024 11:37:44 PM EEST
(SM tests)
ENDED(1) AT Fri 20 Sep 2024 02:00:00 AM EEST [Status=0]
(BSM tests)
ENDED(1) AT Fri 20 Sep 2024 02:08:55 AM EEST [Status=0]

16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
Revert "[amd] rerun 30 tmad tests on LUMI against AMD GPUs - now gqttq succeeds (madgraph5#806 fixed), all as expected (heft fail madgraph5#833, skip ggttggg madgraph5#933)"
This reverts commit 0d7d4cd.

Revert "[amd] rerun 96 tput builds and tests on LUMI worker node (small-g 72h) - now they all succeed! gqttq crash madgraph5#806 has disappeared"
This reverts commit e41c7ff.
…he getCompiler() function

This gives for instance:
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > ./check_hip.exe  -p 1 8 1
Process = SIGMA_SM_GUX_TTXUX_HIP [hipcc 6.0.32831 (clang 17.0.0)] [inlineHel=0] [hardcodePARAM=0]

(Checked that all is ok when regenerating gq_ttq.mad/SubProcesses/P1_gux_ttxux)
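A getCompiler()-style tag can be assembled from compiler-provided macros; a minimal sketch under the assumption that hipcc exposes HIP_VERSION_MAJOR/MINOR/PATCH (from hip/hip_version.h) and, being clang-based, the usual __clang_* macros (illustrative code, not the actual madgraph4gpu implementation):

```cpp
#include <sstream>
#include <string>

// Build a compiler tag like "hipcc 6.0.32831 (clang 17.0.0)" from
// preprocessor defines; falls back to plain clang/gcc identification.
inline std::string getCompilerSketch()
{
  std::ostringstream out;
#if defined( HIP_VERSION_MAJOR )
  out << "hipcc " << HIP_VERSION_MAJOR << "." << HIP_VERSION_MINOR << "." << HIP_VERSION_PATCH;
#if defined( __clang__ )
  out << " (clang " << __clang_major__ << "." << __clang_minor__ << "." << __clang_patchlevel__ << ")";
#endif
#elif defined( __clang__ )
  out << "clang " << __clang_major__ << "." << __clang_minor__ << "." << __clang_patchlevel__;
#elif defined( __GNUC__ )
  out << "gcc " << __GNUC__ << "." << __GNUC_MINOR__ << "." << __GNUC_PATCHLEVEL__;
#else
  out << "unknown compiler";
#endif
  return out.str();
}
```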

valassi commented Sep 20, 2024

Hi @oliviermattelaer this is now ready to be merged.

I completed my tests and all looks good.

En passant, I also added the getCompiler() tag that decodes the HIP version from compiler defines.

Ready to go for me.


valassi commented Sep 20, 2024

PS: About the fact that there is no performance degradation, compare the two throughput logs:

git diff --no-ext-diff e41c7ff1a 0c947d1d5 tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
...
-DATE: 2024-09-19_19:08:34
+DATE: 2024-09-18_17:15:56
 
 On uan01 [CPU: AMD EPYC 7A53 64-Core Processor] [GPU: AMD INSTINCT MI200]:
 =========================================================================
@@ -30,36 +30,36 @@ INFO: The following Floating Point Exceptions will cause SIGFPE program aborts:
 Process                     = SIGMA_SM_GG_TTXGG_HIP [clang 17.0.0] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = HIP:DBL+CXS:HIRDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 1.208699e+05                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 1.258349e+05                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 1.258584e+05                 )  sec^-1
+EvtsPerSec[Rmb+ME]     (23) = ( 1.204596e+05                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 1.259417e+05                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 1.259568e+05                 )  sec^-1
 MeanMatrixElemValue         = ( 3.804675e-02 +- 2.047289e-02 )  GeV^-4
-TOTAL       :     0.607053 sec
+TOTAL       :     0.715128 sec
 INFO: No Floating Point Exceptions have been reported
-     1,491,254,700      cycles:u                         #    2.509 GHz                      (75.96%)
-         2,542,523      stalled-cycles-frontend:u        #    0.17% frontend cycles idle     (75.44%)
-         5,449,723      stalled-cycles-backend:u         #    0.37% backend cycles idle      (75.46%)
-     1,937,303,891      instructions:u                   #    1.30  insn per cycle         
-                                                  #    0.00  stalled cycles per insn  (74.40%)
-       0.660947918 seconds time elapsed
+     1,635,437,665      cycles:u                         #    2.817 GHz                      (74.94%)
+         2,495,684      stalled-cycles-frontend:u        #    0.15% frontend cycles idle     (76.17%)
+         7,104,110      stalled-cycles-backend:u         #    0.43% backend cycles idle      (76.53%)
+     2,096,604,985      instructions:u                   #    1.28  insn per cycle         
+                                                  #    0.00  stalled cycles per insn  (74.05%)
+       0.853269091 seconds time elapsed

@oliviermattelaer

Perfect, thanks

Olivier


valassi commented Sep 25, 2024

Thanks @oliviermattelaer !
Very good, merging

@valassi valassi merged commit f299290 into madgraph5:master Sep 25, 2024
169 checks passed

Development

Successfully merging this pull request may close these issues.

gq_ttq HIP tests crash on AMD GPUs at LUMI (only in -O3 builds, while -O2 builds succeed)