Disable hipcc optimizations, i.e. use -O2 instead of -O3 (workaround for gq_ttq crash #806 on AMD GPUs at LUMI) #1007
Conversation
… (commented out) for the memory corruption madgraph5#806

This shows an uninitialised value deep inside hiprand:

[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > valgrind ./check_hip.exe -p 1 8 1
==105499== Memcheck, a memory error detector
==105499== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==105499== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==105499== Command: ./check_hip.exe -p 1 8 1
==105499==
==105499== Warning: set address range perms: large range [0x59c90000, 0x159e91000) (noaccess)
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
Get random numbers from Hiprand
==105499== Conditional jump or move depends on uninitialised value(s)
==105499==    at 0x1253777C: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12537F40: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12540782: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x125629DD: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x4B825EB: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B88342: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B822FF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B55120: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B2B590: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D84AF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D87C4: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4A00FA2: hipMemcpy (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==
==105499== Conditional jump or move depends on uninitialised value(s)
==105499==    at 0x12537B82: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12537F40: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12540782: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x125629DD: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x4B825EB: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B88342: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B822FF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B55120: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B2B590: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D84AF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D87C4: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4A00FA2: hipMemcpy (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==
Got random numbers from Hiprand
==105499== Invalid read of size 8
==105499==    at 0x21F741: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x21D0D1: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x215CBB: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==  Address 0x1c00000043 is not stack'd, malloc'd or (recently) free'd
==105499==
==105499==
==105499== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==105499==  Access not within mapped region at address 0x1C00000043
==105499==    at 0x21F741: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x21D0D1: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x215CBB: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==  If you believe this happened as a result of a stack
==105499==  overflow in your program's main thread (unlikely but
==105499==  possible), you can try to increase the size of the
==105499==  main thread stack using the --main-stacksize= flag.
==105499==  The main thread stack size used in this run was 16777216.

Unfortunately, however, the --common mode also crashes (and gives the same uninitialised-value warning, whether related or not).
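Incidentally, since the 'conditional jump' warnings originate inside the ROCm runtime (libhsa-runtime64) rather than in our own code, they can be silenced while hunting the real corruption with a standard valgrind suppression; a minimal sketch (the suppression name is arbitrary, and the object path pattern is an assumption based on the traces above):

```
{
   rocm_hsa_uninitialised_cond
   Memcheck:Cond
   obj:/opt/rocm-*/lib/libhsa-runtime64.so*
}
```

Saved e.g. as rocm.supp, this would be used as: valgrind --suppressions=rocm.supp ./check_hip.exe -p 1 8 1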
…ad of HIP pinned host malloc to debug madgraph5#806 - still crashes, will revert

This makes the valgrind 'conditional jump on uninitialised variable' warning disappear, but the crash from invalid memory reads remains:

[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > valgrind --track-origins=yes ./check_hip.exe --common -p 1 8 1
==10800== Memcheck, a memory error detector
==10800== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==10800== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==10800== Command: ./check_hip.exe --common -p 1 8 1
==10800==
==10800== Warning: set address range perms: large range [0x59c90000, 0x159e91000) (noaccess)
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
==10800== Invalid read of size 8
==10800==    at 0x21EF01: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x21CA21: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x2158A5: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==  Address 0x140000003b is not stack'd, malloc'd or (recently) free'd
==10800==
==10800==
==10800== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==10800==  Access not within mapped region at address 0x140000003B
==10800==    at 0x21EF01: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x21CA21: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x2158A5: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==  If you believe this happened as a result of a stack
==10800==  overflow in your program's main thread (unlikely but
==10800==  possible), you can try to increase the size of the
==10800==  main thread stack using the --main-stacksize= flag.
==10800==  The main thread stack size used in this run was 16777216.
==10800==
==10800== HEAP SUMMARY:
==10800==     in use at exit: 4,784,824 bytes in 17,735 blocks
==10800==   total heap usage: 306,364 allocs, 288,629 frees, 180,986,538 bytes allocated
==10800==
==10800== LEAK SUMMARY:
==10800==    definitely lost: 256 bytes in 5 blocks
==10800==    indirectly lost: 3,522 bytes in 64 blocks
==10800==      possibly lost: 9,544 bytes in 80 blocks
==10800==    still reachable: 4,771,502 bytes in 17,586 blocks
==10800==                       of which reachable via heuristic:
==10800==                         multipleinheritance: 384 bytes in 4 blocks
==10800==         suppressed: 0 bytes in 0 blocks
==10800== Rerun with --leak-check=full to see details of leaked memory
==10800==
==10800== For lists of detected and suppressed errors, rerun with: -s
==10800== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault
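The swap that this commit describes boils down to replacing the pinned-host allocation with a plain heap allocation. A minimal sketch of the two variants follows (buffer names and the event count are illustrative, not the actual MemoryBuffers.h code; hipHostMalloc/hipHostFree are the real HIP runtime API):

```cpp
#include <cstdio>
#include <cstdlib>
#include <hip/hip_runtime.h>

int main()
{
  const size_t nevt = 16384; // illustrative number of events
  // (a) Pinned (page-locked) host memory, allocated through the HIP runtime:
  //     faster host<->device copies, but goes through the HIP/HSA allocator.
  double* hstBuf1 = nullptr;
  if( hipHostMalloc( (void**)&hstBuf1, nevt * sizeof( double ), hipHostMallocDefault ) != hipSuccess )
  {
    std::fprintf( stderr, "hipHostMalloc failed\n" );
    return 1;
  }
  // (b) Plain C++ heap memory: the temporary debugging variant described above.
  double* hstBuf2 = (double*)std::malloc( nevt * sizeof( double ) );
  // ... fill and use the buffers ...
  std::free( hstBuf2 );
  hipHostFree( hstBuf1 );
  return 0;
}
```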
…madgraph5#806 - now valgrind gives no invalid read, but there is a 'Memory access fault'

Using valgrind:

[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > valgrind --track-origins=yes ./check_hip.exe --common -p 1 8 1
==80385== Memcheck, a memory error detector
==80385== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==80385== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==80385== Command: ./check_hip.exe --common -p 1 8 1
==80385==
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() exit
==80385== Warning: set address range perms: large range [0x59c90000, 0x159e91000) (noaccess)
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '00 GpuInit'
DEBUG: TimerMap::stop() exit
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
Memory access fault by GPU node-4 (Agent handle: 0x1417d4a0) on address 0xfffd862e5000. Reason: Unknown.
==80385==
==80385== Process terminating with default action of signal 6 (SIGABRT): dumping core
==80385==    at 0x63D3D2B: raise (in /lib64/libc-2.31.so)
==80385==    by 0x63D53E4: abort (in /lib64/libc-2.31.so)
==80385==    by 0x12580D1B: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==80385==    by 0x1257ABC8: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==80385==    by 0x1252C9E6: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==80385==    by 0x127C66E9: start_thread (in /lib64/libpthread-2.31.so)
==80385==    by 0x64A150E: clone (in /lib64/libc-2.31.so)
==80385==
==80385== HEAP SUMMARY:
==80385==     in use at exit: 4,790,652 bytes in 17,774 blocks
==80385==   total heap usage: 306,424 allocs, 288,650 frees, 180,987,695 bytes allocated
==80385==
==80385== LEAK SUMMARY:
==80385==    definitely lost: 184 bytes in 4 blocks
==80385==    indirectly lost: 2,658 bytes in 52 blocks
==80385==      possibly lost: 10,768 bytes in 86 blocks
==80385==    still reachable: 4,777,042 bytes in 17,632 blocks
==80385==                       of which reachable via heuristic:
==80385==                         multipleinheritance: 496 bytes in 5 blocks
==80385==         suppressed: 0 bytes in 0 blocks
==80385== Rerun with --leak-check=full to see details of leaked memory
==80385==
==80385== For lists of detected and suppressed errors, rerun with: -s
==80385== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Aborted

Using rocgdb:

[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > rocgdb --args ./check_hip.exe -p 1 8 1
GNU gdb (rocm-rel-6.0-131) 13.2
...
(gdb) run
Starting program: /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe -p 1 8 1
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
New Thread 0x1554445ff700 (LWP 94651)
New Thread 0x1555470b7700 (LWP 94652)
Thread 0x1554445ff700 (LWP 94651) exited
Warning: precise memory violation signal reporting is not enabled, reported location may not be accurate. See "show amdgpu precise-memory".
Thread 6 "check_hip.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x0000155547130598 in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) () from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
(gdb) where
#0  0x0000155547130598 in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) () from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
(gdb) l
1  ../sysdeps/x86_64/crtn.S: No such file or directory.
...
(gdb) set amdgpu precise-memory
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe -p 1 8 1
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
New Thread 0x1554445ff700 (LWP 99032)
New Thread 0x1555470b7700 (LWP 99033)
Thread 0x1554445ff700 (LWP 99032) exited
Thread 6 "check_hip.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x000015554713050c in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) () from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
...
(gdb) info threads
  Id   Target Id                                      Frame
  1    Thread 0x1555471dda80 (LWP 98983) "check_hip.exe" 0x0000155547603d57 in ?? () from /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1
  2    Thread 0x1555469ff700 (LWP 99017) "check_hip.exe" 0x00001555538f64a7 in ioctl () from /lib64/libc.so.6
  5    Thread 0x1555470b7700 (LWP 99033) "check_hip.exe" 0x000015554759fd04 in sem_post@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
* 6    AMDGPU Wave 1:2:1:1 (0,0,0)/0 "check_hip.exe" 0x000015554713050c in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) () from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
… in vxxxxx (which may explain why this only appears in gqttq?)
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > rocgdb --args ./check_hip.exe -p 1 8 1
GNU gdb (rocm-rel-6.0-131) 13.2
...
(gdb) set amdgpu precise-memory
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe -p 1 8 1
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
New Thread 0x1554445ff700 (LWP 1669)
New Thread 0x155547087700 (LWP 1670)
Thread 0x1554445ff700 (LWP 1669) exited
Thread 6 "check_hip.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
mg5amcGpu::calculate_wavefunctions (ihel=<optimized out>, allmomenta=<optimized out>, allcouplings=<optimized out>,
allMEs=<optimized out>, channelId=<optimized out>, allNumerators=<optimized out>, allDenominators=<optimized out>,
jamp2_sv=<optimized out>) at CPPProcess.cc:328
328 vxxxxx<M_ACCESS, W_ACCESS>( momenta, 0., cHel[ihel][0], -1, w_fp[0], 0 );
(gdb) where
#0  mg5amcGpu::calculate_wavefunctions (ihel=<optimized out>, allmomenta=<optimized out>, allcouplings=<optimized out>,
allMEs=<optimized out>, channelId=<optimized out>, allNumerators=<optimized out>, allDenominators=<optimized out>,
jamp2_sv=<optimized out>) at CPPProcess.cc:328
#1  mg5amcGpu::sigmaKin (allmomenta=<optimized out>, allcouplings=<optimized out>, allrndhel=<optimized out>,
allrndcol=<optimized out>, allMEs=<optimized out>, allChannelIds=<optimized out>, allNumerators=<optimized out>,
allDenominators=<optimized out>, allselhel=<optimized out>, allselcol=<optimized out>) at CPPProcess.cc:1043
(gdb) info threads
Id Target Id Frame
1 Thread 0x1555471aea80 (LWP 1645) "check_hip.exe" 0x00001555475d5d57 in ?? ()
from /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1
2 Thread 0x1555469ff700 (LWP 1655) "check_hip.exe" 0x00001555538c84a7 in ioctl () from /lib64/libc.so.6
5 Thread 0x155547087700 (LWP 1670) "check_hip.exe" 0x00001555538c84a7 in ioctl () from /lib64/libc.so.6
* 6 AMDGPU Wave 1:2:1:1 (0,0,0)/0 "check_hip.exe" mg5amcGpu::calculate_wavefunctions (ihel=<optimized out>,
allmomenta=<optimized out>, allcouplings=<optimized out>, allMEs=<optimized out>, channelId=<optimized out>,
allNumerators=<optimized out>, allDenominators=<optimized out>, jamp2_sv=<optimized out>) at CPPProcess.cc:328
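Since `set amdgpu precise-memory` must be active before the kernel is launched, a convenient non-interactive way to reproduce the session above is to pass the commands on the rocgdb command line (standard gdb `-ex` syntax, which rocgdb inherits; the exact precise-memory spelling is as shown in the session above):

```
rocgdb -ex 'set amdgpu precise-memory on' -ex run -ex where --args ./check_hip.exe -p 1 8 1
```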
…d for debugging the crash madgraph5#806 in hipcc

Revert "[amd] in gq_ttq.mad cudacpp.mk, enable -ggdb... the issue seems to be in vxxxxx (which may explain why this only appears in gqttq?)"
This reverts commit 5cc62a6.

Revert "[amd] in gq_ttq.mad timermap.h, add some debug printouts for the crash madgraph5#806 - now valgrind gives no invalid read, but there is a 'Memory access fault'"
This reverts commit 5b8d92f.

Revert "[amd] in gq_ttq.mad MemoryBuffers.h, temporarely use c++ malloc instead of HIP pinned host malloc to debug madgraph5#806 - still crashes, will revert"
This reverts commit 007173a.

Revert "[amd] in gq_ttq.mad HiprandRandomNumberKernel.cc, add debug printouts (commented out) for the memory corruption madgraph5#806"
This reverts commit c7b3dc0.
…adgraph5#806 for HIPCC by disabling hipcc optimizations (use -O0 instead of -O3)

The test now succeeds!
./check_hip.exe -p 1 8 1
…adgraph5#806 for HIPCC by disabling hipcc -O3, but keep -O2 (better than -O0)

The test still succeeds!
./check_hip.exe -p 1 8 1
I actually checked that -O2 is still ok, so I moved to that instead of -O0; this will be better for performance. -O3, however, gives crashes, to be investigated in #806.
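For reference, a minimal sketch of what such a HIP-only optimization-level switch can look like in a GNU makefile (the variable names here are illustrative assumptions, not necessarily those of cudacpp.mk):

```make
# Hypothetical sketch: lower the optimization level only for the HIP/AMD build,
# as a workaround for the -O3 miscompilation (#806); other builds keep -O3.
ifeq ($(findstring hipcc,$(GPUCC)),hipcc)
  GPUOPTFLAGS = -O2  # workaround for gq_ttq crash #806 on AMD GPUs
else
  GPUOPTFLAGS = -O3
endif
GPUFLAGS += $(GPUOPTFLAGS)
```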
…ead of -O3 (workaround for gq_ttq crash madgraph5#806)
Hi @oliviermattelaer, can you please review? The ONLY change is that OPTFLAGS=-O2 is used instead of -O3, and ONLY for HIP on AMD GPUs. In addition, all processes are regenerated, and my usual tests will appear tomorrow after I run them. Thanks!
…) - now they all succeed! gqttq crash madgraph5#806 has disappeared

(Note: performance on HIP does not seem to be significantly degraded with -O2 with respect to -O3, e.g. on ggttgg)

STARTED AT Thu 19 Sep 2024 06:24:53 PM EEST
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean -nocuda
ENDED(1) AT Thu 19 Sep 2024 07:15:36 PM EEST [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean -nocuda
ENDED(2) AT Thu 19 Sep 2024 07:32:30 PM EEST [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean -nocuda
ENDED(3) AT Thu 19 Sep 2024 07:41:44 PM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst -nocuda
ENDED(4) AT Thu 19 Sep 2024 07:43:46 PM EEST [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common -nocuda'
ENDED(5) AT Thu 19 Sep 2024 07:43:46 PM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common -nocuda
ENDED(6) AT Thu 19 Sep 2024 07:45:46 PM EEST [Status=0]
./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean -nocuda
ENDED(7) AT Thu 19 Sep 2024 08:17:24 PM EEST [Status=0]

No errors found in logs
…ds (madgraph5#806 fixed), all as expected (heft fail madgraph5#833, skip ggttggg madgraph5#933)

(Note: performance on HIP does not seem to be significantly degraded with -O2 with respect to -O3, e.g. on ggttgg)

STARTED AT Thu 19 Sep 2024 11:37:44 PM EEST
(SM tests)
ENDED(1) AT Fri 20 Sep 2024 02:00:00 AM EEST [Status=0]
(BSM tests)
ENDED(1) AT Fri 20 Sep 2024 02:08:55 AM EEST [Status=0]

16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
Revert "[amd] rerun 30 tmad tests on LUMI against AMD GPUs - now gqttq succeeds (madgraph5#806 fixed), all as expected (heft fail madgraph5#833, skip ggttggg madgraph5#933)" This reverts commit 0d7d4cd. Revert "[amd] rerun 96 tput builds and tests on LUMI worker node (small-g 72h) - now they all succeed! gqttq crash madgraph5#806 has disappeared" This reverts commit e41c7ff.
…he getCompiler() function

This gives for instance:

[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > ./check_hip.exe -p 1 8 1
Process = SIGMA_SM_GUX_TTXUX_HIP [hipcc 6.0.32831 (clang 17.0.0)] [inlineHel=0] [hardcodePARAM=0]

(Checked that all is ok when regenerating gq_ttq.mad/SubProcesses/P1_gux_ttxux)
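A tag like "hipcc 6.0.32831 (clang 17.0.0)" can be assembled purely from preprocessor defines; the following is a minimal sketch of the idea, not the repository's actual getCompiler() code (HIP_VERSION_MAJOR/MINOR/PATCH come from the HIP headers, __clang_*__ from the compiler):

```cpp
#include <sstream>
#include <string>
#if defined( __HIPCC__ )
#include <hip/hip_version.h> // defines HIP_VERSION_MAJOR/MINOR/PATCH
#endif

// Sketch only: decode the HIP and clang versions from predefined macros.
inline std::string getCompilerTag()
{
#if defined( __HIPCC__ )
  std::ostringstream out;
  out << "hipcc " << HIP_VERSION_MAJOR << '.' << HIP_VERSION_MINOR << '.' << HIP_VERSION_PATCH
      << " (clang " << __clang_major__ << '.' << __clang_minor__ << '.' << __clang_patchlevel__ << ")";
  return out.str();
#else
  return "unknown compiler";
#endif
}
```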
Hi @oliviermattelaer, this is now ready to be merged. I completed my tests and all looks good.
In passing, I also added the getCompiler() tag that decodes the HIP version from preprocessor defines. Ready to go for me.
PS: About the fact that there is no performance degradation
Perfect, thanks Olivier
Thanks @oliviermattelaer!