Conversation


@valassi valassi commented Sep 19, 2024

Disable hipcc optimizations i.e. use -O0 instead of -O3 (work around for gq_ttq crash #806)

… (commented out) for the memory corruption madgraph5#806

This shows a valgrind warning about an uninitialised value deep inside the HIP runtime, triggered while getting random numbers from hiprand

[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > valgrind ./check_hip.exe -p 1 8 1
==105499== Memcheck, a memory error detector
==105499== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==105499== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==105499== Command: ./check_hip.exe -p 1 8 1
==105499==
==105499== Warning: set address range perms: large range [0x59c90000, 0x159e91000) (noaccess)
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
Get random numbers from Hiprand
==105499== Conditional jump or move depends on uninitialised value(s)
==105499==    at 0x1253777C: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12537F40: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12540782: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x125629DD: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x4B825EB: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B88342: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B822FF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B55120: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B2B590: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D84AF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D87C4: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4A00FA2: hipMemcpy (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==
==105499== Conditional jump or move depends on uninitialised value(s)
==105499==    at 0x12537B82: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12537F40: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12540782: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x125629DD: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x4B825EB: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B88342: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B822FF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B55120: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B2B590: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D84AF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D87C4: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4A00FA2: hipMemcpy (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==
Got random numbers from Hiprand
==105499== Invalid read of size 8
==105499==    at 0x21F741: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x21D0D1: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x215CBB: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==  Address 0x1c00000043 is not stack'd, malloc'd or (recently) free'd
==105499==
==105499==
==105499== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==105499==  Access not within mapped region at address 0x1C00000043
==105499==    at 0x21F741: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x21D0D1: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x215CBB: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==  If you believe this happened as a result of a stack
==105499==  overflow in your program's main thread (unlikely but
==105499==  possible), you can try to increase the size of the
==105499==  main thread stack using the --main-stacksize= flag.
==105499==  The main thread stack size used in this run was 16777216.
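The invalid read at the nonsense address 0x1c00000043 occurs inside std::map<std::string,float>::operator[] called from TimerMap::start. A wild address like that usually means the map is a victim rather than the culprit: its heap-allocated tree nodes were trampled by an earlier out-of-bounds write elsewhere (here suspected in the GPU code path), so the next lookup chases a garbage pointer. A minimal sketch of the access pattern that ends up crashing (illustrative names, not the actual madgraph4gpu code):

```cpp
#include <cstddef>
#include <map>
#include <string>

// TimerMap::start ultimately does a lookup-insert like this. std::map keeps
// its entries in heap-allocated red-black tree nodes linked by pointers, and
// operator[] dereferences those pointers while searching for the key. If a
// stray out-of-bounds write elsewhere overwrites a node, a later operator[]
// call follows a corrupted pointer (e.g. 0x1c00000043) and valgrind reports
// "Invalid read of size 8" followed by SIGSEGV.
inline std::size_t demoTimerMapLookup()
{
  std::map<std::string, float> partitionTimers; // same key/value types as in the trace
  partitionTimers["00 GpuInit"] += 0.f;         // lookup-insert via operator[]
  return partitionTimers.size();
}
```

In a healthy heap this simply returns 1; the point is that the crash site (operator[]) is typically far away from the code that actually corrupted memory.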

Unfortunately, however, the --common mode also crashes (and reports the same uninitialised-value problem, whether related or not)
…ad of HIP pinned host malloc to debug madgraph5#806 - still crashes, will revert

This makes the valgrind 'conditional jump on uninitialised value' warnings disappear, but the crash from the invalid memory read still remains

[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > valgrind --track-origins=yes ./check_hip.exe --common -p 1 8 1
==10800== Memcheck, a memory error detector
==10800== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==10800== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==10800== Command: ./check_hip.exe --common -p 1 8 1
==10800==
==10800== Warning: set address range perms: large range [0x59c90000, 0x159e91000) (noaccess)
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
==10800== Invalid read of size 8
==10800==    at 0x21EF01: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x21CA21: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x2158A5: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==  Address 0x140000003b is not stack'd, malloc'd or (recently) free'd
==10800==
==10800==
==10800== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==10800==  Access not within mapped region at address 0x140000003B
==10800==    at 0x21EF01: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x21CA21: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x2158A5: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==  If you believe this happened as a result of a stack
==10800==  overflow in your program's main thread (unlikely but
==10800==  possible), you can try to increase the size of the
==10800==  main thread stack using the --main-stacksize= flag.
==10800==  The main thread stack size used in this run was 16777216.
==10800==
==10800== HEAP SUMMARY:
==10800==     in use at exit: 4,784,824 bytes in 17,735 blocks
==10800==   total heap usage: 306,364 allocs, 288,629 frees, 180,986,538 bytes allocated
==10800==
==10800== LEAK SUMMARY:
==10800==    definitely lost: 256 bytes in 5 blocks
==10800==    indirectly lost: 3,522 bytes in 64 blocks
==10800==      possibly lost: 9,544 bytes in 80 blocks
==10800==    still reachable: 4,771,502 bytes in 17,586 blocks
==10800==                       of which reachable via heuristic:
==10800==                         multipleinheritance: 384 bytes in 4 blocks
==10800==         suppressed: 0 bytes in 0 blocks
==10800== Rerun with --leak-check=full to see details of leaked memory
==10800==
==10800== For lists of detected and suppressed errors, rerun with: -s
==10800== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault
…madgraph5#806 - now valgrind gives no invalid read, but there is a 'Memory access fault'
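The 'DEBUG: TimerMap::stop() enter/exit' lines in the log below come from printouts added to timermap.h. A sketch of that kind of instrumentation (hypothetical code, not the actual commit), which brackets the map accesses so the crash can be localised relative to them:

```cpp
#include <iostream>
#include <map>
#include <string>

// Sketch of a TimerMap::stop() with bracketing debug printouts: if the
// process dies between "enter" and "exit", the corruption hit during the
// map retrieval; if the printouts complete, the fault is elsewhere.
struct TimerMapSketch
{
  std::map<std::string, float> partitionTimers;
  std::string activeKey;
  void stop()
  {
    std::cout << "DEBUG: TimerMap::stop() enter" << std::endl;
    if( !activeKey.empty() )
    {
      std::cout << "DEBUG: TimerMap::stop() retrieve '" << activeKey << "'" << std::endl;
      partitionTimers[activeKey] += 0.f; // real code would add the elapsed time here
      activeKey.clear();
    }
    std::cout << "DEBUG: TimerMap::stop() exit" << std::endl;
  }
};
```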

Using valgrind
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > valgrind --track-origins=yes ./check_hip.exe --common -p 1 8 1
==80385== Memcheck, a memory error detector
==80385== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==80385== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==80385== Command: ./check_hip.exe --common -p 1 8 1
==80385==
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() exit
==80385== Warning: set address range perms: large range [0x59c90000, 0x159e91000) (noaccess)
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '00 GpuInit'
DEBUG: TimerMap::stop() exit
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
Memory access fault by GPU node-4 (Agent handle: 0x1417d4a0) on address 0xfffd862e5000. Reason: Unknown.
==80385==
==80385== Process terminating with default action of signal 6 (SIGABRT): dumping core
==80385==    at 0x63D3D2B: raise (in /lib64/libc-2.31.so)
==80385==    by 0x63D53E4: abort (in /lib64/libc-2.31.so)
==80385==    by 0x12580D1B: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==80385==    by 0x1257ABC8: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==80385==    by 0x1252C9E6: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==80385==    by 0x127C66E9: start_thread (in /lib64/libpthread-2.31.so)
==80385==    by 0x64A150E: clone (in /lib64/libc-2.31.so)
==80385==
==80385== HEAP SUMMARY:
==80385==     in use at exit: 4,790,652 bytes in 17,774 blocks
==80385==   total heap usage: 306,424 allocs, 288,650 frees, 180,987,695 bytes allocated
==80385==
==80385== LEAK SUMMARY:
==80385==    definitely lost: 184 bytes in 4 blocks
==80385==    indirectly lost: 2,658 bytes in 52 blocks
==80385==      possibly lost: 10,768 bytes in 86 blocks
==80385==    still reachable: 4,777,042 bytes in 17,632 blocks
==80385==                       of which reachable via heuristic:
==80385==                         multipleinheritance: 496 bytes in 5 blocks
==80385==         suppressed: 0 bytes in 0 blocks
==80385== Rerun with --leak-check=full to see details of leaked memory
==80385==
==80385== For lists of detected and suppressed errors, rerun with: -s
==80385== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Aborted

Using rocgdb
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > rocgdb --args ./check_hip.exe  -p 1 8 1
GNU gdb (rocm-rel-6.0-131) 13.2
...
(gdb) run
Starting program: /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe -p 1 8 1
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
New Thread 0x1554445ff700 (LWP 94651)
New Thread 0x1555470b7700 (LWP 94652)
Thread 0x1554445ff700 (LWP 94651) exited
Warning: precise memory violation signal reporting is not enabled, reported
location may not be accurate.  See "show amdgpu precise-memory".

Thread 6 "check_hip.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x0000155547130598 in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) ()
   from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
(gdb) where
#0  0x0000155547130598 in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) ()
   from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
(gdb) l
1       ../sysdeps/x86_64/crtn.S: No such file or directory.
...
(gdb) set amdgpu precise-memory
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe -p 1 8 1
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
New Thread 0x1554445ff700 (LWP 99032)
New Thread 0x1555470b7700 (LWP 99033)
Thread 0x1554445ff700 (LWP 99032) exited
Thread 6 "check_hip.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x000015554713050c in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) ()
   from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
...
(gdb) info threads
  Id   Target Id                                         Frame
  1    Thread 0x1555471dda80 (LWP 98983) "check_hip.exe" 0x0000155547603d57 in ?? ()
   from /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1
  2    Thread 0x1555469ff700 (LWP 99017) "check_hip.exe" 0x00001555538f64a7 in ioctl () from /lib64/libc.so.6
  5    Thread 0x1555470b7700 (LWP 99033) "check_hip.exe" 0x000015554759fd04 in sem_post@@GLIBC_2.2.5 ()
   from /lib64/libpthread.so.0
* 6    AMDGPU Wave 1:2:1:1 (0,0,0)/0 "check_hip.exe"     0x000015554713050c in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) ()
   from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
… in vxxxxx (which may explain why this only appears in gqttq?)

[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > rocgdb --args ./check_hip.exe  -p 1 8 1
GNU gdb (rocm-rel-6.0-131) 13.2
...
(gdb) set amdgpu precise-memory
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe -p 1 8 1
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
New Thread 0x1554445ff700 (LWP 1669)
New Thread 0x155547087700 (LWP 1670)
Thread 0x1554445ff700 (LWP 1669) exited
Thread 6 "check_hip.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
mg5amcGpu::calculate_wavefunctions (ihel=<optimized out>, allmomenta=<optimized out>, allcouplings=<optimized out>,
    allMEs=<optimized out>, channelId=<optimized out>, allNumerators=<optimized out>, allDenominators=<optimized out>,
    jamp2_sv=<optimized out>) at CPPProcess.cc:328
328           vxxxxx<M_ACCESS, W_ACCESS>( momenta, 0., cHel[ihel][0], -1, w_fp[0], 0 );
(gdb) where
#0  mg5amcGpu::calculate_wavefunctions (ihel=<optimized out>, allmomenta=<optimized out>, allcouplings=<optimized out>,
    allMEs=<optimized out>, channelId=<optimized out>, allNumerators=<optimized out>, allDenominators=<optimized out>,
    jamp2_sv=<optimized out>) at CPPProcess.cc:328
#1  mg5amcGpu::sigmaKin (allmomenta=<optimized out>, allcouplings=<optimized out>, allrndhel=<optimized out>,
    allrndcol=<optimized out>, allMEs=<optimized out>, allChannelIds=<optimized out>, allNumerators=<optimized out>,
    allDenominators=<optimized out>, allselhel=<optimized out>, allselcol=<optimized out>) at CPPProcess.cc:1043
(gdb) info threads
  Id   Target Id                                        Frame
  1    Thread 0x1555471aea80 (LWP 1645) "check_hip.exe" 0x00001555475d5d57 in ?? ()
   from /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1
  2    Thread 0x1555469ff700 (LWP 1655) "check_hip.exe" 0x00001555538c84a7 in ioctl () from /lib64/libc.so.6
  5    Thread 0x155547087700 (LWP 1670) "check_hip.exe" 0x00001555538c84a7 in ioctl () from /lib64/libc.so.6
* 6    AMDGPU Wave 1:2:1:1 (0,0,0)/0 "check_hip.exe"    mg5amcGpu::calculate_wavefunctions (ihel=<optimized out>,
    allmomenta=<optimized out>, allcouplings=<optimized out>, allMEs=<optimized out>, channelId=<optimized out>,
    allNumerators=<optimized out>, allDenominators=<optimized out>, jamp2_sv=<optimized out>) at CPPProcess.cc:328
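With precise-memory enabled, the backtrace points at the vxxxxx call filling w_fp[0] at CPPProcess.cc:328, i.e. the first wavefunction computed from cHel[ihel][0]. Purely as an illustration of the failure mode (hypothetical sizes and names, not a diagnosis): if miscompiled index arithmetic lets a helicity or particle index run past its fixed array bound, the kernel touches unmapped device memory and the GPU reports a "Memory access fault" on an address like 0xfffd862e5000:

```cpp
// Illustrative bounds that an out-of-range index would violate; the real
// cHel table in CPPProcess.cc has one row per helicity combination and one
// column per external particle.
constexpr int nhelSketch = 32; // hypothetical number of helicity combinations
constexpr int nparSketch = 5;  // hypothetical number of external particles

inline bool helIndexInBounds( int ihel, int ipar )
{
  // the check that an out-of-bounds cHel[ihel][ipar] access would violate
  return ihel >= 0 && ihel < nhelSketch && ipar >= 0 && ipar < nparSketch;
}
```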
…d for debugging the crash madgraph5#806 in hipcc

Revert "[amd] in gq_ttq.mad cudacpp.mk, enable -ggdb... the issue seems to be in vxxxxx (which may explain why this only appears in gqttq?)"
This reverts commit 5cc62a6.

Revert "[amd] in gq_ttq.mad timermap.h, add some debug printouts for the crash madgraph5#806 - now valgrind gives no invalid read, but there is a 'Memory access fault'"
This reverts commit 5b8d92f.

Revert "[amd] in gq_ttq.mad MemoryBuffers.h, temporarely use c++ malloc instead of HIP pinned host malloc to debug madgraph5#806 - still crashes, will revert"
This reverts commit 007173a.

Revert "[amd] in gq_ttq.mad HiprandRandomNumberKernel.cc, add debug printouts (commented out) for the memory corruption madgraph5#806"
This reverts commit c7b3dc0.
@valassi valassi self-assigned this Sep 19, 2024
…adgraph5#806 for HIPCC by disabling hipcc optimizations (use -O0 instead of -O3)

The test now succeeds!
./check_hip.exe  -p 1 8 1
…adgraph5#806 for HIPCC by disabling hipcc -O3, but keep -O2 (better than -O0)

The test now still succeeds!
./check_hip.exe  -p 1 8 1
@valassi valassi changed the title Disable hipcc optimizations i.e. use -O0 instead of -O3 (work around for gq_ttq crash 806) Disable hipcc optimizations i.e. use -O2 instead of -O3 (work around for gq_ttq crash 806) Sep 19, 2024

valassi commented Sep 19, 2024

I checked that -O2 is also ok, so I moved to that instead of -O0; this will be better for performance.

But -O3 gives crashes, to be investigated in #806.


valassi commented Sep 19, 2024

Hi @oliviermattelaer can you please review?

The ONLY change is that OPTFLAGS=-O2 is used instead of -O3 and ONLY for hip on AMD GPUs.
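A sketch of what such a HIP-only optimization-level override could look like in the makefile (hypothetical fragment, not the actual cudacpp.mk diff; the variable names are assumptions):

```makefile
# Hypothetical sketch: lower the GPU optimization level only when the GPU
# compiler is hipcc (AMD), keeping -O3 for all other builds.
ifeq ($(findstring hipcc,$(GPUCC)),hipcc)
  GPUOPTFLAGS = -O2   # -O3 triggers the gq_ttq crash madgraph5#806 on AMD GPUs
else
  GPUOPTFLAGS = -O3
endif
```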

Plus all processes are regenerated, and my usual tests will appear tomorrow after I run them.

Thanks!

@valassi valassi changed the title Disable hipcc optimizations i.e. use -O2 instead of -O3 (work around for gq_ttq crash 806) Disable hipcc optimizations i.e. use -O2 instead of -O3 (work around for gq_ttq crash 806 on AMD GPUs at LUMI) Sep 19, 2024
…) - now they all succeed! gqttq crash madgraph5#806 has disappeared

(Note: performance on HIP does not seem to be significantly degraded with -O2 with respect to -O3, e.g. on ggttgg)

STARTED  AT Thu 19 Sep 2024 06:24:53 PM EEST
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean  -nocuda
ENDED(1) AT Thu 19 Sep 2024 07:15:36 PM EEST [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean  -nocuda
ENDED(2) AT Thu 19 Sep 2024 07:32:30 PM EEST [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean  -nocuda
ENDED(3) AT Thu 19 Sep 2024 07:41:44 PM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst  -nocuda
ENDED(4) AT Thu 19 Sep 2024 07:43:46 PM EEST [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common  -nocuda'
ENDED(5) AT Thu 19 Sep 2024 07:43:46 PM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common  -nocuda
ENDED(6) AT Thu 19 Sep 2024 07:45:46 PM EEST [Status=0]
./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean  -nocuda
ENDED(7) AT Thu 19 Sep 2024 08:17:24 PM EEST [Status=0]

No errors found in logs
…ds (madgraph5#806 fixed), all as expected (heft fail madgraph5#833, skip ggttggg madgraph5#933)

(Note: performance on HIP does not seem to be significantly degraded with -O2 with respect to -O3, e.g. on ggttgg)

STARTED  AT Thu 19 Sep 2024 11:37:44 PM EEST
(SM tests)
ENDED(1) AT Fri 20 Sep 2024 02:00:00 AM EEST [Status=0]
(BSM tests)
ENDED(1) AT Fri 20 Sep 2024 02:08:55 AM EEST [Status=0]

16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
Revert "[amd] rerun 30 tmad tests on LUMI against AMD GPUs - now gqttq succeeds (madgraph5#806 fixed), all as expected (heft fail madgraph5#833, skip ggttggg madgraph5#933)"
This reverts commit 0d7d4cd.

Revert "[amd] rerun 96 tput builds and tests on LUMI worker node (small-g 72h) - now they all succeed! gqttq crash madgraph5#806 has disappeared"
This reverts commit e41c7ff.
…he getCompiler() function

This gives for instance:
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > ./check_hip.exe  -p 1 8 1
Process = SIGMA_SM_GUX_TTXUX_HIP [hipcc 6.0.32831 (clang 17.0.0)] [inlineHel=0] [hardcodePARAM=0]

(Checked that all is ok when regenerating gq_ttq.mad/SubProcesses/P1_gux_ttxux)
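A getCompiler()-style tag can be assembled from compiler-provided macros; a minimal sketch under the assumption that hipcc exposes HIP_VERSION_MAJOR/MINOR/PATCH (from hip/hip_version.h) and, being clang-based, the usual __clang_* macros (illustrative code, not the actual madgraph4gpu implementation):

```cpp
#include <sstream>
#include <string>

// Build a compiler tag like "hipcc 6.0.32831 (clang 17.0.0)" from
// preprocessor defines; falls back to plain clang/gcc identification.
inline std::string getCompilerSketch()
{
  std::ostringstream out;
#if defined( HIP_VERSION_MAJOR )
  out << "hipcc " << HIP_VERSION_MAJOR << "." << HIP_VERSION_MINOR << "." << HIP_VERSION_PATCH;
#if defined( __clang__ )
  out << " (clang " << __clang_major__ << "." << __clang_minor__ << "." << __clang_patchlevel__ << ")";
#endif
#elif defined( __clang__ )
  out << "clang " << __clang_major__ << "." << __clang_minor__ << "." << __clang_patchlevel__;
#elif defined( __GNUC__ )
  out << "gcc " << __GNUC__ << "." << __GNUC_MINOR__ << "." << __GNUC_PATCHLEVEL__;
#else
  out << "unknown compiler";
#endif
  return out.str();
}
```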

valassi commented Sep 20, 2024

Hi @oliviermattelaer this is now ready to be merged.

I completed my tests and all looks good.

En passant, I also added the getCompiler() tag that decodes the HIP version from compiler defines.

Ready to go for me.


valassi commented Sep 20, 2024

PS: About the fact that there is no performance degradation, compare the two throughput logs:

git diff --no-ext-diff e41c7ff1a 0c947d1d5 tput/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
...
-DATE: 2024-09-19_19:08:34
+DATE: 2024-09-18_17:15:56
 
 On uan01 [CPU: AMD EPYC 7A53 64-Core Processor] [GPU: AMD INSTINCT MI200]:
 =========================================================================
@@ -30,36 +30,36 @@ INFO: The following Floating Point Exceptions will cause SIGFPE program aborts:
 Process                     = SIGMA_SM_GG_TTXGG_HIP [clang 17.0.0] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = HIP:DBL+CXS:HIRDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 1.208699e+05                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 1.258349e+05                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 1.258584e+05                 )  sec^-1
+EvtsPerSec[Rmb+ME]     (23) = ( 1.204596e+05                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 1.259417e+05                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 1.259568e+05                 )  sec^-1
 MeanMatrixElemValue         = ( 3.804675e-02 +- 2.047289e-02 )  GeV^-4
-TOTAL       :     0.607053 sec
+TOTAL       :     0.715128 sec
 INFO: No Floating Point Exceptions have been reported
-     1,491,254,700      cycles:u                         #    2.509 GHz                      (75.96%)
-         2,542,523      stalled-cycles-frontend:u        #    0.17% frontend cycles idle     (75.44%)
-         5,449,723      stalled-cycles-backend:u         #    0.37% backend cycles idle      (75.46%)
-     1,937,303,891      instructions:u                   #    1.30  insn per cycle         
-                                                  #    0.00  stalled cycles per insn  (74.40%)
-       0.660947918 seconds time elapsed
+     1,635,437,665      cycles:u                         #    2.817 GHz                      (74.94%)
+         2,495,684      stalled-cycles-frontend:u        #    0.15% frontend cycles idle     (76.17%)
+         7,104,110      stalled-cycles-backend:u         #    0.43% backend cycles idle      (76.53%)
+     2,096,604,985      instructions:u                   #    1.28  insn per cycle         
+                                                  #    0.00  stalled cycles per insn  (74.05%)
+       0.853269091 seconds time elapsed

@oliviermattelaer

Perfect, thanks

Olivier


valassi commented Sep 25, 2024

Thanks @oliviermattelaer !
Very good, merging

@valassi valassi merged commit f299290 into madgraph5:master Sep 25, 2024
169 checks passed

Development

Successfully merging this pull request may close these issues.

gq_ttq HIP tests crash on AMD GPUs at LUMI (only in -O3 builds, while -O2 builds succeed)