While rerunning the full battery of tests on LUMI, including those on AMD GPUs, I saw several crashes (in both the tput and tmad tests).
Example in tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt:
cmpExe /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/build.none_d_inl0_hrd0/gcheck.exe --common -p 2 64 2
cmpExe /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/build.none_d_inl0_hrd0/fgcheck.exe 2 64 2
Memory access fault by GPU node-4 (Agent handle: 0x693a290) on address 0x1460ca129000. Reason: Unknown.
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 0x146460faf372 in ???
#1 0x146460fae505 in ???
#2 0x14645f4a2dbf in ???
#3 0x14645f4a2d2b in ???
#4 0x14645f4a43e4 in ???
#5 0x146457975b64 in ???
#6 0x146457972b38 in ???
#7 0x146457930496 in ???
#8 0x14645f43c6e9 in ???
#9 0x14645f57049e in ???
#10 0xffffffffffffffff in ???
Avg ME (C++/CUDA) =
Avg ME (F77/CUDA) =
ERROR! Fortran calculation (F77/CUDA) crashed
and also
runExe /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/build.none_d_inl0_hrd0/runTest.exe
Memory access fault by GPU node-4 (Agent handle: 0x667850) on address 0x1454f3e09000. Reason: Unknown.
Example in tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt:
*** (3) EXECUTE MADEVENT_CUDA x1 (create events.lhe) ***
--------------------
CUDACPP_RUNTIME_FBRIDGEMODE = (not set)
CUDACPP_RUNTIME_VECSIZEUSED = 8192
--------------------
8192 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
--------------------
Executing ' ./build.none_d_inl0_hrd0/madevent_cuda < /tmp/valassia/input_gqttq_x1_cudacpp > /tmp/valassia/output_gqttq_x1_cudacpp'
ERROR! ' ./build.none_d_inl0_hrd0/madevent_cuda < /tmp/valassia/input_gqttq_x1_cudacpp > /tmp/valassia/output_gqttq_x1_cudacpp' failed
PDF set = nn23lo1
alpha_s(Mz)= 0.1300 running at 2 loops.
alpha_s(Mz)= 0.1300 running at 2 loops.
Renormalization scale set on event-by-event basis
Factorization scale set on event-by-event basis
getting user params
Enter number of events and max and min iterations:
Number of events and iterations 8192 1 1
This is strange and probably difficult to debug because it is specific to HIP and specific to gqttq:
- The same gqttq tests succeed on NVIDIA GPUs on itscrd90
- All tests but gqttq succeed on AMD GPUs on LUMI
In any case, I do not think this is a blocker for PR #801. It is probably better to merge PR #801 so that this code is readily available and can be tested. The HIP support in PR #801 works for other physics processes, so it is usable at least in some cases.
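To check which logs are affected, a minimal sketch of a log scanner may help. It is not part of the repository: the function names `find_crashes` and `scan_logs` and the glob pattern `*/logs_*/log_*.txt` are assumptions based on the two log paths quoted above, and the two regular expressions only cover the crash messages shown in this issue.

```python
import re
from pathlib import Path

# Crash signatures copied from the log excerpts in this issue.
FAULT_RE = re.compile(
    r"Memory access fault by GPU node-(\d+) .* on address (0x[0-9a-fA-F]+)"
)
ERROR_RE = re.compile(r"^ERROR! .*(crashed|failed)\s*$")


def find_crashes(text):
    """Return (line_number, kind, detail) tuples for crash lines in one log."""
    hits = []
    for n, line in enumerate(text.splitlines(), start=1):
        m = FAULT_RE.search(line)
        if m:
            # Report the faulting address printed by the GPU runtime.
            hits.append((n, "gpu_fault", m.group(2)))
        elif ERROR_RE.search(line):
            hits.append((n, "error", line.strip()))
    return hits


def scan_logs(root="."):
    """Map each log path under root to its crash hits.

    The directory layout (tput/logs_*/log_*.txt, tmad/logs_*/log_*.txt)
    is assumed from the paths quoted in this issue.
    """
    results = {}
    for path in sorted(Path(root).glob("*/logs_*/log_*.txt")):
        hits = find_crashes(path.read_text(errors="replace"))
        if hits:
            results[str(path)] = hits
    return results


if __name__ == "__main__":
    for path, hits in scan_logs().items():
        print(path)
        for n, kind, detail in hits:
            print(f"  line {n}: {kind}: {detail}")
```

Run from the directory that contains `tput` and `tmad`, this lists each log with a matching crash line, which makes it easier to confirm that only the gqttq logs are affected.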