gq_ttq HIP tests crash on AMD GPUs at LUMI (only in -O3 builds, while -O2 builds succeed)

While rerunning the full battery of tests on LUMI, including those on AMD GPUs, I observed several crashes (in both the tput and tmad tests).

Example in `tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt`:
```
cmpExe /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/build.none_d_inl0_hrd0/gcheck.exe --common -p 2 64 2
cmpExe /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/build.none_d_inl0_hrd0/fgcheck.exe 2 64 2
Memory access fault by GPU node-4 (Agent handle: 0x693a290) on address 0x1460ca129000. Reason: Unknown.

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x146460faf372 in ???
#1  0x146460fae505 in ???
#2  0x14645f4a2dbf in ???
#3  0x14645f4a2d2b in ???
#4  0x14645f4a43e4 in ???
#5  0x146457975b64 in ???
#6  0x146457972b38 in ???
#7  0x146457930496 in ???
#8  0x14645f43c6e9 in ???
#9  0x14645f57049e in ???
#10  0xffffffffffffffff in ???
Avg ME (C++/CUDA)   = 
Avg ME (F77/CUDA)   = 
ERROR! Fortran calculation (F77/CUDA) crashed
```
and also
```
runExe /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/build.none_d_inl0_hrd0/runTest.exe
Memory access fault by GPU node-4 (Agent handle: 0x667850) on address 0x1454f3e09000. Reason: Unknown.
```

Example in `tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt`:
```
*** (3) EXECUTE MADEVENT_CUDA x1 (create events.lhe) ***
--------------------
CUDACPP_RUNTIME_FBRIDGEMODE = (not set)
CUDACPP_RUNTIME_VECSIZEUSED = 8192
--------------------
8192 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
--------------------
Executing ' ./build.none_d_inl0_hrd0/madevent_cuda < /tmp/valassia/input_gqttq_x1_cudacpp > /tmp/valassia/output_gqttq_x1_cudacpp'
ERROR! ' ./build.none_d_inl0_hrd0/madevent_cuda < /tmp/valassia/input_gqttq_x1_cudacpp > /tmp/valassia/output_gqttq_x1_cudacpp' failed
 PDF set = nn23lo1
 alpha_s(Mz)= 0.1300 running at 2 loops.
 alpha_s(Mz)= 0.1300 running at 2 loops.
 Renormalization scale set on event-by-event basis
 Factorization   scale set on event-by-event basis


 getting user params
Enter number of events and max and min iterations: 
 Number of events and iterations         8192           1           1
```

This is strange and probably difficult to debug because it is specific to HIP and specific to gqttq:
- The same gqttq tests succeed on NVIDIA GPUs on itscrd90
- All tests but gqttq succeed on AMD GPUs on LUMI
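
To check quickly which logs are affected, a small script along these lines could help. This is only a sketch: the helper names `find_gpu_faults` and `scan_logs` and the `log_*.txt` glob are my own assumptions, and it only catches the explicit "Memory access fault by GPU" message seen in the tput excerpt above (the tmad excerpt shows only an `ERROR!` line, which this does not match).

```python
import re
from pathlib import Path

# Matches the fault line format seen in the tput log excerpt, e.g.
# "Memory access fault by GPU node-4 (Agent handle: 0x693a290) on address 0x1460ca129000. Reason: Unknown."
FAULT_RE = re.compile(
    r"Memory access fault by GPU (?P<node>node-\d+) "
    r"\(Agent handle: (?P<agent>0x[0-9a-fA-F]+)\) "
    r"on address (?P<address>0x[0-9a-fA-F]+)\. Reason: (?P<reason>[^.]+)\."
)


def find_gpu_faults(text):
    """Return one dict (node, agent, address, reason) per fault line in text."""
    return [m.groupdict() for m in FAULT_RE.finditer(text)]


def scan_logs(root, pattern="log_*.txt"):
    """Map each log file under root (hypothetical layout) to its fault lines.

    Files without any fault line are omitted from the result.
    """
    hits = {}
    for path in sorted(Path(root).rglob(pattern)):
        faults = find_gpu_faults(path.read_text(errors="replace"))
        if faults:
            hits[str(path)] = faults
    return hits
```

For example, `scan_logs("tput")` would return a dictionary keyed by log path, assuming the logs sit under `tput/` with names like `log_gqttq_mad_d_inl0_hrd0.txt`.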

In any case, I do not think this is a blocker for PR #801. It is probably better to merge PR #801 so that this code is readily available and can be tested. The HIP support in PR #801 already works for other physics processes, so it is usable at least in some cases.


