Description
Describe the bug
Lit RUN lines that try to run a program that does not exist cause lit to hang. This only occurs with tests inside the sycl/test-e2e and sycl/test folders, and only when using one of our containers.
To reproduce
- Include a code snippet that is as short as possible
Add the following file to either sycl/test-e2e or sycl/test:
// RUN: non_existent
- Specify the command which should be used to compile the program
- Specify the command which should be used to launch the program
llvm-lit ./test.cpp
- Indicate what is wrong and what was expected
The test itself should fail immediately, but lit does not report the test statistics and quit. Instead it hangs until either you send an interrupt signal or a timeout of 10 minutes is reached.
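To tell a hang apart from a slow run when reproducing this in a script, the lit invocation can be wrapped in a deadline. The following is only a minimal sketch: `run_with_deadline` is a hypothetical helper name, not part of lit, and the 60-second value in the comment is an arbitrary assumption.

```python
import subprocess


def run_with_deadline(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it did not finish in time.

    subprocess.run kills the child process when the timeout expires,
    so a hung process does not outlive this call.
    """
    try:
        completed = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None  # treated as "hung" by the caller
    return completed.returncode


# Hypothetical usage for this report (requires llvm-lit on PATH):
#   code = run_with_deadline(["llvm-lit", "./test.cpp"], timeout_s=60)
#   print("hung" if code is None else "exit code %d" % code)
```

A `None` result would correspond to the hang described above, while a non-zero exit code would be the expected immediate failure.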
Environment
- OS: Linux container
  - ubuntu2204_intel_drivers:alldeps (confirmed locally)
  - ubuntu2204_intel_drivers:latest (observed on CI)
NOTE: Doesn't reproduce outside of a container, or in some other containers.
- DPC++ version: 28e8416
Additional context
The hang occurs at the call to the multiprocessing.Pool.join method inside the lit implementation.
llvm/llvm/utils/lit/lit/run.py
Lines 87 to 93 in 3565b58
    try:
        self._wait_for(async_results, deadline)
    except:
        pool.terminate()
        raise
    finally:
        pool.join()
This seemingly happens when a function inside one of the worker processes raises an exception that is not caught by the function that directly calls it. In our case the exception is raised in the _executeShCmd function:
llvm/llvm/utils/lit/lit/TestRunner.py
Lines 860 to 861 in 3565b58
    if not executable:
        raise InternalShellError(j, "%r: command not found" % args[0])
The executeShCmd function calls _executeShCmd but does not catch this exception; instead it is caught one level further up, in executeScriptInternal:
llvm/llvm/utils/lit/lit/TestRunner.py
Lines 1108 to 1116 in 3565b58
    try:
        shenv = ShellEnvironment(cwd, test.config.environment)
        exitCode, timeoutInfo = executeShCmd(
            cmd, shenv, results, timeout=litConfig.maxIndividualTestTime
        )
    except InternalShellError:
        e = sys.exc_info()[1]
        exitCode = 127
        results.append(ShellCommandResult(e.command, "", e.message, exitCode, False))
Adding a try/except to executeShCmd circumvents this hang.
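The shape of that workaround can be sketched in isolation. This is not the actual change to lit: `InternalShellError`, `_execute_sh_cmd` and `execute_sh_cmd` below are simplified stand-ins written for illustration, and the use of `shutil.which` for the lookup is an assumption. The point is only that the direct caller turns the exception into exit code 127 instead of letting it propagate.

```python
import shutil


class InternalShellError(Exception):
    """Simplified stand-in for lit's exception of the same name."""

    def __init__(self, command, message):
        super().__init__(message)
        self.command = command
        self.message = message


def _execute_sh_cmd(args):
    # Mirrors the quoted check: fail if the program cannot be found.
    executable = shutil.which(args[0])
    if not executable:
        raise InternalShellError(args, "%r: command not found" % args[0])
    return 0  # real code would launch the process here


def execute_sh_cmd(args):
    """Direct caller: handle the error here rather than further up the stack."""
    try:
        return _execute_sh_cmd(args), ""
    except InternalShellError as e:
        # 127 is the conventional shell exit code for "command not found".
        return 127, e.message
```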
However, it is unclear whether this is actually an issue in upstream LLVM, since it is not reproducible in either the clang/test or llvm/test folders (to be able to compare, we need to set useExternalSh to false in the call to lit.TestRunner._runShTest), and it is only reproducible in our containers.
Probably related: https://stackoverflow.com/questions/15314189/python-multiprocessing-pool-hangs-at-join
Setting useExternalSh to true in the call to lit.TestRunner._runShTest also works as a workaround, which is what #16321 does to avoid the hang. However, this makes the test stdout less readable (all stdout is printed in one block, rather than separated by RUN: lines).