[SYCL] Lit hang when executing program that does not exist #16351

### Describe the bug

A lit `RUN:` line that tries to run a program that does not exist causes lit to hang. This only occurs with tests inside the `sycl/test-e2e` and `sycl/test` folders, and only when using one of our containers.

### To reproduce

1. Include a code snippet that is as short as possible
Add the following file to either `sycl/test-e2e` or `sycl/test`
```cpp
// RUN: non_existent
```
2. Specify the command which should be used to compile the program
3. Specify the command which should be used to launch the program
`llvm-lit ./test.cpp`
4. Indicate what is wrong and what was expected
The test is expected to fail immediately, after which lit should report the test statistics and exit. Instead, lit hangs until you either send an interrupt signal or a timeout of 10 minutes is reached.
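
To script the reproduction without waiting for the 10-minute timeout, the run can be wrapped in a helper that reports a hang once a deadline passes. The following is a minimal sketch and not part of lit: `run_with_hang_detection` is a hypothetical helper name, and the demo call uses the Python interpreter as a stand-in command so that the snippet runs anywhere.

```python
import subprocess
import sys


def run_with_hang_detection(cmd, timeout_s):
    """Run cmd and report whether it exited or outlived timeout_s seconds.

    Returns ("exited", returncode) or ("hung", None).
    """
    try:
        # subprocess.run kills the child and re-raises if the timeout expires.
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return ("hung", None)
    return ("exited", proc.returncode)


# Stand-in command for demonstration; it exits right away.
print(run_with_hang_detection([sys.executable, "-c", "pass"], timeout_s=30))
```

For the actual reproduction one would pass `["llvm-lit", "./test.cpp"]` with a timeout well below 10 minutes, assuming `llvm-lit` is on `PATH`.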


### Environment

- OS: Linux Container 
  - `ubuntu2204_intel_drivers:alldeps` (confirmed locally)
  - `ubuntu2204_intel_drivers:latest` (observed in CI)
 **NOTE:** Does not reproduce outside of a container, nor in some other containers.
- DPC++ version: 28e84168b5e41bc5f4188b0218534058fbdec336


### Additional context

The hang occurs at the call to the `multiprocessing.Pool.join` method inside the lit implementation.
https://github.com/intel/llvm/blob/3565b587baefda3be1ee727a75c2c78476b4642f/llvm/utils/lit/lit/run.py#L87-L93
This seemingly happens when a function in one of the worker processes raises an exception that is not caught by the function that directly calls it. In our case the exception is raised in the `_executeShCmd` function: https://github.com/intel/llvm/blob/3565b587baefda3be1ee727a75c2c78476b4642f/llvm/utils/lit/lit/TestRunner.py#L860-L861
The `executeShCmd` function calls `_executeShCmd` but does not catch this exception; instead it is caught further up, in `executeScriptInternal`: https://github.com/intel/llvm/blob/3565b587baefda3be1ee727a75c2c78476b4642f/llvm/utils/lit/lit/TestRunner.py#L1108-L1116
Adding a try/except to `executeShCmd` circumvents this hang.
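
The shape of that try/except can be sketched in isolation. This is not the actual change (the issue does not include it) and it does not use lit's real code: `_execute_inner` and `execute_outer` are hypothetical stand-ins for `_executeShCmd` and `executeShCmd`, the `InternalShellError` class here is a simplified stand-in, and the tuple appended to `results` is a placeholder for `ShellCommandResult`.

```python
import shutil


class InternalShellError(Exception):
    """Simplified stand-in for lit's exception of the same name."""

    def __init__(self, command, message):
        super().__init__(message)
        self.command = command
        self.message = message


def _execute_inner(cmd):
    """Stand-in for `_executeShCmd`: raises if the program cannot be resolved."""
    if shutil.which(cmd[0]) is None:
        raise InternalShellError(cmd, "%r: command not found" % cmd[0])
    return 0  # Real code would launch the program here.


def execute_outer(cmd, results):
    """Stand-in for `executeShCmd`: the direct caller handles the exception."""
    try:
        exit_code = _execute_inner(cmd)
    except InternalShellError as e:
        exit_code = 127  # Same exit code the existing outer handler uses.
        results.append((e.command, e.message, exit_code))
    return exit_code
```

With this shape the exception never propagates past the function that directly calls the raising one, which is the condition the hang appears to depend on.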

However, it is unclear whether this is actually an issue in upstream LLVM: it is not reproducible in either the `clang/test` or `llvm/test` folders (to be able to compare, `useExternalSh` has to be set to false in the call of `lit.TestRunner._runShTest`), and it is only reproducible in our containers.

Probably related: https://stackoverflow.com/questions/15314189/python-multiprocessing-pool-hangs-at-join

Setting `useExternalSh` to true in the call of `lit.TestRunner._runShTest` also works as a workaround, and this is what #16321 does to avoid the hang. However, it makes the test stdout less readable (all stdout is printed in one block, rather than separated by `RUN:` lines).

Snippet from `llvm/utils/lit/lit/run.py` (L87-L93), where the hang occurs:

	try:
	    self._wait_for(async_results, deadline)
	except:
	    pool.terminate()
	    raise
	finally:
	    pool.join()
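
One way to confirm where a Python process is blocked, such as in the `pool.join()` call above, is the standard-library `faulthandler` module, which can dump every thread's traceback after a delay. The sketch below is only an assumption about how one might instrument a locally patched copy of lit; `arm_hang_watchdog` and `disarm_hang_watchdog` are hypothetical helper names.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s, stream=None):
    """Schedule a one-shot dump of all thread tracebacks after timeout_s seconds."""
    # The target must have a real file descriptor; default to the original stderr.
    faulthandler.dump_traceback_later(
        timeout_s, repeat=False, file=stream or sys.__stderr__, exit=False
    )


def disarm_hang_watchdog():
    """Cancel the pending dump once the guarded call has returned."""
    faulthandler.cancel_dump_traceback_later()
```

Arming the watchdog just before the suspected call and disarming it right after means a traceback is only printed when the call does not return in time.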

Snippet from `llvm/utils/lit/lit/TestRunner.py` (L1108-L1116), where `executeScriptInternal` catches the exception:

	try:
	    shenv = ShellEnvironment(cwd, test.config.environment)
	    exitCode, timeoutInfo = executeShCmd(
	        cmd, shenv, results, timeout=litConfig.maxIndividualTestTime
	    )
	except InternalShellError:
	    e = sys.exc_info()[1]
	    exitCode = 127
	    results.append(ShellCommandResult(e.command, "", e.message, exitCode, False))


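Snippet from `llvm/utils/lit/lit/TestRunner.py` (L860-L861), where `_executeShCmd` raises the exception:

	if not executable:
	    raise InternalShellError(j, "%r: command not found" % args[0])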
