The exec_js_process subroutine in the sandboxjs.py module uses two mechanisms to decide when a JavaScript expression tool has finished:
- It reads the ends of the stderr and stdout streams, looking for a given string that indicates the end of the process
- A timer runs out, indicating that the tool has taken too long to complete and has probably failed, so we should stop waiting
That timer is currently 20 seconds, and on one of our local HPC systems 20 seconds is not enough time to load the Singularity container and run the JavaScript tool. This causes our workflow to fail. Increasing the limit to 30 seconds (as a hardcoded value in the subroutine) solves the problem.
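The timeout behaviour described above can be sketched with Python's standard subprocess module. This is a hypothetical stand-in, not the actual exec_js_process code (the real subroutine polls stdout/stderr for a sentinel string); the point is that on timeout the partial output should be kept and the failure made visible:

```python
import subprocess
import sys

def run_expression(script, timeout=20.0):
    """Run a script in a subprocess with a wall-clock timeout.

    Returns (finished, stdout, stderr). finished is False when the
    deadline passed; in that case the partial output is preserved
    rather than trimmed, so debugging information is not lost.
    (Uses the Python interpreter as a stand-in for the node/container
    invocation in the real code.)
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", script],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    try:
        out, err = proc.communicate(timeout=timeout)
        return True, out, err
    except subprocess.TimeoutExpired:
        # Timed out: kill the child, then collect whatever it wrote.
        proc.kill()
        out, err = proc.communicate()
        return False, out, err
```

A caller could then check the `finished` flag and report a timeout to the user instead of failing silently.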
There are a number of issues here that I think need addressing:
- Currently, when tasks fail in this manner, the user gets no feedback about the cause of the failure. This needs to be corrected so that users know they have hit a timeout failure. Also, when the process is closed by the timer, rather than because the "process finished" string was found in stderr and stdout, the last characters of stderr and stdout probably should not be removed, as this could delete important debugging information.
- What is a sensible time limit to use here? 20 seconds feels reasonable, but (as our case shows) it is not enough when working with distributed filesystems. Presumably we want a low limit, so that not too much time is wasted when tools genuinely fail, but would something like 60 seconds be more reasonable?
- Why is an error not thrown when a tool fails by hitting the time limit? Are there situations in which a tool completes the required task without writing the proper "process finished" string to the end of stderr and stdout? If not, would it not be better for the workflow to end with an error for this step?
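The first and second points above could be addressed together: make the limit user-configurable and raise an explicit error on timeout. A minimal sketch, assuming a hypothetical CWLTOOL_JS_TIMEOUT environment variable and exception class (these names are illustrative, not cwltool's actual API):

```python
import os

# Hardcoded default; too short for some distributed filesystems.
DEFAULT_EVAL_TIMEOUT = 20.0  # seconds

class JavascriptException(Exception):
    """Raised when expression evaluation fails, including on timeout."""

def get_eval_timeout():
    """Read the timeout from an environment variable so users on slow
    filesystems can raise it without patching the source.
    CWLTOOL_JS_TIMEOUT is a hypothetical variable name."""
    try:
        return float(os.environ.get("CWLTOOL_JS_TIMEOUT", DEFAULT_EVAL_TIMEOUT))
    except ValueError:
        return DEFAULT_EVAL_TIMEOUT

def report_timeout(stdout, stderr, timeout):
    """Surface the timeout explicitly instead of failing silently,
    keeping the full stdout/stderr for debugging."""
    raise JavascriptException(
        "JavaScript expression evaluation timed out after %.0f seconds.\n"
        "stdout: %s\nstderr: %s" % (timeout, stdout, stderr)
    )
```

With something like this, a timed-out step would end the workflow with a clear error message rather than an unexplained failure.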