Jobs randomly fail with exit code 137 on ubuntu-24.04 runner #169191
**Exit code 137 on shared runners - this is a known infrastructure challenge**

Hi @MykDmytrenko! Your diagnostic work is excellent - you've ruled out the obvious causes and identified this as likely host-level resource management.

**What's happening:** Host-level container eviction.

**Common triggers on ubuntu-24.04:**
**Immediate mitigation strategies:**

1. Add resource constraints to your workflow:

```yaml
jobs:
  build:
    runs-on: ubuntu-24.04
    timeout-minutes: 30  # Explicit timeout
    steps:
      - name: Set memory limits
        run: |
          # Cap virtual memory so a runaway process fails fast instead of being SIGKILLed by the host
          echo "Setting up memory limits..."
          ulimit -v 6000000  # Limit virtual memory to ~6 GB (value in KB); applies to this step's shell and its children
```

2. Implement retry logic:

```yaml
- name: Run with retry
  uses: nick-fields/retry@v3
  with:
    timeout_minutes: 10
    max_attempts: 3
    command: your-failing-command
```

3. Switch to ubuntu-22.04 temporarily:

```yaml
runs-on: ubuntu-22.04  # More stable, fewer reported issues
```

**Advanced debugging:**

1. Add pre-failure detection:

```yaml
- name: Monitor system resources
  run: |
    # Run your command with background memory monitoring every 5 seconds
    (
      while true; do
        echo "$(date): $(free -m | grep Mem:)"
        sleep 5
      done
    ) &
    MONITOR_PID=$!
    your-actual-command
    kill $MONITOR_PID
```

2. Capture host-level info:

```yaml
- name: Host diagnostics
  run: |
    cat /proc/pressure/memory || true
    cat /proc/pressure/cpu || true
    systemctl status --no-pager || true
```
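To complement the step above, here is a minimal sketch of a diagnostics step gated on `if: failure()`, so it runs only after something has already gone wrong. The grep pattern is illustrative, and `sudo dmesg` relies on the passwordless sudo available on GitHub-hosted Ubuntu images:

```yaml
- name: Post-failure diagnostics
  if: failure()  # run only when a previous step in this job has failed
  run: |
    # A kernel OOM kill leaves a trace in the kernel log; an empty result here
    # supports the external-SIGKILL theory rather than in-VM memory exhaustion.
    sudo dmesg -T | grep -iE 'out of memory|oom-kill' || echo "No kernel OOM events found"
```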
**Reporting to GitHub:** This needs GitHub Support attention:

**Community solutions working:**

1. Self-hosted runners (if feasible):

```yaml
runs-on: self-hosted
```

2. Matrix strategy to distribute load:

```yaml
strategy:
  fail-fast: false
  matrix:
    runner: [ubuntu-22.04, ubuntu-24.04]
runs-on: ${{ matrix.runner }}
```

3. Smaller job segmentation (see the sketch at the end of this reply).

**Root cause theory:** Your analysis is spot-on - this is likely host-level container orchestration killing containers that exceed invisible quotas or compete for resources with other tenants on the same VM. The zombie processes are the smoking gun - they indicate an external SIGKILL, not internal resource exhaustion.

This is a GitHub infrastructure issue that needs escalation. Your diagnostic data is perfect evidence for their engineering team. Many users are reporting similar issues with ubuntu-24.04 specifically - you're not alone in this!
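For item 3 above, a rough sketch of splitting one long job into smaller dependent jobs so each stays well under the runner's resource budget; the job names, `make` targets, and artifact name are placeholders:

```yaml
jobs:
  build:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - run: make build          # placeholder for your build command
      - uses: actions/upload-artifact@v4
        with:
          name: build-output
          path: dist/

  test:
    needs: build                 # runs only after build succeeds, on a fresh runner
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: build-output
          path: dist/
      - run: make test           # placeholder for your test command
```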
Why are you starting this discussion?
Question
What GitHub Actions topic or product is this about?
Misc
Discussion Details
We are experiencing intermittent failures in our GitHub Actions CI pipelines, where the job suddenly terminates with exit code 137.
This exit code typically corresponds to SIGKILL, which usually indicates that the system forcibly killed the process — often due to resource pressure such as out-of-memory, cgroup limits, or CPU contention.
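For reference, the 137 = 128 + 9 (SIGKILL) convention is easy to reproduce in a local shell:

```bash
# A process killed with SIGKILL exits with status 128 + 9 = 137
sleep 30 &
kill -9 $!
wait $!
echo $?   # prints 137
```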
However, during the most recent failure, we were able to collect a full system diagnostic immediately after the error occurred, but before the job fully terminated, and none of the typical causes were observed.
Example log extract before failure:
Below is the full output of the diagnostic step captured right after the failure:
Observations:
Environment Details:
Suspected Cause:
We believe this may be caused by resource contention on the shared GitHub-hosted runners — potentially due to multiple containers running in parallel on the same VM, triggering eviction or enforced limits from the host.
Even though the system was not overloaded, the termination suggests external intervention — possibly cgroup-level enforcement or memory overcommitment at the hypervisor level.
Request for Support:
- Is there a way to detect host-level throttling or container eviction?
We appreciate any insight or guidance you can provide to help mitigate these unpredictable failures.