Jobs randomly fail with exit code 137 on ubuntu-24.04 runner #169191
**Exit code 137 on shared runners - this is a known infrastructure challenge**

Hi @MykDmytrenko! Your diagnostic work is excellent - you've ruled out the obvious causes and identified this as likely host-level resource management.

**What's happening:** Host-level container eviction.

**Common triggers on ubuntu-24.04:**
**Immediate mitigation strategies:**

1. Add resource constraints to your workflow:

```yaml
jobs:
  build:
    runs-on: ubuntu-24.04
    timeout-minutes: 30  # Explicit timeout
    steps:
      - name: Set memory limits
        run: |
          # Cap virtual memory so a runaway process fails fast instead of being SIGKILLed by the host
          echo "Setting up memory limits..."
          ulimit -v 6000000  # Limit virtual memory to ~6 GB (value in KB); applies to this step's shell and its children
```

2. Implement retry logic:

```yaml
- name: Run with retry
  uses: nick-fields/retry@v3
  with:
    timeout_minutes: 10
    max_attempts: 3
    command: your-failing-command
```

3. Switch to ubuntu-22.04 temporarily:

```yaml
runs-on: ubuntu-22.04  # More stable, fewer reported issues
```

**Advanced debugging:**

1. Add pre-failure detection:

```yaml
- name: Monitor system resources
  run: |
    # Run your command with background memory monitoring every 5 seconds
    (
      while true; do
        echo "$(date): $(free -m | grep Mem:)"
        sleep 5
      done
    ) &
    MONITOR_PID=$!
    your-actual-command
    kill $MONITOR_PID
```

2. Capture host-level info:

```yaml
- name: Host diagnostics
  run: |
    cat /proc/pressure/memory || true
    cat /proc/pressure/cpu || true
    systemctl status --no-pager || true
```
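To complement the step above, here is a minimal sketch of a diagnostics step gated on `if: failure()`, so it runs only after something has already gone wrong. The grep pattern is illustrative, and `sudo dmesg` relies on the passwordless sudo available on GitHub-hosted Ubuntu images:

```yaml
- name: Post-failure diagnostics
  if: failure()  # run only when a previous step in this job has failed
  run: |
    # A kernel OOM kill leaves a trace in the kernel log; an empty result here
    # supports the external-SIGKILL theory rather than in-VM memory exhaustion.
    sudo dmesg -T | grep -iE 'out of memory|oom-kill' || echo "No kernel OOM events found"
```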
**Reporting to GitHub:** This needs GitHub Support attention:

**Community solutions working:**

1. Self-hosted runners (if feasible):

```yaml
runs-on: self-hosted
```

2. Matrix strategy to distribute load:

```yaml
strategy:
  fail-fast: false
  matrix:
    runner: [ubuntu-22.04, ubuntu-24.04]
runs-on: ${{ matrix.runner }}
```

3. Smaller job segmentation (see the sketch at the end of this reply).

**Root cause theory:** Your analysis is spot-on - this is likely host-level container orchestration killing containers that exceed invisible quotas or compete for resources with other tenants on the same VM. The zombie processes are the smoking gun - they indicate an external SIGKILL, not internal resource exhaustion.

This is a GitHub infrastructure issue that needs escalation. Your diagnostic data is perfect evidence for their engineering team. Many users are reporting similar issues with ubuntu-24.04 specifically - you're not alone in this!
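For item 3 above, a rough sketch of splitting one long job into smaller dependent jobs so each stays well under the runner's resource budget; the job names, `make` targets, and artifact name are placeholders:

```yaml
jobs:
  build:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - run: make build          # placeholder for your build command
      - uses: actions/upload-artifact@v4
        with:
          name: build-output
          path: dist/

  test:
    needs: build                 # runs only after build succeeds, on a fresh runner
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: build-output
          path: dist/
      - run: make test           # placeholder for your test command
```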
Why are you starting this discussion?
Question
What GitHub Actions topic or product is this about?
Misc
Discussion Details
We are experiencing intermittent failures in our GitHub Actions CI pipelines, where the job suddenly terminates with exit code 137.
This exit code typically corresponds to SIGKILL, which usually indicates that the system forcibly killed the process — often due to resource pressure such as out-of-memory, cgroup limits, or CPU contention.
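For reference, the 137 = 128 + 9 (SIGKILL) convention is easy to reproduce in a local shell:

```bash
# A process killed with SIGKILL exits with status 128 + 9 = 137
sleep 30 &
kill -9 $!
wait $!
echo $?   # prints 137
```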
However, during the most recent failure, we were able to collect a full system diagnostic immediately after the error occurred, but before the job fully terminated, and none of the typical causes were observed.
Example log extract before failure:
Below is the full output of the diagnostic step captured right after the failure:
Observations:
Environment Details:
Suspected Cause:
We believe this may be caused by resource contention on the shared GitHub-hosted runners — potentially due to multiple containers running in parallel on the same VM, triggering eviction or enforced limits from the host.
Even though the system was not overloaded, the termination suggests external intervention — possibly cgroup-level enforcement or memory overcommitment at the hypervisor level.
Request for Support:
- Is there a way to detect host-level throttling or container eviction?
We appreciate any insight or guidance you can provide to help mitigate these unpredictable failures.