ARC in Kubernetes mode issue with workflow nodes scheduling #160679
Replies: 6 comments 5 replies
-
One of the ideas to address this issue and get rid of PVs is to push the required data from the runner pod to the workflow pod directly, instead of sharing a volume.
-
I am also running into this same issue when using Kubernetes mode. There are some alternative options I thought of (some are not GREAT, though...):
If I had to choose an option, 3 would be ideal.
-
@DenisPalnitsky thanks for the nice description of the issue. I think most people solved this issue with ReadWriteMany volumes. Yet it's not optimal: ReadWriteMany is either slow or expensive. So actions/runner-container-hooks#160 makes a lot of sense.
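For anyone going the ReadWriteMany route, the relevant knob is the work volume claim in the gha-runner-scale-set Helm values. A minimal sketch, assuming an RWX-capable storage class is available (the class name below is a placeholder):

```yaml
# values.yaml for the gha-runner-scale-set chart (sketch)
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteMany"]          # instead of the default ReadWriteOnce
    storageClassName: "rwx-storage-class"   # placeholder: EFS, Azure Files, NFS, CephFS, ...
    resources:
      requests:
        storage: 1Gi
```

With an RWX volume the workflow pod is no longer pinned to the runner's node, but as noted above the RWX backends tend to be slower or pricier than local RWO disks.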
-
🕒 Discussion Activity Reminder 🕒
This Discussion has been labeled as dormant by an automated system for having no activity in the last 60 days. Please consider one of the following actions:
1️⃣ Close as Out of Date: If the topic is no longer relevant, close the Discussion as outdated.
2️⃣ Provide More Information: Share additional details or context, or let the community know if you've found a solution on your own.
3️⃣ Mark a Reply as Answer: If your question has been answered by a reply, mark the most helpful reply as the solution.
Note: This dormant notification will only apply to Discussions with the dormant label.
Thank you for helping bring this Discussion to a resolution! 💬
-
Any update on this issue? We are having trouble using ARC Kubernetes mode reliably due to these limitations; it really needs an improvement to be fully usable in production.
-
I'm not sure if I should open a second discussion. I have the workflow pod spawn on the same node, so volume access is fine, BUT the workflow pod doesn't reliably initialize: the node doesn't always have enough CPU headroom to spin up another pod, and Karpenter has no visibility into the workflow pod's needs, so it will schedule the runner even when I know the second pod has no hope of starting. The result is that it sometimes works, but a significant number of builds fail at "Initializing container".
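One partial mitigation, sketched below under the assumption that you run the gha-runner-scale-set chart: the Kubernetes container hook honors an ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE file, so you can give the workflow pod explicit resource requests instead of letting it run best-effort. This does not give Karpenter advance visibility (the pod still only exists once the runner creates it), but the scheduler then accounts for the workflow pod's CPU rather than packing it onto a full node, so it sits Pending instead of failing mid-initialization. The ConfigMap name and the request sizes are illustrative:

```yaml
# ConfigMap holding a pod template that the container hook merges into the workflow pod
apiVersion: v1
kind: ConfigMap
metadata:
  name: hook-extension          # illustrative name
  namespace: arc-runners
data:
  content: |
    spec:
      containers:
        - name: "$job"          # targets the job container created by the hook
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
```

Then, in the runner scale set values, mount the ConfigMap into the runner pod and point the hook at it:

```yaml
# gha-runner-scale-set values.yaml excerpt (sketch)
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-template/content
        volumeMounts:
          - name: pod-template
            mountPath: /home/runner/pod-template
            readOnly: true
    volumes:
      - name: pod-template
        configMap:
          name: hook-extension
```

Another commonly discussed workaround is to oversize the runner container's own requests so that each node provisioned for a runner keeps enough headroom for the workflow pod it will spawn.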
-
I'm testing ARC in Kubernetes mode, and I can't figure out how it could work reliably without failing jobs. Here's the problem I'm facing.
Input:
Imagine that we have small nodes that can only accommodate three containers. In that scenario, if ARC needs to schedule two jobs, it will schedule "Runner Pod 1" and "Workflow Pod 1" on the first node. Then, to run the second job, it will schedule "Runner Pod 2" on the first node, but the node will run out of capacity. Therefore, "Workflow Pod 2" cannot be scheduled on Node1 (due to no resources) and cannot be scheduled on Node2 because the PV is attached to Node1. This causes the job to fail.
With larger nodes, the situation may get even worse: Kubernetes can schedule multiple Runner Pods on one node, leaving no capacity to schedule the corresponding Workflow Pods there.
The fundamental problem is that when a job is scheduled, Kubernetes would need to know in advance the resources that will be used by both pods (runner and workflow), and there is no way to tell the scheduler that, because the second pod is only created by the first one.
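For concreteness, this is roughly the configuration in play (a sketch based on the gha-runner-scale-set Helm chart's Kubernetes container mode). Both the runner pod and the workflow pod it spawns mount this work volume, and with ReadWriteOnce access the volume, and therefore the workflow pod, is tied to whichever node the runner landed on:

```yaml
# values.yaml (sketch): the shared work volume behind the scenario above
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]   # RWO pins the volume, and thus the workflow pod, to one node
    storageClassName: "standard"     # placeholder storage class
    resources:
      requests:
        storage: 1Gi
```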
Is there anything I'm missing that could address this issue?