ARC in Kubernetes mode issue with workflow nodes scheduling #160679
Replies: 6 comments 5 replies
-
One of the ideas to address this issue and get rid of PVs is to push the required data from the runner pod to the workflow pod directly, instead of sharing a volume.
-
I am also running into this same issue when using Kubernetes mode. There are some alternative options I thought of (some are not GREAT, though...):
If I had to choose an option, 3 would be ideal.
-
@DenisPalnitsky thanks for the nice description of the issue. I think most people solved this issue with ReadWriteMany volumes. Yet it's not optimal: ReadWriteMany is either slow or expensive. So actions/runner-container-hooks#160 makes a lot of sense.
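For anyone going the ReadWriteMany route, the relevant knob is the work volume claim in the gha-runner-scale-set Helm values. A minimal sketch, assuming an RWX-capable storage class is available (the class name below is a placeholder):

```yaml
# values.yaml for the gha-runner-scale-set chart (sketch)
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteMany"]          # instead of the default ReadWriteOnce
    storageClassName: "rwx-storage-class"   # placeholder: EFS, Azure Files, NFS, CephFS, ...
    resources:
      requests:
        storage: 1Gi
```

With an RWX volume the workflow pod is no longer pinned to the runner's node, but as noted above the RWX backends tend to be slower or pricier than local RWO disks.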
-
🕒 Discussion Activity Reminder 🕒
This Discussion has been labeled as dormant by an automated system for having no activity in the last 60 days. Please consider one of the following actions:
1️⃣ Close as Out of Date: If the topic is no longer relevant, close the Discussion as outdated.
2️⃣ Provide More Information: Share additional details or context, or let the community know if you've found a solution on your own.
3️⃣ Mark a Reply as Answer: If your question has been answered by a reply, mark the most helpful reply as the solution.
Note: This dormant notification will only apply to Discussions with the dormant label.
Thank you for helping bring this Discussion to a resolution! 💬
-
Any update on this issue? We are having trouble using ARC Kubernetes mode reliably due to these limitations; it really needs an improvement to be fully usable in production.
-
I'm not sure if I should open a second discussion. I have the workflow pod spawn on the same node, so volume access is fine, BUT the workflow pod doesn't reliably initialize: the node doesn't always have enough CPU headroom to spin up another pod, and Karpenter has no visibility into the workflow pod's needs, so it will schedule the runner even when I know the second pod has no hope of starting. The result is that it sometimes works, but a significant number of builds fail at "Initializing container".
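One partial mitigation, sketched below under the assumption that you run the gha-runner-scale-set chart: the Kubernetes container hook honors an ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE file, so you can give the workflow pod explicit resource requests instead of letting it run best-effort. This does not give Karpenter advance visibility (the pod still only exists once the runner creates it), but the scheduler then accounts for the workflow pod's CPU rather than packing it onto a full node, so it sits Pending instead of failing mid-initialization. The ConfigMap name and the request sizes are illustrative:

```yaml
# ConfigMap holding a pod template that the container hook merges into the workflow pod
apiVersion: v1
kind: ConfigMap
metadata:
  name: hook-extension          # illustrative name
  namespace: arc-runners
data:
  content: |
    spec:
      containers:
        - name: "$job"          # targets the job container created by the hook
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
```

Then, in the runner scale set values, mount the ConfigMap into the runner pod and point the hook at it:

```yaml
# gha-runner-scale-set values.yaml excerpt (sketch)
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-template/content
        volumeMounts:
          - name: pod-template
            mountPath: /home/runner/pod-template
            readOnly: true
    volumes:
      - name: pod-template
        configMap:
          name: hook-extension
```

Another commonly discussed workaround is to oversize the runner container's own requests so that each node provisioned for a runner keeps enough headroom for the workflow pod it will spawn.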
-
I'm testing ARC in Kubernetes mode, and I can't figure out how it could work reliably without failing jobs. Here's the problem I'm facing.
Input:
Imagine that we have small nodes that can only accommodate three containers. In that scenario, if ARC needs to schedule two jobs, it will schedule "Runner Pod 1" and "Workflow Pod 1" on the first node. Then, to run the second job, it will schedule "Runner Pod 2" on the first node, but the node will run out of capacity. Therefore, "Workflow Pod 2" cannot be scheduled on Node1 (due to no resources) and cannot be scheduled on Node2 because the PV is attached to Node1. This causes the job to fail.
With larger nodes, the situation may get even worse: Kubernetes can schedule multiple Runner Pods on one node, leaving no capacity to schedule the corresponding Workflow Pods there.
The fundamental problem is that when a job is scheduled, Kubernetes would need to know in advance the resources that will be used by both pods (runner and workflow), and there is no way to tell the scheduler that, because the second pod is only created by the first one.
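For concreteness, this is roughly the configuration in play (a sketch based on the gha-runner-scale-set Helm chart's Kubernetes container mode). Both the runner pod and the workflow pod it spawns mount this work volume, and with ReadWriteOnce access the volume, and therefore the workflow pod, is tied to whichever node the runner landed on:

```yaml
# values.yaml (sketch): the shared work volume behind the scenario above
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]   # RWO pins the volume, and thus the workflow pod, to one node
    storageClassName: "standard"     # placeholder storage class
    resources:
      requests:
        storage: 1Gi
```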
Is there anything I'm missing that could address this issue?