Closed
Labels
bug, gha-runner-scale-set, needs triage
Description
Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
0.9.3
Deployment Method
Helm
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
This happens consistently on our setup but I have no idea how to reproduce elsewhere.
Describe the bug
Jobs are often canceled unexpectedly while they are still running. The job logs show a line like "Error: Process completed with exit code 1.", which is often preceded by "context canceled". The Workflow Summary Annotations also contain:
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
In the runner logs we see:
Failed to create a session. The runner registration has been deleted from the server, please re-configure. Runner registrations are automatically deleted for runners that have not connected to the service recently.
It is very surprising that this happens while jobs are actively running.
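For anyone trying to confirm they are hitting the same failure, here is a minimal sketch that tallies how often the messages quoted above appear in saved job or runner logs. The function name, the short labels, and the idea of feeding it a saved log file are assumptions made for illustration; only the message fragments themselves come from this report.

```python
from collections import Counter

# Message fragments quoted in this report. The short labels on the left
# are hypothetical names invented for this sketch.
SIGNATURES = {
    "registration_deleted": "The runner registration has been deleted from the server",
    "shutdown_signal": "The runner has received a shutdown signal",
    "context_canceled": "context canceled",
    "exit_code_1": "Process completed with exit code 1",
}


def tally_signatures(lines):
    """Count how many log lines contain each known message fragment."""
    counts = Counter({label: 0 for label in SIGNATURES})
    for line in lines:
        for label, fragment in SIGNATURES.items():
            if fragment in line:
                counts[label] += 1
    return dict(counts)


# Example usage against a saved log file (the path is a placeholder):
#
#     with open("runner.log", encoding="utf-8", errors="replace") as fh:
#         print(tally_signatures(fh))
```

A non-zero count for the registration message alongside the shutdown or cancellation messages would match the pattern described here. It does not by itself establish the cause.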
Describe the expected behavior
I would expect runners not to lose their registration while they are actively running jobs.
Additional Context
ghArcRunners:
  enabled: true
  appKeyExternalSecret:
    enabled: true
    secretStoreRef:
      kind: ClusterSecretStore
      name: aws-secrets-manager
    remoteRef:
      - key: eks/build/gh_runner
        property: github_app_id
      - key: eks/build/gh_runner
        property: github_app_installation_id
      - key: eks/build/gh_runner
        property: github_app_private_key
        base64decode: true
  dockerhubSecret:
    secretStoreRef:
      kind: ClusterSecretStore
      name: aws-secrets-manager
    dockerhubSecret:
      remoteRef:
        key: eks/build/dockerhub
        property: DOCKER_CONFIG_SECRET
api-2xlarge-runner-scale-set:
  enabled: true
  githubConfigUrl: "https://github.com/soxhub"
  githubConfigSecret: "gh-arc-runner-appkey"
  maxRunners: 200
  minRunners: 1
  runnerGroup: "api-2xlarge-runner-scale-set-group"
  runnerScaleSetName: "api-2xlarge-runner-scale-set-group"
  listenerTemplate:
    metadata:
      annotations:
        k8s.grafana.com/scrape: "true"
        k8s.grafana.com/job: "api-2xlarge-runner-scale-set-group"
    spec:
      containers:
        - name: listener
  template:
    metadata:
      labels:
        runner-scale-set-group: "api-2xlarge-runner-scale-set-group"
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      securityContext:
        fsGroup: 123
      initContainers:
        - name: init-dind-externals
          image: [SNIP].ecr.us-west-2.amazonaws.com/gh-runner-api:sha-e007c87
          command:
            ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
          volumeMounts:
            - name: dind-externals
              mountPath: /home/runner/tmpDir
      containers:
        - name: runner
          image: [SNIP].ecr.us-west-2.amazonaws.com/gh-runner-api:sha-e007c87
          command: ["/home/runner/run.sh"]
          env:
            - name: DOCKER_HOST
              value: unix:///run/docker/docker.sock
          resources:
            limits:
              memory: "64Gi"
            requests:
              memory: "16Gi"
              cpu: "8.0"
          volumeMounts:
            - mountPath: /home/runner/_work
              name: work
            - mountPath: /var/lib/docker
              name: var-lib-docker
            - name: dind-sock
              mountPath: /run/docker
              readOnly: true
        - name: docker
          image: [SNIP].ecr.us-west-2.amazonaws.com/dkr-hub/library/docker:dind
          args:
            - dockerd
            - --host=unix:///run/docker/docker.sock
            - --group=$(DOCKER_GROUP_GID)
          env:
            - name: DOCKER_GROUP_GID
              value: "123"
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: /home/runner/_work
              name: work
            - mountPath: /var/lib/docker
              name: var-lib-docker
            - name: dind-sock
              mountPath: /run/docker
            - name: dind-externals
              mountPath: /home/runner/externals
      volumes:
        - name: work
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                resources:
                  requests:
                    storage: 30Gi
        - name: var-lib-docker
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                resources:
                  requests:
                    storage: 30Gi
        - name: dind-sock
          emptyDir: {}
        - name: dind-externals
          emptyDir: {}
      nodeSelector:
        auditboard.com/nodegroup: 2xlarge-general
  controllerServiceAccount:
    namespace: gh-arc-runner
    name: gha-runner-scale-set-controller
gha-runner-scale-set-controller:
  enabled: true
  labels: {}
  metrics:
    serviceMonitor:
      enable: true
  replicaCount: 2
  image:
    repository: "ghcr.io/actions/gha-runner-scale-set-controller"
    pullPolicy: IfNotPresent
    tag: ""
  imagePullSecrets: []
  nameOverride: ""
  fullnameOverride: ""
  env:
  serviceAccount:
    create: true
    annotations: {}
    name: "gha-runner-scale-set-controller"
  podAnnotations:
    k8s.grafana.com/scrape: "true"
  podLabels: {}
  podSecurityContext: {}
  securityContext: {}
  resources: {}
  nodeSelector: {}
  tolerations: []
  affinity: {}
  volumes: []
  volumeMounts: []
  priorityClassName: ""
  metrics:
    controllerManagerAddr: ":8080"
    listenerAddr: ":8080"
    listenerEndpoint: "/metrics"
  flags:
    logLevel: "debug"
    logFormat: "text"
    updateStrategy: "immediate"
Controller Logs
https://gist.github.com/wagenet/ccae8e8a164e53587f978ccc53477772
Runner Pod Logs
https://gist.github.com/wagenet/65160702c38aada91cead50ece02c01a