Runner Registration Being Deleted In the Middle of Running Jobs #3748

@wagenet

Controller Version

0.9.3

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes.

To Reproduce

This happens consistently on our setup, but I have no idea how to reproduce it elsewhere.

Describe the bug

Jobs are often canceled unexpectedly while they are running. The job logs show a line like "Error: Process completed with exit code 1.", which is often preceded by "context canceled". The Workflow Summary annotations also contain:

The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

In the runner logs we see:

Failed to create a session. The runner registration has been deleted from the server, please re-configure. Runner registrations are automatically deleted for runners that have not connected to the service recently.

It is very surprising that this happens while jobs are actively running.
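
For anyone checking whether they are hitting the same failure, the messages quoted above can be searched for in saved job and runner logs. This is only a minimal sketch, assuming the logs have been saved as plain text. The label names and function names are hypothetical; only the message strings come from this report.

```python
# Message fragments quoted in this report. The labels on the left are
# hypothetical names chosen for this sketch, not ARC or runner identifiers.
SIGNATURES = {
    "registration_deleted": "The runner registration has been deleted from the server",
    "shutdown_signal": "The runner has received a shutdown signal",
    "context_canceled": "context canceled",
}


def classify_lines(log_text):
    """Return (line_number, label) pairs for lines containing a known message."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for label, needle in SIGNATURES.items():
            if needle in line:
                hits.append((number, label))
    return hits


def summarize(log_text):
    """Count how many matches each label has in the given log text."""
    counts = {label: 0 for label in SIGNATURES}
    for _, label in classify_lines(log_text):
        counts[label] += 1
    return counts
```

Running summarize over the text of a job log and a runner log from the same run shows which of the three symptoms appear together. It does not say anything about why the registration was deleted.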

Describe the expected behavior

I would expect runners not to lose their registration while they are actively running jobs.

Additional Context

ghArcRunners:
  enabled: true
  appKeyExternalSecret:
    enabled: true
    secretStoreRef:
      kind: ClusterSecretStore
      name: aws-secrets-manager
    remoteRef:
      - key: eks/build/gh_runner
        property: github_app_id
      - key: eks/build/gh_runner
        property: github_app_installation_id
      - key: eks/build/gh_runner
        property: github_app_private_key
        base64decode: true
  dockerhubSecret:
    secretStoreRef:
      kind: ClusterSecretStore
      name: aws-secrets-manager
    dockerhubSecret:
      remoteRef:
        key: eks/build/dockerhub
        property: DOCKER_CONFIG_SECRET

api-2xlarge-runner-scale-set:
  enabled: true
  githubConfigUrl: "https://github.com/soxhub"
  githubConfigSecret: "gh-arc-runner-appkey"
  maxRunners: 200
  minRunners: 1
  runnerGroup: "api-2xlarge-runner-scale-set-group"
  runnerScaleSetName: "api-2xlarge-runner-scale-set-group"
  listenerTemplate:
    metadata:
      annotations:
        k8s.grafana.com/scrape: "true"
        k8s.grafana.com/job: "api-2xlarge-runner-scale-set-group"
    spec:
      containers:
      - name: listener
  template:
    metadata:
      labels:
        runner-scale-set-group: "api-2xlarge-runner-scale-set-group"
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      securityContext:
        fsGroup: 123
      initContainers:
      - name: init-dind-externals
        image: [SNIP].ecr.us-west-2.amazonaws.com/gh-runner-api:sha-e007c87
        command:
          ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
        volumeMounts:
          - name: dind-externals
            mountPath: /home/runner/tmpDir
      containers:
      - name: runner
        image: [SNIP].ecr.us-west-2.amazonaws.com/gh-runner-api:sha-e007c87
        command: ["/home/runner/run.sh"]
        env:
          - name: DOCKER_HOST
            value: unix:///run/docker/docker.sock
        resources:
          limits:
            memory: "64Gi"
          requests:
            memory: "16Gi"
            cpu: "8.0"
        volumeMounts:
          - mountPath: /home/runner/_work
            name: work
          - mountPath: /var/lib/docker
            name: var-lib-docker
          - name: dind-sock
            mountPath: /run/docker
            readOnly: true
      - name: docker
        image: [SNIP].ecr.us-west-2.amazonaws.com/dkr-hub/library/docker:dind
        args:
          - dockerd
          - --host=unix:///run/docker/docker.sock
          - --group=$(DOCKER_GROUP_GID)
        env:
          - name: DOCKER_GROUP_GID
            value: "123"
        securityContext:
          privileged: true
        volumeMounts:
          - mountPath: /home/runner/_work
            name: work
          - mountPath: /var/lib/docker
            name: var-lib-docker
          - name: dind-sock
            mountPath: /run/docker
          - name: dind-externals
            mountPath: /home/runner/externals
      volumes:
        - name: work
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                resources:
                  requests:
                    storage: 30Gi
        - name: var-lib-docker
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                resources:
                  requests:
                    storage: 30Gi
        - name: dind-sock
          emptyDir: {}
        - name: dind-externals
          emptyDir: {}
      nodeSelector:
        auditboard.com/nodegroup: 2xlarge-general
  controllerServiceAccount:
    namespace: gh-arc-runner
    name: gha-runner-scale-set-controller

gha-runner-scale-set-controller:
  enabled: true
  labels: {}

  metrics:
    serviceMonitor:
      enable: true

  replicaCount: 2

  image:
    repository: "ghcr.io/actions/gha-runner-scale-set-controller"
    pullPolicy: IfNotPresent
    tag: ""

  imagePullSecrets: []
  nameOverride: ""
  fullnameOverride: ""

  env:

  serviceAccount:
    create: true
    annotations: {}
    name: "gha-runner-scale-set-controller"

  podAnnotations:
    k8s.grafana.com/scrape: "true"

  podLabels: {}

  podSecurityContext: {}

  securityContext: {}

  resources: {}

  nodeSelector: {}

  tolerations: []

  affinity: {}

  volumes: []
  volumeMounts: []

  priorityClassName: ""

  metrics:
    controllerManagerAddr: ":8080"
    listenerAddr: ":8080"
    listenerEndpoint: "/metrics"

  flags:
    logLevel: "debug"
    logFormat: "text"

    updateStrategy: "immediate"

Controller Logs

https://gist.github.com/wagenet/ccae8e8a164e53587f978ccc53477772

Runner Pod Logs

https://gist.github.com/wagenet/65160702c38aada91cead50ece02c01a

Labels

bug, gha-runner-scale-set, needs triage
