
Flawed logic in ingress controller pod discovery #7047

@r0bobo

Description

NGINX Ingress controller version: 0.45.0

Kubernetes version (use kubectl version): 1.18.9

Environment:

  • Cloud provider or hardware configuration: AWS EKS
  • OS (e.g. from /etc/os-release): Bottlerocket OS 1.0.5
  • Kernel (e.g. uname -a): Linux 5.4.80
  • Install tools: Ingress Nginx Helm Chart deployed with ArgoCD
  • Others: Ingress is running behind an NLB with TLS termination in the NLB

What happened:

When we restart the ingress controller (with kubectl rollout restart deployment), the controller incorrectly removes the address value (.status.loadBalancer.ingress[].hostname) from the ingress status. The address is eventually added back to the manifest after ~30s.

This appeared after we upgraded from 0.41.2 -> 0.45.0.

What you expected to happen:

That the .status.loadBalancer.ingress[].hostname field would not be removed by the ingress controller on a restart.

How to reproduce it:

Install kind

Install the ingress controller

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v0.45.0/deploy/static/provider/baremetal/deploy.yaml

Install an application that will act as the default backend (it is just an echo app)

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/docs/examples/http-svc.yaml

Create an ingress (please add any additional annotation required)

echo "
  apiVersion: networking.k8s.io/v1beta1
  kind: Ingress
  metadata:
    name: foo-bar
  spec:
    rules:
    - host: foo.bar
      http:
        paths:
        - backend:
            serviceName: http-svc
            servicePort: 80
          path: /
" | kubectl apply -f -

Restart the ingress controller and watch the address disappear and reappear.

In one terminal, watch the ingresses:

watch -n 1 kubectl get ingress -A

In another terminal, restart the controller:

kubectl -n ingress-nginx rollout restart deployment ingress-nginx-controller

The following message also appears in the log of the current leader when it shuts down:

ingress-nginx/ingress-nginx-controller-8544f6fcc9-9jjgp[controller]: I0413 13:13:35.582145 7 status.go:132] "removing value from ingress status" address=[172.40.1.2]

Anything else we need to know:

We looked through the code and saw that there was a significant refactoring of the logic that figures out whether any other controller pod instances exist.
Before the refactor it listed pods based on the hard-coded labels app.kubernetes.io/component, app.kubernetes.io/instance and app.kubernetes.io/name to find other controller pods, but this changed to listing with a selector built from all labels assigned to the current pod. Because that selector now also includes the pod-template-hash label, the terminating controller does not see the newly created pods (which belong to a different ReplicaSet and therefore carry a different hash) and incorrectly assumes there are no other replicas, so it removes the address from the ingress status on shutdown.
We then guess it takes ~30s for the status to be restored because a new leader has to be elected first.
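
A minimal sketch of the selector difference (this is not the actual ingress-nginx code; the label values and pod-template-hash values are made up for illustration, using the k8s.io/apimachinery labels package):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	// Labels on the currently running (old) controller pod. pod-template-hash
	// is added automatically by the Deployment's ReplicaSet; the value is made up.
	currentPod := labels.Set{
		"app.kubernetes.io/component": "controller",
		"app.kubernetes.io/instance":  "ingress-nginx",
		"app.kubernetes.io/name":      "ingress-nginx",
		"pod-template-hash":           "8544f6fcc9",
	}

	// Labels on a replacement pod created by `kubectl rollout restart`:
	// same app labels, but a different pod-template-hash.
	newPod := labels.Set{
		"app.kubernetes.io/component": "controller",
		"app.kubernetes.io/instance":  "ingress-nginx",
		"app.kubernetes.io/name":      "ingress-nginx",
		"pod-template-hash":           "5d88c9f6b7",
	}

	// Pre-refactor behaviour: select only on the three hard-coded labels.
	oldSelector := labels.SelectorFromSet(labels.Set{
		"app.kubernetes.io/component": currentPod["app.kubernetes.io/component"],
		"app.kubernetes.io/instance":  currentPod["app.kubernetes.io/instance"],
		"app.kubernetes.io/name":      currentPod["app.kubernetes.io/name"],
	})

	// Post-refactor behaviour: select on *all* labels of the current pod,
	// which drags pod-template-hash into the selector.
	newSelector := labels.SelectorFromSet(currentPod)

	fmt.Println("old selector matches replacement pod:", oldSelector.Matches(newPod)) // true
	fmt.Println("new selector matches replacement pod:", newSelector.Matches(newPod)) // false
}
```

With the old selector the replacement pods are still counted, so the last remaining pod of the old ReplicaSet would not conclude it is the only replica and would not clear the status.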

/kind bug

Labels

kind/bug, priority/critical-urgent
