[RFC] Support for out-of-service taint

https://github.com/kubernetes/enhancements/pull/1116 was merged, and the feature shipped in k8s 1.24 as an alpha feature.
It is targeted for beta in k8s 1.26.

This KEP introduces a new "out-of-service" taint. When it is applied to a node, the pods on
that node are forcefully deleted and the volume detach operations for the node happen immediately.
As a result, the deleted pods can recover quickly on different nodes.

This feature should speed up workload recovery in the existing remediation mechanisms.
But there is a restriction: before applying the taint, the user has to confirm that the node
is in 'NotReady' status and that the node is shut down or in a non-recoverable state.

If the above conditions are not satisfied, pods on the unhealthy node are not deleted
and the volumeAttachment objects are not removed either.
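
To make the restriction concrete, here is a minimal sketch of the check as I understand it. The taint key, value and effect follow the upstream example (`node.kubernetes.io/out-of-service=nodeshutdown:NoExecute`). `NodeState`, `confirmed_down` and `taint_if_safe` are hypothetical names for illustration only, not part of Kubernetes or SNR.

```python
from dataclasses import dataclass

# Taint from the upstream non-graceful node shutdown example.
OUT_OF_SERVICE_TAINT = {
    "key": "node.kubernetes.io/out-of-service",
    "value": "nodeshutdown",
    "effect": "NoExecute",
}


@dataclass
class NodeState:
    """Hypothetical view of a node; not a Kubernetes API type."""
    name: str
    ready: bool           # False when the node is in 'NotReady' status
    confirmed_down: bool  # True only after shutdown / non-recoverable state was verified


def taint_if_safe(node: NodeState, taints: list) -> bool:
    """Append the out-of-service taint only when both preconditions hold.

    Returns True if the taint is present afterwards, False if it was refused.
    """
    if node.ready or not node.confirmed_down:
        return False
    if OUT_OF_SERVICE_TAINT not in taints:
        taints.append(dict(OUT_OF_SERVICE_TAINT))
    return True
```

The open question for SNR is how `confirmed_down` could ever be set reliably, which is what the options below are about.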

There are several options to confirm the above conditions in SNR:

1. Fencing the node by using a management interface (e.g. IPMI)
   => But SNR does not require such an interface as a prerequisite, according to https://www.medik8s.io/remediation/poison-pill/faq/#how-poison-pill-is-different-from-other-solutions
2. Intentional kernel panic
   => softdog supports the "soft_panic" option.
3. Soft power off
   => We can't trust soft power off when the node is unhealthy.
4. Other options?

Any comments are very welcome.
And if my understanding is wrong, please correct me.

For reference, the following flow is how NHC and SNR work:
```
1. When NHC detects a node failure, it creates a PoisonPillRemediation (PPR) object for the remediation.
2. SNR starts to remediate the node according to the PPR object.
3. SNR marks the unhealthy node as "SchedulingDisabled".
4. The unhealthy node reboots (*1).
   => We have three tiers (1. hardware watchdog, 2. softdog, 3. software reboot).
5. SNR waits for 180 seconds (default) before it starts deleting affected workloads like pods and volumeAttachment objects.
6. SNR takes an action according to the PPR object. We have "NodeDeletion" and "ResourceDeletion" as remediation strategies.
7. SNR deletes the Node object (in case of "NodeDeletion") or frees the affected resources (in case of "ResourceDeletion").
8. After the node recovers, SNR removes the "SchedulingDisabled" flag from the unhealthy node.
   SNR restores the deleted Node object (in case of "NodeDeletion").

(*1)
SNR on a node will trigger reboot if
  1. it sees a remediation CR for itself
  2. or it can't connect to the API server AND
   2a) at least one peer SNR reports a remediation CR exists
   2b) or it fails to ask peers (in this case no one will delete any resources though, peers don't know about the problem)

It will NOT reboot if it can't connect to the API server AND
  1. at least one peer SNR reports there is NO remediation CR
  2. more than 50% of the peers which were contacted so far report API server issues as well (this probably means the API server itself has an issue, or there is a general network issue)
```
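
The reboot rules in (*1) can be sketched as a small decision function. This is only my reading of the rules above, not the actual SNR code: `PeerReply`, `Decision` and `decide` are hypothetical names, and two points are assumptions on my side. First, a "remediation CR exists" reply takes precedence over a "NO remediation CR" reply. Second, any case the rules do not cover is returned as `UNDECIDED` instead of being guessed.

```python
from enum import Enum
from typing import Sequence


class PeerReply(Enum):
    CR_EXISTS = "cr_exists"      # peer reports a remediation CR for this node
    NO_CR = "no_cr"              # peer reports there is NO remediation CR
    API_ERROR = "api_error"      # peer cannot reach the API server either
    UNREACHABLE = "unreachable"  # this node failed to ask the peer


class Decision(Enum):
    REBOOT = "reboot"
    NO_REBOOT = "no_reboot"
    UNDECIDED = "undecided"      # not covered by the rules in (*1)


def decide(api_reachable: bool, sees_own_cr: bool,
           replies: Sequence[PeerReply]) -> Decision:
    """Decide whether a node should reboot; `replies` holds one entry per contacted peer."""
    if api_reachable:
        # Rule 1: reboot only if the node sees a remediation CR for itself.
        return Decision.REBOOT if sees_own_cr else Decision.NO_REBOOT

    answered = [r for r in replies if r is not PeerReply.UNREACHABLE]
    if not answered:
        # Rule 2b: the node failed to ask any peer.
        return Decision.REBOOT
    if PeerReply.CR_EXISTS in answered:
        # Rule 2a (assumed to win over a NO_CR reply).
        return Decision.REBOOT
    if PeerReply.NO_CR in answered:
        return Decision.NO_REBOOT

    api_errors = sum(r is PeerReply.API_ERROR for r in replies)
    if 2 * api_errors > len(replies):
        # More than 50% of contacted peers also have API server issues.
        return Decision.NO_REBOOT
    return Decision.UNDECIDED
```

Note that none of these branches tells a peer that the unhealthy node is really down, which is the confirmation the out-of-service taint needs.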
