Commit 683d7f1

update KEP
1 parent 370a973 commit 683d7f1

2 files changed

+43
-23
lines changed

keps/sig-storage/20190530-pv-health-monitor.md

Lines changed: 43 additions & 23 deletions
@@ -38,6 +38,7 @@ status: provisional
 * [External controller](#external-controller)
 * [External agent](#external-agent)
 * [Simple reactions](#simple-reactions)
+* [Alternative](#alternative)
 * [Implementation History](#implementation-history)


@@ -72,24 +73,25 @@ The main architecture is as below:

 ![pv health monitor architecture](./pv-health-monitor.png)

-First of all, i want to note that we divide the PVs health condition checking into three cases
+First of all, i want to note that we mainly check three aspects at first:

 - The health condition checking of PVs themselves, such as if the PV is deleted, if the usage is reaching the threshold...;
 - Attaching conditions checking;
 - Mounting conditions checking.

-And in addition, we plan to create a service to receive PV health condition reports from other compoments deployed by users.
+And in addition, we plan to create a service to receive PV health condition reports from other compoments implemented and deployed by users.

 Three main parts are involved here in the architecture.

-- API change: we plan to introduce a new Taint called PVUnhealthTaint whose key is specific (PVUnhealthMessage) and value can be set differently.
-- External Controller: responsible for three tasks.
+- API change: we plan to use Annotation to mark PVs if they are unhealthy at the first stage.
+- External Controller:
+- Check if the network storage is still attached
 - Trigger controller RPC to check the health condition of network PVs themselves for network storage;
 - Watch for node failure events for both network and local storage;
 - Create HTTP(RPC) service to receive PVs health condition reports;
-- External Agent: responsible for two tasks.
-- Trigger node RPC to check PVs’ attaching and mounting conditions for network storage;
-- Since we want to check attaching per node in order to support multi-attach, put attaching check in node RPC here.
+
+- External Agent:
+- Trigger node RPC to check PVs’ mounting conditions for network storage;
 - Trigger controller and node RPC(when ready) to check local PVs health condition for local storage;
 - For now, we do not have CSI support for local storage, we may check the local PVs directly by the agent at first, and then move the checks to RPC interfaces when ready.

@@ -100,27 +102,20 @@ Three main parts are involved here in the architecture.

 ### API change

-We plan to introduce a new Taint called PVUnhealthMessage for PV health condition whose key is specific (PVUnhealthMessage) and value can be set differently.
+At the first stage, we plan to use annotation to mark PVs if they are unhealthy.

-For example, if the PV is not attached now, we can mark the PV using the PVUnhealthMessage taint like this:
-```
-Key: “PVUnhealthMessage”
-Value: “AttachError,the pv is not attached to node1 now”
-VolumeTaintEffect: NoEffect
-```
+Annotation key can be: `alpha.pv.monitor/unhealthy-messages` and value can be a json string containing all unhealthy details.

-If the volume is deleted, we can mark the PV using the PVUnhealthMessage taint like this:
+For example:
 ```
-Key: “PVUnhealthMessage”
-Value: “VolumeError, the volume is deleted from backend”
-VolumeTaintEffect: NoEffect
+Annotations:
+alpha.pv.monitor/unhealthy-messages: {
+"AttachError": "the pv is not attached to node1 now",
+...
+}
 ```

-Note that:
-
-- all the VolumeTaintEffects are NoEffect now at first, we may talk about the reactions later in another proposal;
-- the taint Value is string now, it is theoretically possible that several errors are detected for one PV, we may extend the string to cover this situation: combine the errors together and splited by semicolon or other symbols.
-
+We can also use PV Taints to mark PVs as an alternative, see the alternative section below.

 ### CSI change

@@ -303,9 +298,34 @@ For now, check local PVs directly by the agent.

 For unbound PVCs/PVs, we need to prevent binding tainted PVs to PVCs.

+### Alternative
+
+In addition to PV health annotation, we can also reuse the PV Taints and introduce a new Taint called PVUnhealthMessage for PV health condition whose key is specific (PVUnhealthMessage) and value can be set differently.
+
+For example, if the PV is not attached now, we can mark the PV using the PVUnhealthMessage taint like this:
+```
+Key: “PVUnhealthMessage”
+Value: “AttachError,the pv is not attached to node1 now”
+VolumeTaintEffect: NoEffect
+```
+
+If the volume is deleted, we can mark the PV using the PVUnhealthMessage taint like this:
+```
+Key: “PVUnhealthMessage”
+Value: “VolumeError, the volume is deleted from backend”
+VolumeTaintEffect: NoEffect
+```
+
+Note that:
+
+- all the VolumeTaintEffects are NoEffect now at first, we may talk about the reactions later in another proposal;
+- the taint Value is string now, it is theoretically possible that several errors are detected for one PV, we may extend the string to cover this situation: combine the errors together and splited by semicolon or other symbols.
+

 ## Implementation History

+- 20191021: KEP updated
+
 - 20190730: KEP updated

 - 20190530: KEP submitted
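
Reviewer sketch: the annotation approach this commit introduces in the "API change" section (key `alpha.pv.monitor/unhealthy-messages`, value a JSON object of error details) could be maintained roughly as below. This is only an illustration under assumptions. The helper name `MarkUnhealthy` and the choice of a flat string-to-string JSON object are hypothetical and not part of the KEP, and the code works on a plain annotations map rather than a real PersistentVolume object.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// UnhealthyAnnotationKey is the annotation key proposed in the KEP text.
const UnhealthyAnnotationKey = "alpha.pv.monitor/unhealthy-messages"

// MarkUnhealthy is a hypothetical helper (not defined by the KEP). It decodes
// the existing JSON value, adds or overwrites one error entry, and returns a
// copy of the annotations with the re-encoded value.
func MarkUnhealthy(annotations map[string]string, errType, message string) (map[string]string, error) {
	messages := map[string]string{}
	if raw, ok := annotations[UnhealthyAnnotationKey]; ok && raw != "" {
		if err := json.Unmarshal([]byte(raw), &messages); err != nil {
			return nil, fmt.Errorf("decode %s: %w", UnhealthyAnnotationKey, err)
		}
	}
	messages[errType] = message

	encoded, err := json.Marshal(messages)
	if err != nil {
		return nil, err
	}

	// Copy instead of mutating the caller's map.
	out := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		out[k] = v
	}
	out[UnhealthyAnnotationKey] = string(encoded)
	return out, nil
}

func main() {
	ann, err := MarkUnhealthy(nil, "AttachError", "the pv is not attached to node1 now")
	if err != nil {
		panic(err)
	}
	ann, err = MarkUnhealthy(ann, "VolumeError", "the volume is deleted from backend")
	if err != nil {
		panic(err)
	}
	fmt.Println(ann[UnhealthyAnnotationKey])
}
```

Keeping one JSON object per PV lets several errors coexist under a single key, which is the case the taint alternative would have to handle by joining strings with a separator.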