@@ -72,24 +73,25 @@ The main architecture is as below:
 
 
 
-First of all, I want to note that we divide the PVs health condition checking into three cases
+First of all, I want to note that we mainly check three aspects at first:
 
 - The health condition of the PVs themselves, such as whether the PV is deleted or whether the usage is reaching the threshold;
 - Attaching conditions checking;
 - Mounting conditions checking.
 
-In addition, we plan to create a service to receive PV health condition reports from other components deployed by users.
+In addition, we plan to create a service to receive PV health condition reports from other components implemented and deployed by users.
 
 Three main parts are involved here in the architecture.
 
-- API change: we plan to introduce a new Taint called PVUnhealthTaint whose key is specific (PVUnhealthMessage) and value can be set differently.
-- External Controller: responsible for three tasks.
+- API change: at the first stage, we plan to use an annotation to mark PVs if they are unhealthy.
+- External Controller:
+- Check if the network storage is still attached;
 - Trigger controller RPC to check the health condition of network PVs themselves for network storage;
 - Watch for node failure events for both network and local storage;
 - Create HTTP(RPC) service to receive PVs health condition reports;
-- External Agent: responsible for two tasks.
-- Trigger node RPC to check PVs’ attaching and mounting conditions for network storage;
-- Since we want to check attaching per node in order to support multi-attach, put attaching check in node RPC here.
+
+- External Agent:
+- Trigger node RPC to check PVs’ mounting conditions for network storage;
 - Trigger controller and node RPC (when ready) to check local PVs health condition for local storage;
 - Since we do not have CSI support for local storage for now, the agent may check the local PVs directly at first, and the checks can move to RPC interfaces when ready.
 
@@ -100,27 +102,20 @@ Three main parts are involved here in the architecture.
 
 ### API change
 
-We plan to introduce a new Taint called PVUnhealthMessage for PV health condition whose key is specific (PVUnhealthMessage) and value can be set differently.
+At the first stage, we plan to use an annotation to mark PVs if they are unhealthy.
 
-For example, if the PV is not attached now, we can mark the PV using the PVUnhealthMessage taint like this:
-```
-Key: “PVUnhealthMessage”
-Value: “AttachError,the pv is not attached to node1 now”
-VolumeTaintEffect: NoEffect
-```
+The annotation key can be `alpha.pv.monitor/unhealthy-messages`, and the value can be a JSON string containing all unhealthy details.
 
-If the volume is deleted, we can mark the PV using the PVUnhealthMessage taint like this:
+For example:
 ```
-Key: “PVUnhealthMessage”
-Value: “VolumeError, the volume is deleted from backend”
-VolumeTaintEffect: NoEffect
+Annotations:
+  alpha.pv.monitor/unhealthy-messages: {
+    "AttachError": "the pv is not attached to node1 now",
+    ...
+  }
 ```
 
-Note that:
-
-- all the VolumeTaintEffects are NoEffect at first; we may talk about the reactions later in another proposal;
-- the taint Value is a string now. It is theoretically possible that several errors are detected for one PV; we may extend the string to cover this situation by combining the errors, split by semicolons or other symbols.
-
+As an alternative, we can also use PV Taints to mark PVs; see the Alternative section below.
 
 ### CSI change
 
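As a rough illustration of the annotation format proposed in the API change hunk above, the sketch below serializes unhealthy details into a JSON string under the `alpha.pv.monitor/unhealthy-messages` key and reads them back. Only the key and the `AttachError` example come from the proposal; the helper names and the choice to drop the annotation when no errors remain are assumptions made for illustration, not part of the proposed design.

```python
import json

# Annotation key quoted from the proposal.
UNHEALTHY_ANNOTATION_KEY = "alpha.pv.monitor/unhealthy-messages"


def set_unhealthy_messages(annotations, messages):
    """Return a copy of `annotations` with the unhealthy-messages entry updated.

    `messages` maps an error type (e.g. "AttachError") to a detail string.
    Assumption: an empty mapping removes the annotation, meaning the PV
    is treated as healthy again.
    """
    updated = dict(annotations)
    if messages:
        # Annotation values are plain strings, so the details are stored as JSON.
        updated[UNHEALTHY_ANNOTATION_KEY] = json.dumps(messages, sort_keys=True)
    else:
        updated.pop(UNHEALTHY_ANNOTATION_KEY, None)
    return updated


def get_unhealthy_messages(annotations):
    """Decode the unhealthy-messages annotation, or return {} if it is absent."""
    raw = annotations.get(UNHEALTHY_ANNOTATION_KEY)
    return json.loads(raw) if raw else {}
```

Because the value is a JSON object, several error types can be recorded for one PV without inventing a separator convention, which is the main practical difference from a single string value.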
@@ -303,9 +298,34 @@ For now, check local PVs directly by the agent.
 
 For unbound PVCs/PVs, we need to prevent binding tainted PVs to PVCs.
 
+### Alternative
+
+In addition to the PV health annotation, we can also reuse PV Taints and introduce a new Taint called PVUnhealthMessage for PV health condition, whose key is fixed (PVUnhealthMessage) and whose value can be set differently.
+
+For example, if the PV is not attached now, we can mark the PV using the PVUnhealthMessage taint like this:
+```
+Key: “PVUnhealthMessage”
+Value: “AttachError,the pv is not attached to node1 now”
+VolumeTaintEffect: NoEffect
+```
+
+If the volume is deleted, we can mark the PV using the PVUnhealthMessage taint like this:
+```
+Key: “PVUnhealthMessage”
+Value: “VolumeError, the volume is deleted from backend”
+VolumeTaintEffect: NoEffect
+```
+
+Note that:
+
+- all the VolumeTaintEffects are NoEffect at first; we may talk about the reactions later in another proposal;
+- the taint Value is a string now. It is theoretically possible that several errors are detected for one PV; we may extend the string to cover this situation by combining the errors, split by semicolons or other symbols.
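The last note above leaves the multi-error encoding open. One possible reading, sketched here purely as an assumption, is to keep each error in the `ErrorType,message` shape used by the examples and join the entries with a semicolon. The function names and the rule of rejecting messages that contain the separator are hypothetical, not part of the proposal.

```python
SEPARATOR = ";"  # assumed separator; the proposal only says "semicolon or other symbols"


def encode_taint_value(errors):
    """Join (error_type, message) pairs into a single taint Value string."""
    parts = []
    for error_type, message in errors:
        if SEPARATOR in error_type or SEPARATOR in message:
            # Without escaping rules, a separator inside a message would be ambiguous.
            raise ValueError("error text must not contain %r" % SEPARATOR)
        parts.append(f"{error_type},{message}")
    return SEPARATOR.join(parts)


def decode_taint_value(value):
    """Split a taint Value string back into (error_type, message) pairs."""
    errors = []
    for part in value.split(SEPARATOR):
        if not part:
            continue  # tolerate an empty value or a trailing separator
        # Only the first comma separates the type from the message.
        error_type, _, message = part.partition(",")
        errors.append((error_type.strip(), message.strip()))
    return errors
```

A string encoding like this needs an agreed separator and escaping policy, which is one reason a JSON annotation value may be simpler when several errors must be reported for the same PV.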