|
| 1 | +# Meta |
| 2 | + |
| 3 | +| Field | Value | |
| 4 | +|----------------|----------------------------------------| |
| 5 | +| RFC Name | Faster Failover and Configuration Push | |
| 6 | +| RFC ID | 75 | |
| 7 | +| Start Date | 2023-06-14 | |
| 8 | +| Owner | Sergey Avseyev | |
| 9 | +| Current Status | DRAFT | |
| 10 | +| Revision | #1 | |
| 11 | + |
| 12 | +# Summary |
| 13 | + |
| 14 | +TBD |
| 15 | + |
| 16 | +# Motivation |
| 17 | + |
| 18 | +TBD |
| 19 | + |
| 20 | +# Relation to Other RFCs |
| 21 | + |
| 22 | +This RFC relates to the following documents: |
| 23 | + |
| 24 | +* [RFC-0005][rfc-0005]: VBucket Retry Logic. |
| 25 | + |
| 26 | +* [RFC-0024][rfc-0024]: Fast-Failover SDK. |
| 27 | + |
| 28 | + |
| 29 | +# High-Level Design |
| 30 | + |
| 31 | +TBD |
| 32 | + |
| 33 | +# User-Facing API |
| 34 | + |
| 35 | +TBD |
| 36 | + |
| 37 | +# Implementation Details |
| 38 | + |
| 39 | +## Protocol Changes |
| 40 | + |
| 41 | +[https://issues.couchbase.com/browse/MB-57311]: # |
| 42 | + |
| 43 | +### Get Cluster Config with Known Version |
| 44 | + |
| 45 | +[https://review.couchbase.org/c/kv_engine/+/192301]: # |
| 46 | + |
| 47 | +The KV engine introduces a new HELLO flag called `GetClusterConfigWithKnownVersion` with a value of `0x1d`. This flag |
| 48 | +does not change the behavior of the server but allows determining if the node supports epoch-revision fields for the |
| 49 | +`GetClusterConfig` (`0xb5`) operation. If the node acknowledges `GetClusterConfigWithKnownVersion`, then the SDK can use |
| 50 | +the new version of the command. |
| 51 | + |
| 52 | +Epoch and revision are signed 64-bit integers encoded in network (big-endian) order. |
| 53 | + |
| 54 | + |
| 55 | +      Byte/     0       |       1       |       2       |       3       | |
| 56 | +         /              |               |               |               | |
| 57 | +        |0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7| |
| 58 | +        +---------------+---------------+---------------+---------------+ |
| 59 | +       0| 0x80          | 0xb5          | 0x00          | 0x00          | |
| 60 | +        +---------------+---------------+---------------+---------------+ |
| 61 | +       4| 0x00          | 0x00          | 0x00          | 0x00          | |
| 62 | +        +---------------+---------------+---------------+---------------+ |
| 63 | +       8| 0x00          | 0x00          | 0x00          | 0x10          | |
| 64 | +        +---------------+---------------+---------------+---------------+ |
| 65 | +      12| 0xde          | 0xad          | 0xbe          | 0xef          | |
| 66 | +        +---------------+---------------+---------------+---------------+ |
| 67 | +      16| 0x00          | 0x00          | 0x00          | 0x00          | |
| 68 | +        +---------------+---------------+---------------+---------------+ |
| 69 | +      20| 0x00          | 0x00          | 0x00          | 0x00          | |
| 70 | +        +---------------+---------------+---------------+---------------+ |
| 71 | +      24| 0x00          | 0x00          | 0x00          | 0x00          | |
| 72 | +        +---------------+---------------+---------------+---------------+ |
| 73 | +      28| 0x00          | 0x00          | 0x00          | 0x42          | |
| 74 | +        +---------------+---------------+---------------+---------------+ |
| 75 | +      32| 0x01          | 0x02          | 0x03          | 0x04          | |
| 76 | +        +---------------+---------------+---------------+---------------+ |
| 77 | +      36| 0x05          | 0x06          | 0x07          | 0x08          | |
| 78 | +        +---------------+---------------+---------------+---------------+ |
| 81 | +    GET_CLUSTER_CONFIG command |
| 82 | +    Field        (offset) (value) |
| 83 | +    Magic        (0)    : 0x80 (client request, SDK -> kv_engine) |
| 84 | +    Opcode       (1)    : 0xb5 |
| 85 | +    Key length   (2,3)  : 0x0000 |
| 86 | +    Extra length (4)    : 0x00 |
| 87 | +    Data type    (5)    : 0x00 (RAW) |
| 88 | +    Vbucket      (6,7)  : 0x0000 |
| 89 | +    Total body   (8-11) : 0x00000010 (16 bytes) |
| 90 | +    Opaque       (12-15): 0xdeadbeef |
| 91 | +    CAS          (16-23): 0x0000000000000000 |
| 92 | +    Epoch        (24-31): 0x0000000000000042 (66 in base-10) |
| 93 | +    Revision     (32-39): 0x0102030405060708 (72623859790382856 in base-10) |
| 94 | + |
| 95 | +If the node has a cluster configuration newer than what is specified in the example, the response will include the new |
| 96 | +configuration in the body with the data type set to `JSON` (`0x01`). Otherwise, the response will have an empty body |
| 97 | +with the data type `RAW` (`0x00`). |
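
As an illustration, the request above can be assembled with Python's `struct` module. This is a minimal sketch rather than SDK code: the function name `encode_get_cluster_config` and its parameters are hypothetical, and the header mirrors the field list above exactly as written (extras length `0x00`, 16-byte body holding epoch and revision). Whether the server expects these 16 bytes to be counted as extras should be confirmed against the kv_engine protocol documentation.

```python
import struct

MAGIC_CLIENT_REQUEST = 0x80
OPCODE_GET_CLUSTER_CONFIG = 0xB5

# 24-byte request header: magic, opcode, key length, extras length,
# data type, vbucket, total body length, opaque, CAS (all big-endian).
HEADER = struct.Struct(">BBHBBHIIQ")


def encode_get_cluster_config(epoch: int, revision: int, opaque: int = 0) -> bytes:
    """Build a GetClusterConfig request carrying the known (epoch, revision).

    Hypothetical helper: the layout follows the field list in this RFC draft.
    """
    # Epoch and revision are signed 64-bit integers in network byte order.
    payload = struct.pack(">qq", epoch, revision)
    header = HEADER.pack(
        MAGIC_CLIENT_REQUEST,
        OPCODE_GET_CLUSTER_CONFIG,
        0,             # key length
        0,             # extras length, as shown in the field list above
        0,             # data type RAW
        0,             # vbucket
        len(payload),  # total body length (16)
        opaque,
        0,             # CAS
    )
    return header + payload
```

Calling `encode_get_cluster_config(0x42, 0x0102030405060708, opaque=0xDEADBEEF)` yields a 40-byte packet whose bytes 24-31 hold the epoch and bytes 32-39 hold the revision.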
| 98 | + |
| 99 | +### Deduplicate Cluster Configuration for `NotMyVbucket` Responses |
| 100 | + |
| 101 | +[https://review.couchbase.org/c/kv_engine/+/190899]: # |
| 102 | + |
| 103 | +The KV engine introduces a new HELLO flag called `DedupeNotMyVbucketClustermap` with a value of `0x1e`. Once this flag |
| 104 | +is negotiated, the node might send an empty body with `NotMyVbucket` (`0x07`) status codes. The KV engine tracks the |
| 105 | +revision that has been sent to the SDK over the socket connection, so a response with a `NotMyVbucket` status will only |
| 106 | +have a body if the pushed version is older than the active configuration. |
| 107 | + |
| 108 | +The KV engine updates the pushed configuration version in the following cases: |
| 109 | +* Configuration sent to the SDK in response to a `GetClusterConfig` (`0xb5`) request. |
| 110 | +* Configuration pushed to the SDK that enabled the HELLO flag `ClustermapChangeNotification` (`0x0d`). |
| 111 | + |
| 112 | +Note that `DedupeNotMyVbucketClustermap` also affects the `ClustermapChangeNotification` and `ClustermapChangeNotificationBrief` |
| 113 | +features, which are described below. In other words, if deduplication is enabled, a given cluster configuration will be announced on |
| 114 | +the socket connection only once. |
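
The per-connection bookkeeping described above can be modelled roughly as follows. This is only an illustration of the observable behaviour, not kv_engine code: `Connection`, `not_my_vbucket_body`, and the epoch-first comparison of `(epoch, revision)` tuples are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Version = Tuple[int, int]  # (epoch, revision); assumed to compare epoch first


@dataclass
class Connection:
    # Newest configuration version already delivered on this socket, if any.
    pushed: Optional[Version] = None


def not_my_vbucket_body(conn: Connection, active: Version, config: bytes) -> bytes:
    """Return the body to attach to a NotMyVbucket response (hypothetical model)."""
    if conn.pushed is not None and conn.pushed >= active:
        # The client has already seen this configuration on this socket.
        return b""
    conn.pushed = active
    return config
```

With this model, the first `NotMyVbucket` on a connection carries the configuration, and later ones stay empty until the active version advances.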
| 115 | + |
| 116 | +### Enforcing Snappy Compression for Cluster Configuration Payloads |
| 117 | + |
| 118 | +[https://review.couchbase.org/c/kv_engine/+/192152]: # |
| 119 | +[https://review.couchbase.org/c/kv_engine/+/192316]: # |
| 120 | + |
| 121 | +The KV engine introduces a new HELLO flag called `SnappyEverywhere` with a value of `0x13`. Once this flag is |
| 122 | +negotiated, the node will always use the compressed version of the cluster configuration and data type flags will be set |
| 123 | +to `JSON | SNAPPY` (`0x03`). |
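
A client-side check of the data type flags might look like the sketch below. The helper name `decode_config_body` is hypothetical, and because Python's standard library has no Snappy codec, the `decompress` argument is assumed to be a callable supplied by the caller (for example, from whichever Snappy binding the SDK already uses).

```python
import json
from typing import Callable, Optional

DATATYPE_JSON = 0x01
DATATYPE_SNAPPY = 0x02


def decode_config_body(
    datatype: int, body: bytes, decompress: Callable[[bytes], bytes]
) -> Optional[dict]:
    """Decode a cluster configuration payload according to its data type flags."""
    if datatype & DATATYPE_SNAPPY:
        # JSON | SNAPPY (0x03): the body must be decompressed before parsing.
        body = decompress(body)
    if not body:
        # RAW with an empty body: the server had nothing newer to send.
        return None
    return json.loads(body)
```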
| 124 | + |
| 125 | +### `GetClusterConfig` and Out-of-Order Execution |
| 126 | + |
| 127 | +[https://issues.couchbase.com/browse/MB-56885]: # |
| 128 | + |
| 129 | +The HELLO flag `UnorderedExecution` (`0x0e`) enables Out-of-Order (OoO) execution, which allows the KV engine to |
| 130 | +reorder operations. [kv\_engine/docs/UnorderedExecution.md][kv-unordered-execution] provides more details on this |
| 131 | +feature. |
| 132 | + |
| 133 | +The `GetClusterConfig` (`0xb5`) command is explicitly marked as compatible with OoO execution, allowing it to be served |
| 134 | +without waiting for the completion of in-flight operations. Specifically, `GetClusterConfig` will not wait for long |
| 135 | +operations such as mutations with SyncDurability requirements. All current SDKs are expected to be compatible with the |
| 136 | +OoO execution mode, so no changes are expected. |
| 137 | + |
| 138 | +### Cluster Configuration Notification Changes |
| 139 | + |
| 140 | +Prior to server version 7.6, the KV engine had an opt-in feature to push configuration updates to SDKs. This feature |
| 141 | +could be enabled using the HELLO flag `ClustermapChangeNotification` (`0x0d`), which depends on `Duplex` (`0x0c`). More |
| 142 | +details about `Duplex` can be found in [kv\_engine/docs/Duplex.md][kv-duplex]. When both flags are negotiated, the |
| 143 | +server sends unsolicited configuration updates to the SDK and does not expect any acknowledgement. While |
| 144 | +this approach is more responsive than [RFC-0024: Fast Failover][rfc-0024], it has its own |
| 145 | +drawbacks: |
| 146 | + |
| 147 | +1. The SDK subscribes all connections using HELLO, and during rebalance, all connections will receive all notifications. |
| 148 | +2. In a Lambda scenario, if failover occurs while the SDK process is paused, upon resuming, the SDK must process all |
| 149 | + updates on all sockets. This process takes unnecessary time, unlike when the SDK polls every 2.5 seconds. |
| 150 | + |
| 151 | +Starting with version 7.6, the KV engine provides the HELLO flag `ClustermapChangeNotificationBrief` (`0x1f`). This flag |
| 152 | +instructs the KV engine to exclude the cluster configuration content from the notification. In this case, the data type |
| 153 | +will be `RAW` (`0x00`). Below is the typical structure of the notification when the brief mode is enabled: |
| 154 | + |
| 155 | + |
| 156 | +      Byte/     0       |       1       |       2       |       3       | |
| 157 | +         /              |               |               |               | |
| 158 | +        |0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7| |
| 159 | +        +---------------+---------------+---------------+---------------+ |
| 160 | +       0| 0x82          | 0x01          | 0x00          | 0x00          | |
| 161 | +        +---------------+---------------+---------------+---------------+ |
| 162 | +       4| 0x10          | 0x00          | 0x00          | 0x00          | |
| 163 | +        +---------------+---------------+---------------+---------------+ |
| 164 | +       8| 0x00          | 0x00          | 0x00          | 0x10          | |
| 165 | +        +---------------+---------------+---------------+---------------+ |
| 166 | +      12| 0x00          | 0x00          | 0x00          | 0x00          | |
| 167 | +        +---------------+---------------+---------------+---------------+ |
| 168 | +      16| 0x00          | 0x00          | 0x00          | 0x00          | |
| 169 | +        +---------------+---------------+---------------+---------------+ |
| 170 | +      20| 0x00          | 0x00          | 0x00          | 0x00          | |
| 171 | +        +---------------+---------------+---------------+---------------+ |
| 172 | +      24| 0x00          | 0x00          | 0x00          | 0x00          | |
| 173 | +        +---------------+---------------+---------------+---------------+ |
| 174 | +      28| 0x00          | 0x00          | 0x00          | 0x42          | |
| 175 | +        +---------------+---------------+---------------+---------------+ |
| 176 | +      32| 0x01          | 0x02          | 0x03          | 0x04          | |
| 177 | +        +---------------+---------------+---------------+---------------+ |
| 178 | +      36| 0x05          | 0x06          | 0x07          | 0x08          | |
| 179 | +        +---------------+---------------+---------------+---------------+ |
| 182 | +    CLUSTERMAP_CHANGE_NOTIFICATION command |
| 183 | +    Field        (offset) (value) |
| 184 | +    Magic        (0)    : 0x82 (server request, kv_engine -> SDK) |
| 185 | +    Opcode       (1)    : 0x01 |
| 186 | +    Key length   (2,3)  : 0x0000 |
| 187 | +    Extra length (4)    : 0x10 (two int64_t fields in extras) |
| 188 | +    Data type    (5)    : 0x00 (RAW) |
| 189 | +    Vbucket      (6,7)  : 0x0000 |
| 190 | +    Total body   (8-11) : 0x00000010 (16 bytes) |
| 191 | +    Opaque       (12-15): 0x00000000 |
| 192 | +    CAS          (16-23): 0x0000000000000000 |
| 193 | +    Epoch        (24-31): 0x0000000000000042 (66 in base-10) |
| 194 | +    Revision     (32-39): 0x0102030405060708 (72623859790382856 in base-10) |
| 195 | + |
| 199 | +Note that the magic value for this notification is `ServerRequest` (`0x82`), which is enabled by the `Duplex` (`0x0c`) |
| 200 | +HELLO flag. Additionally, similar to the regular cluster configuration notification, the epoch and revision fields are |
| 201 | +sent as extras. |
| 202 | + |
| 203 | +Once the brief cluster configuration notification is received, it is up to the SDK to decide whether to send a |
| 204 | +`GetClusterConfig` (`0xb5`) request to retrieve the actual configuration body. |
| 205 | + |
| 206 | +In essence, the `ClustermapChangeNotificationBrief` feature only saves network traffic. If |
| 207 | +`DedupeNotMyVbucketClustermap` is not enabled, the number of notifications will be the same as before. However, this |
| 208 | +feature can still be used as a building block to implement a debouncing mechanism. When properly configured, it can help |
| 209 | +reduce the number of requests. Further details on this topic will be covered in the "Library Changes" section. |
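
To make the brief notification concrete, the sketch below parses one and decides whether a follow-up `GetClusterConfig` is worthwhile. The names `parse_brief_notification` and `needs_fetch` are hypothetical, and the epoch-first tuple comparison is an assumption about how versions are ordered.

```python
import struct
from typing import Optional, Tuple

MAGIC_SERVER_REQUEST = 0x82
OPCODE_CLUSTERMAP_CHANGE_NOTIFICATION = 0x01

# 24-byte header: magic, opcode, key length, extras length, data type,
# vbucket, total body length, opaque, CAS (all big-endian).
HEADER = struct.Struct(">BBHBBHIIQ")
EXTRAS = struct.Struct(">qq")  # epoch, revision


def parse_brief_notification(packet: bytes) -> Tuple[int, int]:
    """Return (epoch, revision) announced by a clustermap change notification."""
    if len(packet) < HEADER.size + EXTRAS.size:
        raise ValueError("packet too short")
    magic, opcode, _key_len, extras_len, *_rest = HEADER.unpack_from(packet)
    if magic != MAGIC_SERVER_REQUEST or opcode != OPCODE_CLUSTERMAP_CHANGE_NOTIFICATION:
        raise ValueError("not a clustermap change notification")
    if extras_len != EXTRAS.size:
        raise ValueError("unexpected extras length")
    return EXTRAS.unpack_from(packet, HEADER.size)


def needs_fetch(known: Optional[Tuple[int, int]], announced: Tuple[int, int]) -> bool:
    """True if the announced version is newer than what the SDK already holds."""
    return known is None or announced > known
```

An SDK would typically feed the result of `needs_fetch` into its throttling logic rather than issuing a request for every notification.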
| 210 | + |
| 211 | +## Library Changes |
| 212 | + |
| 213 | +### Configuration Push |
| 214 | + |
| 215 | +The previously mentioned `ClustermapChangeNotificationBrief` feature enables the SDK to subscribe all connections for |
| 216 | +configuration updates. These notifications are lightweight and can be deduplicated by the server when the |
| 217 | +`DedupeNotMyVbucketClustermap` option is negotiated. |
| 218 | + |
| 219 | +#### Mixed Clusters |
| 220 | + |
| 221 | +In clusters where there is a mix of nodes with older server versions, meaning that some nodes do not acknowledge |
| 222 | +`ClustermapChangeNotificationBrief`, the respective connection should notify the configuration monitor about its lack of |
| 223 | +support for configuration pushes from the server. As a result, the monitor should utilize the old polling mechanism for |
| 224 | +this particular node instead. |
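
One way to express this per-node decision is sketched below. The function `nodes_requiring_polling` and the shape of its input (a mapping from node address to the set of HELLO feature codes that node acknowledged) are assumptions for illustration only.

```python
from typing import Dict, Iterable, List

DUPLEX = 0x0C
CLUSTERMAP_CHANGE_NOTIFICATION_BRIEF = 0x1F

# Both features must be acknowledged for brief push notifications to work.
PUSH_FEATURES = frozenset({DUPLEX, CLUSTERMAP_CHANGE_NOTIFICATION_BRIEF})


def nodes_requiring_polling(acknowledged: Dict[str, Iterable[int]]) -> List[str]:
    """Return the nodes the configuration monitor should keep polling."""
    return sorted(
        node
        for node, features in acknowledged.items()
        if not PUSH_FEATURES <= set(features)
    )
```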
| 225 | + |
| 226 | +### Enhancements in Handling the `NotMyVbucket` Status |
| 227 | + |
| 228 | +The combination of `DedupeNotMyVbucketClustermap` and `ClustermapChangeNotificationBrief` saves traffic: the server does not |
| 229 | +resend a configuration if the SDK has already seen the same revision, and notifications carry only the `Epoch`/`Revision` pair. It is |
| 230 | +therefore up to the SDK to initiate a configuration update once a non-empty payload is returned along with the `NotMyVbucket` status code. |
| 231 | + |
| 232 | +Several modifications are required in the SDK: |
| 233 | +1. The retry orchestrator should be able to retry an operation based on configuration updates rather than the timer signal. |
| 234 | +2. The configuration monitor should have the ability to throttle configuration requests due to the following reasons: |
| 235 | + 1. During rebalance, multiple operations may return a `NotMyVbucket` status, triggering a configuration refresh. |
| 236 | + 2. Since `ClustermapChangeNotificationBrief` will cause all connections to subscribe to updates and receive them, it is |
| 237 | + necessary to account for potential high volumes of updates. |
| 238 | + |
| 239 | +Below is a diagram that illustrates an example of the SDK workflow, where the GET request is waiting for the arrival of |
| 240 | +a new configuration. |
| 241 | + |
| 242 | +```mermaid |
| 243 | +sequenceDiagram |
| 244 | + autonumber |
| 245 | + conn_1->>+kv_node_1: get("foo", vb=115) |
| 246 | + kv_node_1->>-conn_1: NotMyVbucket(epoch=1, rev=11) |
| 247 | + conn_1-->>+retry_orchestrator: pending(get, "foo", epoch=1, rev=11) |
| 248 | + retry_orchestrator-->retry_orchestrator: put operation to waiting queue |
| 249 | + conn_1-->>+config_monitor: refresh configuration |
| 250 | + config_monitor-->config_monitor: wait to throttle config requests |
| 251 | + config_monitor->>+conn_2: get_config() |
| 252 | + conn_2->>+kv_node_2: get_config() |
| 253 | + kv_node_2->>-conn_2: configuration(epoch=1, rev=11) |
| 254 | + conn_2->>-config_monitor: apply new configuration |
| 255 | + config_monitor->>retry_orchestrator: purge waiting queue(epoch=1, rev=11) |
| 256 | + retry_orchestrator->>conn_1: retry get("foo") |
| 257 | + conn_1->>+kv_node_2: get("foo", vb=115) |
| 258 | + kv_node_2->>-conn_1: Success() |
| 259 | +``` |
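
The waiting queue from the diagram could be sketched as follows. `RetryOrchestrator`, `park`, and `on_config_applied` are hypothetical names, and the release rule (an operation becomes ready once the applied configuration is at least as new as the version reported with its `NotMyVbucket` response) is an assumption derived from the diagram, not a specified behaviour.

```python
from typing import Any, List, Tuple

Version = Tuple[int, int]  # (epoch, revision); assumed to compare epoch first


class RetryOrchestrator:
    """Parks operations until a suitable configuration has been applied."""

    def __init__(self) -> None:
        self._waiting: List[Tuple[Version, Any]] = []

    def park(self, operation: Any, epoch: int, revision: int) -> None:
        # Remember which configuration version the operation is waiting for.
        self._waiting.append(((epoch, revision), operation))

    def on_config_applied(self, epoch: int, revision: int) -> List[Any]:
        """Return operations that can be retried against the new configuration."""
        applied = (epoch, revision)
        ready = [op for version, op in self._waiting if version <= applied]
        self._waiting = [(v, op) for v, op in self._waiting if v > applied]
        return ready

    def pending(self) -> int:
        return len(self._waiting)
```

A real implementation would also need a timeout path so that parked operations do not wait indefinitely if no configuration arrives.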
| 260 | + |
| 261 | +# Language Specifics |
| 262 | + |
| 263 | +## Feature Checklist |
| 264 | + |
| 265 | +1. `GetClusterConfigWithKnownVersion` (`0x1d`). The SDK should always supply its current configuration version if the |
| 266 | +   connection has acknowledged the feature flag. |
| 267 | + |
| 268 | +2. `DedupeNotMyVbucketClustermap` (`0x1e`). The SDK should be prepared for the KV engine not to repeat a configuration |
| 269 | +   payload that has already been sent on the socket by any means (`NotMyVbucket` status, `ClustermapChangeNotification`, |
| 270 | +   `GetClusterConfig`). |
| 271 | + |
| 272 | +3. Out-of-Order Execution. The `Duplex` (`0x0c`) feature should always be negotiated in HELLO. |
| 273 | + |
| 274 | +4. `ClustermapChangeNotificationBrief` (`0x1f`). The SDK should always subscribe to configuration notifications if the |
| 275 | +   server supports them, and fall back to polling if it does not. |
| 276 | + |
| 277 | +5. The SDK should not emit a configuration refresh request if one is already in flight. This should be independent of |
| 278 | +   the source of the signal, as it might come from all the nodes during rebalance when the configuration push is |
| 279 | +   enabled, or from `NotMyVbucket` responses. |
| 280 | + |
| 281 | +6. [OPTIONAL] `SnappyEverywhere` (`0x13`). The SDK should be prepared for the KV engine to send a Snappy-compressed payload with any |
| 282 | +   of the response types (including push notifications). Check for the `SNAPPY` (`0x02`) data type flag. |
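
The in-flight rule from the checklist can be sketched as a small guard around the refresh request. `ConfigRefresher` and its methods are hypothetical, and `fetch` stands in for whatever the SDK uses to send `GetClusterConfig`.

```python
import threading
from typing import Callable


class ConfigRefresher:
    """Coalesces refresh signals so that at most one request is in flight."""

    def __init__(self, fetch: Callable[[], None]) -> None:
        self._fetch = fetch  # hypothetical callable that sends GetClusterConfig
        self._lock = threading.Lock()
        self._in_flight = False

    def request_refresh(self) -> bool:
        """Send a refresh unless one is pending; return True if a request was sent."""
        with self._lock:
            if self._in_flight:
                return False
            self._in_flight = True
        self._fetch()
        return True

    def on_response(self) -> None:
        # Call when the configuration response (or an error) arrives.
        with self._lock:
            self._in_flight = False
```

The guard treats signals from push notifications and from `NotMyVbucket` responses identically, which is the independence the checklist asks for.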
| 283 | + |
| 284 | +# Open Questions |
| 285 | + |
| 286 | +1. Behaviour in mixed clusters. Upgrade, when new nodes can push config, while old nodes cannot. Downgrade, when new |
| 287 | + nodes cannot push configuration (should we even consider downgrade?). |
| 288 | + |
| 289 | +2. TBD |
| 290 | + |
| 291 | +3. TBD |
| 292 | + |
| 293 | +# Revisions |
| 294 | + |
| 295 | +* Revision #1 (2023-XX-YY; Sergey Avseyev) |
| 296 | + * Completed initial draft. |
| 297 | + |
| 298 | +# Signoff |
| 299 | + |
| 300 | +| Language | Team Member | Signoff Date | Revision | |
| 301 | +|-------------|----------------|--------------|----------| |
| 302 | +| .NET | Jeffry Morris | | | |
| 303 | +| C/C++ | Sergey Avseyev | | | |
| 304 | +| Go | Charles Dixon | | | |
| 305 | +| Java/Kotlin | David Nault | | | |
| 306 | +| Node.js | Jared Casey | | | |
| 307 | +| PHP | Sergey Avseyev | | | |
| 308 | +| Python | Jared Casey | | | |
| 309 | +| Ruby | Sergey Avseyev | | | |
| 310 | +| Scala | Graham Pople | | | |
| 311 | + |
| 312 | +[kv-unordered-execution]: https://github.com/couchbase/kv_engine/blob/master/docs/UnorderedExecution.md |
| 313 | +[kv-duplex]: https://github.com/couchbase/kv_engine/blob/master/docs/Duplex.md |
| 314 | +[rfc-0005]: rfc/0005-vbucket-retries.md |
| 315 | +[rfc-0024]: rfc/0024-fast-failover.md |