Taxonomy of failsafe levels #579
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,89 @@ | ||||||||
| --- | ||||||||
| title: Taxonomy of Failsafe Levels | ||||||||
| type: Decision Record | ||||||||
| status: Draft | ||||||||
| track: IaaS | ||||||||
| --- | ||||||||
|
|
||||||||
|
|
||||||||
| ## Abstract | ||||||||
|
|
||||||||
| When redundancy and backups are discussed in the context of clouds, it is not clear under which circumstances these concepts apply to the various resources. | ||||||||
| This decision record aims to define different levels of failure-safety. | ||||||||
|
||||||||
| These levels can then be used in standards to clearly set the scope that certain procedures in e.g. OpenStack offer. | ||||||||
|
|
||||||||
| ## Terminology | ||||||||
|
|
||||||||
| Image | ||||||||
| OpenStack resource, server images usually residing in a network storage backend. | ||||||||
|
||||||||
| Volume | ||||||||
| OpenStack resource, virtual drive which usually resides in a network storage backend. | ||||||||
| Virtual Machine (abbr. VM) | ||||||||
| IaaS resource, also called server, executes workloads from users. | ||||||||
|
||||||||
| Secret | ||||||||
| OpenStack resource, could be a key, a passphrase or a certificate in Barbican. | ||||||||
|
||||||||
| Key Encryption Key (abbr. KEK) | ||||||||
| OpenStack resource, used to encrypt other keys to be able to store them encrypted in a database. | ||||||||
| Floating IP (abbr. FIP) | ||||||||
|
||||||||
| OpenStack resource, an IP that is usually reachable from the internet. | ||||||||
|
||||||||
| Disk | ||||||||
| A physical disk in a deployment. | ||||||||
|
||||||||
| Node | ||||||||
| A physical machine in a deployment. | ||||||||
|
||||||||
| Cyber threat | ||||||||
| Attacks on the cloud. | ||||||||
|
||||||||
|
|
||||||||
| ## Context | ||||||||
|
|
||||||||
| Some standards will talk about or require procedures to back up resources or to provide redundancy for resources. | ||||||||
|
||||||||
| This decision record discusses which failure threats CSPs are facing and groups them into several levels. | ||||||||
|
||||||||
| Consequently, these levels should be used in standards that talk about redundancy or failure-safety. | ||||||||
|
||||||||
|
|
||||||||
| ## Decision | ||||||||
|
|
||||||||
| First, an overview of possible failure cases in deployments is needed: | ||||||||
|
||||||||
|
|
||||||||
| | Failure Case | Probability | Consequences | | ||||||||
| |----|-----|----| | ||||||||
| | Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (database, user data) | | ||||||||
|
||||||||
| | Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (database, user data) | | |
| | Disk Failure/Loss | High | Permanent data loss on this disk. Impact depends on type of lost data (database, user data) | |
I would prefer to distinguish between Node Failure/Loss, meaning the hardware is irrecoverably damaged, and Node Outage, caused by a power outage, as the two cases have different implications. Furthermore, we should define a node as computation hardware without disks. This makes the failure cases easier to classify.
| | Node Outage | Medium to High | Data loss on node / (temporary) loss of functionality and connectivity of node (impact depends on type of node) | | |
| | Node Failure/Loss (without disks) | Medium to High | Permanent loss of functionality and connectivity of node (impact depends on type of node) | | |
| | Node Outage | Medium to High | Temporary loss of functionality and connectivity of node (impact depends on type of node) | |
A rack outage means an outage of all nodes in the rack. As the disks are not damaged, I would prefer to limit the consequences to:
| | Rack Outage | Medium | similar to Disk Failure and Node Outage | | |
| | Rack Outage | Medium | Outage of all nodes in rack | |
As I said, I would omit "data loss" and focus on the major consequence. Most protocols work with acknowledgments, so we can assume that any data loss is temporary. What is really lost is the data in CPU and RAM, but we should omit these consequences, as we cannot prevent or avoid them.
| | Power Outage (Data Center supply) | Medium | potential data loss, temporary loss of functionality and connectivity of node (impact depends on type of node) | | |
| | Power Outage (Data Center supply) | Medium | temporary outage of all nodes in rack (impact depends on type of node) | |
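
For tooling that wants to refer to these failure cases, the rows as proposed in the suggestions above could be captured in a small data structure. The following is only a minimal sketch: the names `Probability`, `FailureCase`, `FAILURE_CASES` and `cases_by_probability` are hypothetical and not part of the decision record, and the entries simply mirror the suggested table rows, which may still change during review.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Probability(Enum):
    """Probability labels as they appear in the failure case table."""
    MEDIUM = "Medium"
    MEDIUM_TO_HIGH = "Medium to High"
    HIGH = "High"


@dataclass(frozen=True)
class FailureCase:
    """One row of the failure case table (hypothetical representation)."""
    name: str
    probability: Probability
    consequences: str


# Rows taken from the suggested versions of the table in this review.
FAILURE_CASES = [
    FailureCase("Disk Failure/Loss", Probability.HIGH,
                "Permanent data loss on this disk"),
    FailureCase("Node Failure/Loss (without disks)", Probability.MEDIUM_TO_HIGH,
                "Permanent loss of functionality and connectivity of node"),
    FailureCase("Node Outage", Probability.MEDIUM_TO_HIGH,
                "Temporary loss of functionality and connectivity of node"),
    FailureCase("Rack Outage", Probability.MEDIUM,
                "Outage of all nodes in rack"),
    FailureCase("Power Outage (Data Center supply)", Probability.MEDIUM,
                "Temporary outage of all nodes in rack"),
]


def cases_by_probability(cases):
    """Group failure case names by their probability label."""
    grouped = defaultdict(list)
    for case in cases:
        grouped[case.probability.value].append(case.name)
    return dict(grouped)


if __name__ == "__main__":
    for label, names in cases_by_probability(FAILURE_CASES).items():
        print(f"{label}: {', '.join(names)}")
```

Grouping by the probability label is just one possible view; a later version of the record might instead group the cases by the failsafe level they belong to.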