Taxonomy of failsafe levels #579
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,86 @@ | ||||||||
| --- | ||||||||
| title: Taxonomy of Failsafe Levels | ||||||||
| type: Decision Record | ||||||||
| status: Draft | ||||||||
| track: IaaS | ||||||||
| --- | ||||||||
|
|
||||||||
|
|
||||||||
| ## Abstract | ||||||||
|
|
||||||||
| When talking about redundancy and backups in the context of cloud infrastructures, the circumstances under which these concepts apply to various resources are neither homogeneous nor intuitive. | ||||||||
| This decision record aims to define different levels of failure-safety. | ||||||||
| These levels can then be used in standards to clearly set the scope of what certain procedures, e.g. in OpenStack, offer. | ||||||||
|
|
||||||||
| ## Glossary | ||||||||
|
|
||||||||
| | Term | Explanation | | ||||||||
| | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | | ||||||||
| | Virtual Machine | Equals the `server` resource in Nova. | | ||||||||
| | Ephemeral Storage | Disk storage directly supplied to a virtual machine by Nova. Different from volumes. | | ||||||||
| | (Glance) Image | IaaS resource usually storing raw disk data. Managed by the Glance service. | | ||||||||
| | (Cinder) Volume | IaaS resource representing a block storage disk that can be attached as a virtual disk to virtual machines. Managed by the Cinder service. | | ||||||||
| | (Volume) Snapshot | Thinly-provisioned copy-on-write snapshots of volumes. Stored in the same Cinder storage backend as volumes. | | ||||||||
| | Volume Type | Attribute of volumes determining storage details of a volume such as backend location or whether the volume will be encrypted. | | ||||||||
| | (Barbican) Secret | IaaS resource storing cryptographic assets such as encryption keys. Managed by the Barbican service. | | ||||||||
| | Key Encryption Key | IaaS resource, used to encrypt other keys to be able to store them encrypted in a database. | | ||||||||
|
||||||||
| | Floating IP | IaaS resource, an IP that is usually routed and accessible from external networks. | | ||||||||
| | Disk | A physical disk drive (e.g. HDD, SSD) in the infrastructure. | | ||||||||
| | Node | A physical machine in the infrastructure. | | ||||||||
josephineSei marked this conversation as resolved.
Outdated
|
||||||||
| | Cyber threat | Attacks on the infrastructure by means of electronic access. | | ||||||||
|
|
||||||||
| ## Context | ||||||||
|
|
||||||||
| Some standards provided by the SCS project will talk about or require procedures to back up resources or have redundancy for resources. | ||||||||
| This decision record discusses which failure threats are CSP-facing and classifies them into several levels. | ||||||||
| Consequently, these levels should be used in standards concerning redundancy or failure safety. | ||||||||
|
|
||||||||
| ## Decision | ||||||||
|
|
||||||||
| First there needs to be an overview of possible failure cases in infrastructures: | ||||||||
|
||||||||
| First there needs to be an overview of possible failure cases in infrastructures: | |
| First there needs to be an overview of possible failure cases in infrastructures, as well as their probability of occurrence and the damage they may cause. |
Outdated
In favor of simplicity, I would assume that disk loss/failure causes permanent loss of the data on this disk.
| | Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (database, user data) | | |
| | Disk Failure/Loss | High | Permanent data loss on this disk. Impact depends on type of lost data (database, user data) | |
Outdated
I prefer to differentiate between Node Failure/Loss, meaning the hardware is irrecoverably damaged, and Node Outage, caused by a power outage, as the two cases have different implications. Furthermore, we should define a node as computation hardware without disks. This makes the cases easier to classify.
| | Node Outage | Medium to High | Data loss on node / (temporary) loss of functionality and connectivity of node (impact depends on type of node) | | |
| | Node Failure/Loss (without disks) | Medium to High | Permanent loss of functionality and connectivity of node (impact depends on type of node) | | |
| | Node Outage | Medium to High | Temporary loss of functionality and connectivity of node (impact depends on type of node) | |
Outdated
Rack outage means outage of all nodes. As disks are not damaged, I prefer to limit consequences to
| | Rack Outage | Medium | similar to Disk Failure and Node Outage | | |
| | Rack Outage | Medium | Outage of all nodes in rack | |
Outdated
As I said, I would omit "data loss" and focus on the major consequence. Most protocols work with acknowledgments, so we can assume that data loss is temporary. What we really lose is the data in CPU and RAM, but we should omit these consequences, as we cannot prevent or avoid them.
| | Power Outage (Data Center supply) | Medium | potential data loss, temporary loss of functionality and connectivity of node (impact depends on type of node) | | |
| | Power Outage (Data Center supply) | Medium | temporary outage of all nodes in rack (impact depends on type of node) | |
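For illustration only, the failure cases suggested in this thread could be written down as structured data, which makes it easy to group them by likelihood when assigning failsafe levels. This is a minimal sketch: the `FailureCase` class, its field names, and the `by_probability` helper are assumptions and not part of the decision record, and only the rows visible in this review are listed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureCase:
    """One row of the failure-case overview (field names are hypothetical)."""
    name: str
    probability: str   # rating as worded in the suggestions, e.g. "High"
    consequence: str


# Rows taken from the suggested changes in this review thread.
FAILURE_CASES = [
    FailureCase("Disk Failure/Loss", "High",
                "Permanent data loss on this disk"),
    FailureCase("Node Failure/Loss (without disks)", "Medium to High",
                "Permanent loss of functionality and connectivity of node"),
    FailureCase("Node Outage", "Medium to High",
                "Temporary loss of functionality and connectivity of node"),
    FailureCase("Rack Outage", "Medium",
                "Outage of all nodes in rack"),
    FailureCase("Power Outage (Data Center supply)", "Medium",
                "Temporary outage of all nodes in rack"),
]


def by_probability(cases):
    """Group failure case names by their probability rating."""
    groups = defaultdict(list)
    for case in cases:
        groups[case.probability].append(case.name)
    return dict(groups)


if __name__ == "__main__":
    for rating, names in by_probability(FAILURE_CASES).items():
        print(f"{rating}: {', '.join(names)}")
```

Keeping the ratings as plain strings mirrors the wording of the table; a stricter variant could use an enumeration once the set of ratings is settled.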