Merged

Commits (40)
88ede08
[DRAFT] Create scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei Apr 26, 2024
5e0742d
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei Apr 26, 2024
9a1c2cd
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei Apr 26, 2024
e0c87bf
Apply suggestions from code review
josephineSei Apr 29, 2024
41a75a2
edit more wording
josephineSei Apr 29, 2024
020bf8b
change gloassary section to table
josephineSei Apr 29, 2024
f0f75cb
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei May 2, 2024
d475eb1
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei May 2, 2024
04be929
editing table of classifictaion, as we discussed in the meeting
josephineSei May 24, 2024
367d992
Apply suggestions from code review
josephineSei May 28, 2024
b190440
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei May 28, 2024
525d9e8
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei Jun 12, 2024
ba729d4
K8s failure cases
cah-hbaum Jun 24, 2024
fbca525
Extend glossary with K8s terms and split into sections
martinmo Jun 24, 2024
2777c6a
Categorize the failure scenarios & try to add structure
martinmo Jun 24, 2024
a9633b1
Distinguish between impacts on IaaS and KaaS layer
martinmo Jun 24, 2024
d3fea7f
Merge branch 'main' into taxonomy-of-failsafe-levels
martinmo Jun 24, 2024
9dfb9c0
Fix markdownlint error
martinmo Jun 24, 2024
9d22126
Apply restructuring suggestions by Josephine
martinmo Jul 8, 2024
437217f
Further work on taxonomy draft (WIP)
martinmo Jul 8, 2024
b904df0
Adding glossary at the right point
josephineSei Aug 16, 2024
57b1d30
Extend the context and glossary and make a better consequences table …
josephineSei Aug 16, 2024
05418ff
Merge branch 'main' into taxonomy-of-failsafe-levels
josephineSei Aug 16, 2024
36c0d7f
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei Aug 19, 2024
2d1663b
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei Aug 22, 2024
358b429
Create scs-XXXX-v1-example-impacts-of-failure-scenarios.md
josephineSei Aug 23, 2024
dcd910b
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei Aug 23, 2024
d375608
Merge branch 'main' into taxonomy-of-failsafe-levels
josephineSei Aug 23, 2024
9d59e63
Merge branch 'main' into taxonomy-of-failsafe-levels
josephineSei Aug 26, 2024
1f3de87
Update and rename scs-XXXX-v1-example-impacts-of-failure-scenarios.md…
josephineSei Aug 26, 2024
2a492f8
Update scs-XXXX-w1-example-impacts-of-failure-scenarios.md
josephineSei Aug 26, 2024
53c6521
Apply suggestions from code review
josephineSei Sep 6, 2024
90e311d
Merge branch 'main' into taxonomy-of-failsafe-levels
josephineSei Sep 10, 2024
0e39254
fix(kaas): use PV instead of PVC as this is actually the Volume
jschoone Sep 10, 2024
2a52226
feat(kaas): first proposal for levels on kaas layer
jschoone Sep 10, 2024
1fbab3a
Merge branch 'main' into taxonomy-of-failsafe-levels
josephineSei Sep 25, 2024
5ffe31a
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei Sep 25, 2024
7931127
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei Sep 25, 2024
ee531ad
Rename scs-XXXX-vN-taxonomy-of-failsafe-levels.md to scs-0118-v1-taxo…
josephineSei Sep 25, 2024
37ec252
Update and rename scs-XXXX-w1-example-impacts-of-failure-scenarios.md…
josephineSei Sep 25, 2024
89 changes: 89 additions & 0 deletions Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md
@@ -0,0 +1,89 @@
---
title: Taxonomy of Failsafe Levels
type: Decision Record
status: Draft
track: IaaS
---


## Abstract

When talking about redundancy and backups in the context of clouds, it is not clear under which circumstances these concepts work for the various resources.
This decision record aims to define different levels of failure safety.
These levels can then be used in standards to clearly set the scope that certain procedures, e.g. in OpenStack, offer.

## Terminology

| Term | Description |
|---|---|
| Image | OpenStack resource: a server image, usually residing in a network storage backend. |
| Volume | OpenStack resource: a virtual drive, which usually resides in a network storage backend. |
| Virtual Machine (abbr. VM) | IaaS resource, also called server: executes workloads of users. |
| Secret | OpenStack resource: a key, a passphrase or a certificate managed in Barbican. |
| Key Encryption Key (abbr. KEK) | OpenStack resource: used to encrypt other keys so that they can be stored encrypted in a database. |
| Floating IP (abbr. FIP) | OpenStack resource: an IP that is usually reachable from the internet. |
| Disk | A physical disk in a deployment. |
| Node | A physical machine in a deployment. |
| Cyber threat | Attacks on the cloud. |
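
To make these terms more tangible, the following is a minimal, purely illustrative sketch (not part of this decision record) that enumerates the IaaS resources named above via the openstacksdk Python client. The cloud name `mycloud` is an assumption and has to match an entry in your `clouds.yaml`.

```python
# Illustrative sketch only; assumes a cloud named "mycloud" is configured in clouds.yaml.
# It merely enumerates the IaaS resource types listed in the glossary above.
import openstack

conn = openstack.connect(cloud="mycloud")

images = list(conn.image.images())            # Image: server images in the image store
volumes = list(conn.block_storage.volumes())  # Volume: virtual drives in the storage backend
servers = list(conn.compute.servers())        # Virtual Machine: servers executing user workloads
secrets = list(conn.key_manager.secrets())    # Secret: keys/passphrases/certificates (requires Barbican)
fips = list(conn.network.ips())               # Floating IP: IPs usually reachable from the internet

for name, items in [("images", images), ("volumes", volumes),
                    ("servers", servers), ("secrets", secrets),
                    ("floating IPs", fips)]:
    print(f"{name}: {len(items)}")
```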

## Context

Some standards will talk about or require procedures to back up resources or to have redundancy for resources.
This decision record discusses which failure threats CSPs are facing and groups them into several levels.
These levels should then be used in standards that talk about redundancy or failure safety.

## Decision

First, there needs to be an overview of possible failure cases in deployments:

| Failure Case | Probability | Consequences |
|----|-----|----|
| Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (data base, user data) |

Review comment (Contributor): In favor of simplicity, I would assume that a disk loss/failure causes a permanent loss of the data on this disk.

Suggested change:
| Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (data base, user data) |
| Disk Failure/Loss | High | Permanent data loss on this disk. Impact depends on type of lost data (data base, user data) |

| Node Outage | Medium to High | Data loss on node / (temporary) loss of functionality and connectivity of node (impact depends on type of node) |

Review comment (Contributor): I prefer to differentiate between Node Failure/Loss, meaning the hardware is irrecoverably damaged, and Node Outage, caused e.g. by a power outage, as both cases have different implications. Furthermore, we should define a node as computation hardware without disks. This facilitates the classification of use cases.

Suggested change:
| Node Outage | Medium to High | Data loss on node / (temporary) loss of functionality and connectivity of node (impact depends on type of node) |
| Node Failure/Loss (without disks) | Medium to High | Permanent loss of functionality and connectivity of node (impact depends on type of node) |
| Node Outage | Medium to High | Temporary loss of functionality and connectivity of node (impact depends on type of node) |

| Rack Outage | Medium | similar to Disk Failure and Node Outage |

Review comment (Contributor): A rack outage means an outage of all nodes in the rack. As the disks are not damaged, I prefer to limit the consequences to:

Suggested change:
| Rack Outage | Medium | similar to Disk Failure and Node Outage |
| Rack Outage | Medium | Outage of all nodes in rack |

| Power Outage (Data Center supply) | Medium | potential data loss, temporary loss of functionality and connectivity of node (impact depends on type of node) |

Review comment (Contributor): As I said, I would omit "data loss" and focus on the big consequence. Most protocols work with acknowledgments, hence we can assume that data loss is temporary. What we really lose is CPU and RAM data, but we should omit these consequences, as we cannot prevent or avoid them.

Suggested change:
| Power Outage (Data Center supply) | Medium | potential data loss, temporary loss of functionality and connectivity of node (impact depends on type of node) |
| Power Outage (Data Center supply) | Medium | temporary outage of all nodes in rack (impact depends on type of node) |

| Fire | Medium | permanent Disk and Node loss in the affected zone |
| Flood | Low | permanent Disk and Node loss in the affected zone |
| Earthquake | Very Low | permanent Disk and Node loss in the affected zone |
| Storm/Tornado | Low | permanent Disk and Node loss in the affected zone |
| Cyber threat | High | permanent loss of data on the affected Disks and Nodes |

These failure cases can result in a temporary (T) or permanent (P) loss of the resource or of the data within it.
Additionally, there are a lot of resources in IaaS alone that are affected by these failure cases to different degrees.
The following table shows how they are affected, without considering any redundancy or failure-safety measures being in use (a small code sketch of this matrix follows below the table):

| Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | Natural Catastrophe | Cyber Threat |
|----|----|----|----|----|----|----|
| Image | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P |
| Volume | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P |
| User Data in RAM/CPU | | P | P | P | P | T/P |
| volume-based VM | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P |
| ephemeral-based VM | P (if on disk) | P | P | T | P (T if lucky) | T/P |
| Secret | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P |
| network configuration (DB objects) | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P |
| network connectivity (materialization) | | T (if on node) | T/P | T | P (T if lucky) | T/P |
| floating IP | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P |
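
As a purely illustrative aid (not part of this decision record), the matrix above could be encoded in code. The names `Impact`, `IMPACT_MATRIX` and `impact_of` are hypothetical, and only two of the resources are shown.

```python
# Illustrative sketch only: encodes a few rows of the impact matrix above.
from enum import Enum


class Impact(Enum):
    NONE = "no direct impact"
    TEMPORARY = "T"        # resource/data temporarily unavailable
    PERMANENT = "P"        # resource/data permanently lost
    TEMP_OR_PERM = "T/P"   # depends on placement, redundancy and luck


# resource -> {failure case -> impact}, without any redundancy or failure safety in use
IMPACT_MATRIX = {
    "volume": {
        "disk loss": Impact.PERMANENT,    # P (if on the failed disk)
        "node loss": Impact.TEMPORARY,    # T (if on the failed node)
        "rack loss": Impact.TEMP_OR_PERM,
        "power loss": Impact.TEMPORARY,
        "natural catastrophe": Impact.TEMP_OR_PERM,
        "cyber threat": Impact.TEMP_OR_PERM,
    },
    "ephemeral-based VM": {
        "disk loss": Impact.PERMANENT,
        "node loss": Impact.PERMANENT,
        "rack loss": Impact.PERMANENT,
        "power loss": Impact.TEMPORARY,
        "natural catastrophe": Impact.TEMP_OR_PERM,
        "cyber threat": Impact.TEMP_OR_PERM,
    },
}


def impact_of(resource: str, failure_case: str) -> Impact:
    """Look up the (worst case) impact of a failure case on a resource."""
    return IMPACT_MATRIX[resource][failure_case]


print(impact_of("ephemeral-based VM", "node loss"))  # Impact.PERMANENT
```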

For some cases there are only temporary unavailabilities, and clouds do have certain workflows to avoid data loss, like redundancy in storage backends and databases.
So some of these outages are easier to handle than others.
A possible way to group the failure cases into levels, considering the matrix of impacts above, would be (see also the sketch after the table):

| Level/Class | Extent of impact | Use Cases |
|---|---|-----|
| 1. Level | single volumes, VMs... | Disk Failure, Node outage, (maybe rack outage) |
| 2. Level | a number of resources, most of the time recoverable | Rack outage, (Fire), (Power outage when independent power supplies exist) |
| 3. Level | lots of resources / user data + potentially not recoverable | Fire, Earthquake, Storm/Tornado, Power Outage |
| 4. Level | complete deployment, not recoverable | Flood, Fire |
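
Purely as an illustration of how such a grouping could be consumed by tooling or other standards, a small sketch of a failure-case-to-level mapping might look like the following; the mapping and the function `required_level` are hypothetical and not defined by this decision record.

```python
# Illustrative sketch only: one possible encoding of the proposed failsafe levels.
FAILSAFE_LEVEL = {
    "disk failure": 1,
    "node outage": 1,
    "rack outage": 2,      # could also be argued to be level 1 in small cases
    "fire": 3,             # up to level 4 if the whole deployment is affected
    "power outage": 3,     # level 2 when independent power supplies exist
    "earthquake": 3,
    "storm/tornado": 3,
    "flood": 4,
}


def required_level(failure_cases: list[str]) -> int:
    """Return the highest failsafe level needed to cover the given failure cases."""
    return max(FAILSAFE_LEVEL[case] for case in failure_cases)


# A standard requiring protection against disk failure and rack outage
# would therefore need measures up to level 2:
print(required_level(["disk failure", "rack outage"]))  # 2
```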

Unfortunately, a comparable classification does not seem to exist right now.

## Consequences

Using this definition of levels throughout all SCS standards would allow readers to know up to which level certain procedures or properties of resources (e.g. volume types or a backend requiring redundancy) protect their data.