Merged
Changes from 9 commits
40 commits
88ede08
[DRAFT] Create scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei Apr 26, 2024
5e0742d
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei Apr 26, 2024
9a1c2cd
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei Apr 26, 2024
e0c87bf
Apply suggestions from code review
josephineSei Apr 29, 2024
41a75a2
edit more wording
josephineSei Apr 29, 2024
020bf8b
change gloassary section to table
josephineSei Apr 29, 2024
f0f75cb
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei May 2, 2024
d475eb1
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei May 2, 2024
04be929
editing table of classifictaion, as we discussed in the meeting
josephineSei May 24, 2024
367d992
Apply suggestions from code review
josephineSei May 28, 2024
b190440
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei May 28, 2024
525d9e8
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei Jun 12, 2024
ba729d4
K8s failure cases
cah-hbaum Jun 24, 2024
fbca525
Extend glossary with K8s terms and split into sections
martinmo Jun 24, 2024
2777c6a
Categorize the failure scenarios & try to add structure
martinmo Jun 24, 2024
a9633b1
Distinguish between impacts on IaaS and KaaS layer
martinmo Jun 24, 2024
d3fea7f
Merge branch 'main' into taxonomy-of-failsafe-levels
martinmo Jun 24, 2024
9dfb9c0
Fix markdownlint error
martinmo Jun 24, 2024
9d22126
Apply restructuring suggestions by Josephine
martinmo Jul 8, 2024
437217f
Further work on taxonomy draft (WIP)
martinmo Jul 8, 2024
b904df0
Adding glossary at the right point
josephineSei Aug 16, 2024
57b1d30
Extend the context and glossary and make a better consequences table …
josephineSei Aug 16, 2024
05418ff
Merge branch 'main' into taxonomy-of-failsafe-levels
josephineSei Aug 16, 2024
36c0d7f
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei Aug 19, 2024
2d1663b
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei Aug 22, 2024
358b429
Create scs-XXXX-v1-example-impacts-of-failure-scenarios.md
josephineSei Aug 23, 2024
dcd910b
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei Aug 23, 2024
d375608
Merge branch 'main' into taxonomy-of-failsafe-levels
josephineSei Aug 23, 2024
9d59e63
Merge branch 'main' into taxonomy-of-failsafe-levels
josephineSei Aug 26, 2024
1f3de87
Update and rename scs-XXXX-v1-example-impacts-of-failure-scenarios.md…
josephineSei Aug 26, 2024
2a492f8
Update scs-XXXX-w1-example-impacts-of-failure-scenarios.md
josephineSei Aug 26, 2024
53c6521
Apply suggestions from code review
josephineSei Sep 6, 2024
90e311d
Merge branch 'main' into taxonomy-of-failsafe-levels
josephineSei Sep 10, 2024
0e39254
fix(kaas): use PV instead of PVC as this is actually the Volume
jschoone Sep 10, 2024
2a52226
feat(kaas): first proposal for levels on kaas layer
jschoone Sep 10, 2024
1fbab3a
Merge branch 'main' into taxonomy-of-failsafe-levels
josephineSei Sep 25, 2024
5ffe31a
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei Sep 25, 2024
7931127
Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md
josephineSei Sep 25, 2024
ee531ad
Rename scs-XXXX-vN-taxonomy-of-failsafe-levels.md to scs-0118-v1-taxo…
josephineSei Sep 25, 2024
37ec252
Update and rename scs-XXXX-w1-example-impacts-of-failure-scenarios.md…
josephineSei Sep 25, 2024
101 changes: 101 additions & 0 deletions Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md
@@ -0,0 +1,101 @@
---
title: Taxonomy of Failsafe Levels
type: Decision Record
status: Draft
track: IaaS
---


## Abstract

When talking about redundancy and backups in the context of cloud infrastructures, the circumstances under which these concepts apply to various resources are neither homogeneous nor intuitive.
This decision record aims to define different levels of failure safety.
These levels can then be used in standards to clearly state the scope of protection that certain procedures, e.g. in OpenStack, offer.

## Glossary

| Term | Explanation |
| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- |
| Virtual Machine | Equals the `server` resource in Nova. |
| Ironic Machine | A physical node managed by Ironic or as a `server` resource in Nova. |
| Ephemeral Storage | Disk storage directly supplied to a virtual machine by Nova. Different from volumes. |
| (Glance) Image | IaaS resource usually storing raw disk data. Managed by the Glance service. |
| (Cinder) Volume | IaaS resource representing block storage disk that can be attached as a virtual disk to virtual machines. Managed by the Cinder service. |
| (Volume) Snapshot | Thinly-provisioned copy-on-write snapshots of volumes. Stored in the same Cinder storage backend as volumes. |
| Volume Type | Attribute of volumes determining storage details of a volume such as backend location or whether the volume will be encrypted. |
| (Barbican) Secret | IaaS resource storing cryptographic assets such as encryption keys. Managed by the Barbican service. |
| Key Encryption Key | IaaS resource, used to encrypt other keys to be able to store them encrypted in a database. |
Contributor


Is a key encryption key really an IaaS resource? I thought key encryption keys are stored in configuration files, and if this is the case, it is a configuration setting.

Contributor Author


The glossary is meant to describe all terms that may be unknown. As this standard also concerns Key Encryption Keys, the term should be listed in the glossary.

| Floating IP | IaaS resource, an IP that is usually routed and accessible from external networks. |
| Disk | A physical disk drive (e.g. HDD, SSD) in the infrastructure. |
| Node | A physical machine in the infrastructure. |
| Cyber threat | Attacks on the infrastructure through the means of electronic access. |

## Context

Some standards provided by the SCS project will talk about or require procedures to back up resources or to have redundancy for resources.
This decision record discusses which failure threats are CSP-facing and classifies them into several levels.
Consequently, these levels should be used in standards concerning redundancy or failure safety.

## Decision

First, there needs to be an overview of possible failure cases in infrastructures as well as their probability of occurrence and the damage they may cause:

| Failure Case | Probability | Consequences |
|----|-----|----|
| Disk Failure/Loss | High | Permanent data loss on this disk. Impact depends on the type of lost data (database, user data) |
| Node Failure/Loss (without disks) | Medium to High | Permanent loss of functionality and connectivity of the node (impact depends on type of node) |
| Node Outage | Medium to High | Data loss in RAM and temporary loss of functionality and connectivity of the node (impact depends on type of node) |
| Rack Outage | Medium | Outage of all nodes in the rack |
| Power Outage (Data Center supply) | Medium | Temporary outage of all nodes in all racks |
| Fire | Medium | Permanent Disk and Node loss in the affected zone |
| Flood | Low | Permanent Disk and Node loss in the affected zone |
| Earthquake | Very Low | Permanent Disk and Node loss in the affected zone |
| Storm/Tornado | Low | Permanent Disk and Node loss in the affected zone |
| Cyber threat | High | Permanent loss or compromise of data on the affected Disk and Node |
| Software Bug | High | Permanent loss or compromise of data, ranging from the data that triggers the bug up to all data on the whole physical machine |

These failure cases can result in temporary (T) or permanent (P) loss of a resource or the data within it.
Additionally, there are many resources in IaaS alone that are affected by these failure cases to varying degrees.
The following table shows the impact when no redundancy or failure safety measure is in place:

| Resource | Disk Loss | Node Loss | Rack Loss | Power Loss | Natural Catastrophe | Cyber Threat | Software Bug |
|----|----|----|----|----|----|----|----|
| Image | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | P |
| Volume | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | P |
| User Data in RAM/CPU | | P | P | P | P | T/P | P |
| volume-based VM | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | P |
| ephemeral-based VM | P (if on disk) | P | P | T | P (T if lucky) | T/P | P |
| Ironic-based VM | P (all data on disk) | P | P | T | P (T if lucky) | T/P | P |
| Secret | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | P |
| network configuration (DB objects) | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | P |
| network connectivity (materialization) | | T (if on node) | T/P | T | P (T if lucky) | T/P | T |
| floating IP | P (if on disk) | T (if on node) | T/P | T | P (T if lucky) | T/P | T |
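
As an illustration only, the impact matrix above could be encoded as a lookup structure so that standards tooling can query it. The following is a minimal sketch under stated assumptions: the names `Impact`, `IMPACT_MATRIX` and `may_be_permanent` are hypothetical, only an excerpt of the table is included, and qualifiers such as "(if on disk)" are dropped.

```python
from enum import Enum


class Impact(Enum):
    """Kinds of loss used in the impact matrix (hypothetical encoding)."""
    NONE = ""                        # empty cell in the table
    TEMPORARY = "T"
    PERMANENT = "P"
    TEMPORARY_OR_PERMANENT = "T/P"


# Excerpt of the matrix: resource -> failure case -> impact.
# Keys are illustrative names, not identifiers from any SCS standard.
IMPACT_MATRIX = {
    "volume": {
        "disk_loss": Impact.PERMANENT,
        "node_loss": Impact.TEMPORARY,
        "rack_loss": Impact.TEMPORARY_OR_PERMANENT,
        "power_loss": Impact.TEMPORARY,
    },
    "ephemeral_vm": {
        "disk_loss": Impact.PERMANENT,
        "node_loss": Impact.PERMANENT,
        "rack_loss": Impact.PERMANENT,
        "power_loss": Impact.TEMPORARY,
    },
    "user_data_in_ram": {
        "disk_loss": Impact.NONE,
        "node_loss": Impact.PERMANENT,
        "rack_loss": Impact.PERMANENT,
        "power_loss": Impact.PERMANENT,
    },
}


def may_be_permanent(resource: str, failure_case: str) -> bool:
    """Return True if the failure case can cause permanent loss of the resource."""
    impact = IMPACT_MATRIX[resource][failure_case]
    return impact in (Impact.PERMANENT, Impact.TEMPORARY_OR_PERMANENT)
```

For example, `may_be_permanent("volume", "power_loss")` evaluates to `False`, because the table lists only a temporary loss for that combination.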

In some cases, this results only in temporary unavailability, and cloud infrastructures usually have mechanisms in place to avoid data loss, such as redundancy in storage backends and databases.
Some of these outages are therefore easier to mitigate than others.
One possible way to derive levels from the impact matrix is to order the failure cases from small to large.
The following table shows such a classification, together with the probability of occurrence of a failure case in each class and the resources with user data that might be affected.

:::caution

This table only contains examples of failure cases and examples of affected resources.
This should not be used as a replacement for a risk analysis.
The column **user hints** only shows examples of standards that may provide this class of failure safety for a certain resource.
Customers should always check what they can do to protect their data and not rely solely on the CSP.

:::

| Level/Class | Probability | Failure Causes | Loss in IaaS | User Hints |
Contributor Author


I am still thinking about the "user hints" column. Putting it next to the other columns is good from some perspectives, as it can be read as: I want to achieve the 2nd level of failure safety, which can be triggered by these failure causes that will result in these losses on the IaaS level, so I can do what is shown in the user hints.
But we wanted the classification not as examples for users, but mainly as a definition for standards, so maybe we should not reference those standards here.
We could rather use an extra table with example actions (standards, "user has to do things", ...) for each level/class, or maybe this should not be in a decision record at all, but rather in a guide or similar.

Contributor


I agree with you, @josephineSei. Linking SCS standards in this table may cause a huge synchronization effort: we would always have to update this DR if the referenced standards change. I appreciate the user hints column, but would limit it to a textual explanation. See my suggestions below...

|---|---|---|-----|-----|
| 1. Level | Very High | Small hardware or software failures (e.g. Disk/Node failure, software bug, ...) | Individual volumes, VMs, ... | [volume replication](https://docs.scs.community/standards/scs-0114-v1-volume-type-standard) |
| 2. Level | High | Important hardware or software failures (e.g. rack outage, small fire, power outage, ...) | Limited number of resources, sometimes recoverable | [volume backups](https://github.com/SovereignCloudStack/standards/pull/567) |
| 3. Level | Medium | Small catastrophes or major failures (e.g. fire, regional power outage, orchestrated cyber attacks, ...) | Lots of resources / user data, potentially not recoverable | Availability Zones, user responsibility |
| 4. Level | Low | Whole deployment loss (e.g. natural disaster, ...) | Entire infrastructure, not recoverable | User responsibility |
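
To make the intended use in other standards more concrete, the classification could be expressed as an ordered enumeration. The sketch below is not part of any SCS standard: `FailsafeLevel`, `EXAMPLE_FAILURE_CAUSES` and `protects_against` are hypothetical names, the mapping only repeats the example failure causes from the table, and it assumes that a measure rated for a given level also covers all lower levels.

```python
from enum import IntEnum


class FailsafeLevel(IntEnum):
    """The four levels of the classification table (hypothetical encoding)."""
    LEVEL_1 = 1  # small hardware or software failures
    LEVEL_2 = 2  # important hardware or software failures
    LEVEL_3 = 3  # small catastrophes or major failures
    LEVEL_4 = 4  # whole deployment loss


# Example failure causes taken from the table; not an exhaustive risk analysis.
EXAMPLE_FAILURE_CAUSES = {
    "disk failure": FailsafeLevel.LEVEL_1,
    "node failure": FailsafeLevel.LEVEL_1,
    "software bug": FailsafeLevel.LEVEL_1,
    "rack outage": FailsafeLevel.LEVEL_2,
    "small fire": FailsafeLevel.LEVEL_2,
    "power outage": FailsafeLevel.LEVEL_2,
    "fire": FailsafeLevel.LEVEL_3,
    "regional power outage": FailsafeLevel.LEVEL_3,
    "orchestrated cyber attack": FailsafeLevel.LEVEL_3,
    "natural disaster": FailsafeLevel.LEVEL_4,
}


def protects_against(provided: FailsafeLevel, failure_cause: str) -> bool:
    """Assumption: a measure rated for a level also covers all lower levels."""
    return EXAMPLE_FAILURE_CAUSES[failure_cause] <= provided
```

Under these assumptions, a measure rated `FailsafeLevel.LEVEL_2` would cover a rack outage but not a natural disaster.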

Based on our research, no similar standardized classification scheme seems to exist currently.
Something close, but also very detailed, can be found in [this (German) document](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Grundschutz/BSI_Standards/standard_200_3.pdf?__blob=publicationFile&v=2) from the BSI.
As we want to focus on IaaS resources and also want an easily understandable structure that can be applied in standards covering replication, redundancy and backups, the BSI document is too detailed for this purpose.

## Consequences

Using the definition of levels established in this decision record throughout all SCS standards would allow readers to understand up to which level certain procedures or aspects of resources (e.g. volume types or a backend requiring redundancy) would protect their data and/or resource availability.