-
Notifications
You must be signed in to change notification settings - Fork 0
Initial structure. Filled in the Proposal, Caveats and Risks. Put in the CSI spec in the Details section. #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Filled in the Proposal, Caveats and Risks. Put in the CSI spec in the Details section.
- [User Stories)](#user-stories) | ||
- [Full snapshot backup](#full-snapshot-backup) | ||
- [Incremental snapshot backup](#incremental-snapshot-backup) | ||
- [Notes/Constraints/Caveats](#notesconstraintscaveats) | ||
- [Risks and Mitigations](#risks-and-mitigations) | ||
- [Design Details](#design-details) | ||
- [The SnapshotMetadata Service API](#the-snapshotmetadata-service-api) | ||
- [Kubernetes Components](#kubernetes-components) | ||
- [Custom Resources](#custom-resources) | ||
- [CSISnapshotSessionAccess](#csisnapshotsessionaccess) | ||
- [CSISnapshotSessionService](#csisnapshotsessionservice) | ||
- [CSISnapshotSessionData](#csisnapshotsessiondata) | ||
- [The Snapshot Session Manager](#the-snapshot-session-manager) | ||
- [The External Snapshot Session Sidecar](#the-external-snapshot-session-sidecar) | ||
- [The SP Snapshot Session Service](#the-sp-snapshot-session-service) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put some sections in so I could create cross references at the start, and also updated the TOC here.
> The volume could be attached to a pod with either `Block` or `Filesystem` | ||
[volume modes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#volume-mode). | ||
* Provide an API to retrieve the data blocks of a snapshot. | ||
> The **snapshot access session** mechanism proposed here could conceivably |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am using the phrase snapshot access session consistently to describe the mechanism, and not "session" or "snapshot session".
However I have used just "Snapshot Session Manager" and "snapshot session sidecar".
This is a bit inconsistent. Should I just be using "snapshot session"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"snapshot session" sounds good to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@TODO Prasad | ||
|
||
### Notes/Constraints/Caveats (Optional) | ||
### Notes/Constraints/Caveats |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could think of the caveats below. Please add/suggest additional.
Consider including folks who also work outside the SIG or subproject. | ||
--> | ||
The main vulnerabilities of this proposal are: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've come to realize that this is a very important section and the security model is explained here as the mitigation mechanism.
- [Story 1](#story-1) | ||
- [Story 2](#story-2) | ||
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) | ||
- [User Stories)](#user-stories) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- [User Stories)](#user-stories) | |
- [User Stories](#user-stories) |
[nit] typo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the hack/update-toc.sh
script to update the TOC, per instructions within the KEP:
Ensure the TOC is wrapped with
<!-- toc --&rt;<!-- /toc --&rt;
tags, and then generate withhack/update-toc.sh
.
> The volume could be attached to a pod with either `Block` or `Filesystem` | ||
[volume modes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#volume-mode). | ||
* Provide an API to retrieve the data blocks of a snapshot. | ||
> The **snapshot access session** mechanism proposed here could conceivably |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"snapshot session" sounds good to me
or the sizes of the volumes and snapshots involved. | ||
|
||
A Kubernetes backup application establishes a snapshot access session by | ||
creating an instance of a [CSISnapshotSessionAccess](#csisnapshotsessionaccess) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
creating an instance of a [CSISnapshotSessionAccess](#csisnapshotsessionaccess) | |
creating an instance of a [SnapshotSessionAccess](#snapshotsessionaccess) |
We debated whether to include CSI*
prefix to the CRs. To keep is consistent with VolumeSnapshots
resources, I think we should not include the prefix in the CRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As previously, I recommend naming these CRDs following existing patterns that the community is already familiar with, to help clarify the intention. E.g.,
CsiSnapshotSessionAccess
-> SnapshotSessionRequest
- both K8s and CSI has similar resources named TokenRequest
CsiSnapshotSessionService
-> SnapshotServiceConfiguration
- there was a comment on the previous PR to use the Configuration
suffix similar to MutatingWebhookConfiguration
, ValidatingWebhookConfiguration
etc.
Any reasons why we want to deviate from these? So far, I haven't heard any agreement or disagreement on this idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
ensure that the invoker has access rights to the VolumeSnapshot | ||
objects specified in the payload. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ensure that the invoker has access rights to the VolumeSnapshot | |
objects specified in the payload. | |
ensure that the invoker has access rights to the VolumeSnapshot | |
resource. |
We cannot check the access permission on specific objects. It can be done at the resource level only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing to "access rights to the VolumeSnapshot objects in the Namespace".
- The **CSISnapshotSessionServiceClusterRole** should be used in a | ||
RoleBinding to grant the ServiceAccount used by the | ||
[external-snapshot-session-sidecar](#the-external-snapshot-session-sidecar) | ||
of the CSI driver all access in the CSI driver namespace only. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also mention about access to which resources? Also, since we decided to make SnapshotSessionData
CR is cluster scoped resource, we need to create ClusterRoleBinding
instead of RoleBinding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SnapshotSessionData CR has to remain namespaced - the SnapshotSessionService CR can become global.
Would we need a new role for the SnapshotSessionService? Maybe we can leave it namespaced to avoid adding another role?
The RPC calls each return a gRPC stream through which the metadata can be recovered. | ||
|
||
The CSI driver is not involved in the setup or management of the snapshot access session. | ||
The TCP endpoint returned is actually directed to a community provided |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
community provided
To me, "community" means both the maintainers and users. If this is part of CSI, I'd rephrase it as "directed to the CSI external-snapshot-session sidecar".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to "directed to the external-snapshot-session sidecar". Not sure if "CSI" is applicable here as it is a K8s implementation artifact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a K8s "core" artifact. AIUI, all existing external-*
sidecars are part of CSI because they all lived in the kubernetes-csi GH org.
The CSI driver provided service only focuses on the generation of the metadata | ||
requested. | ||
|
||
The [CSISnapshotSessionAccess](#csisnapshotsessionaccess) CR is animated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "animated" mean in this context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think puppet to puppet-master! :-)
service permits metadata to be returned in either an ***extent-based*** | ||
format or a ***block*** based format, at the discretion of the CSI driver. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at the discretion of the CSI driver.
Does this imply built-in intelligence on the CSI driver to determine which format is most appropriate? Is this format also pre-specified by the user? Feels like a vendor-specific implementation.
@PrasadG193 what do you think of updating the user stories section with the "extent-based" vs. "block-based" scenario i.e. how do they benefit the users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No intelligence implied. I think I've made it clear that it is vendor-specific: at the discretion of the CSI driver
.
This came from existing implementations: VMware uses extents, AWS uses blocks. Previous iterations of the KEP had the same approach.
- The CSI driver's [Snapshot Session Service](#the-sp-snapshot-session-service) | ||
must be capable of serving metadata on a VolumeSnapshot | ||
concurrently with the backup application's use of a PersistentVolume | ||
created on that same VolumeSnapshot. | ||
This is because the backup application has to mount the PersistentVolume | ||
in Block mode in a Pod in order to read and archive the raw snapshot data blocks, | ||
and this read/archive loop will be driven by the stream of snapshot block metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this one. I thought this design shouldn't be influenced by how and when the backup application's data mover pod mounts, reads and access the PV?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I softened the language slightly: "... the backup application would likely mount ...".
This point is necessary, IMO, because we need to ensure that such concurrency is possible. I agree we don't need to mandate it.
to implement the necessary security as illustrated in the following figure: | ||
|
||
<!-- Need to update the figure to show non-namespaced CSISnapshotSessionService CR --> | ||
 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay - I'll add that to the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
of VolumeSnapshots. It will then | ||
search for a [SnapshotMetadata](#the-snapshotmetadata-service-api) service | ||
in the CSI driver for these VolumeSnapshots. | ||
On success, the TCP endpoint and CA certificate of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the CA cert be stored in a cluster-scoped resource that both the backup application and service have access to? This ensures that during cert rotation, there is only one source of truth for the CA cert, instead of multiple ones embedded in CRs floating around in the cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as the CA cert is rotated before the session expiry time then there is no problem. The manager does not/should not cache data.
Mutual authentication, however, is not performed, as to do so would require the | ||
server to trust the certificate authorities used by its clients and no | ||
obvious mechanism exists for this purpose. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The client can present a TLS cert signed with the server's CA trust found in the CR. The server will only trust clients that can present client-side certs signed with its CA trust. However, this can get complicated because the client will need access to an cert signer/issuer somewhere.
Instead of saying "no obvious mechanism exists", consider re-phrasing it to something like "MA can be done by requiring the client to produce a cert signed by the server's CA. But a bit complicated due to dependency on an cert signer/issuer - to be re-evaluated for future versions".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the first para: nothing gained by the suggestion as encryption is already present and the sever knows its own certificate.
On the second para, I agree it needs some rewording. I want to convey that there is no "blessed"/"prescribed"/"standardized" K8s way of getting a client certificate that services can trust. Maybe the SIG security team will provide feedback on this.
I will change to use "no standardized mechanism" for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea behind the 1st paragraph is that the server will not accept any request from client that either can't present a client-side TLS cert, or that its TLS cert is signed by an unrecognized CA trust. It's used more for identity verification. My point is that MA is possible, but not easy and will make our design more complicated.
I will change to use "no standardized mechanism" for now.
That works.
the permissions it needs to access all the | ||
[custom resources defined by this proposal](#custom-resources). | ||
|
||
It is recommended that the security design be reviewed by SIG Security. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this will need to happen.
### Kubernetes Components | ||
The following Kubernetes components are involved at runtime: | ||
|
||
- A community provided |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, to me "community" means both maintainers and users. If it's maintained by CSI, then it should be just "The CSI Snapshot Session Manager".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not maintained by "CSI": it is part of the K8s implementation of CSI.
I think your interpretation of community is spot-on: all K8s stuff is community provided at this point! This phrase was used by Ben and there were no objections to it in the WG or CSI community meeting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean - thanks. I just assumed that the k8s implementation of CSI is also part of CSI since they all lived in the kubernetes-csi GH repo.
- [Story 1](#story-1) | ||
- [Story 2](#story-2) | ||
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) | ||
- [User Stories)](#user-stories) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the hack/update-toc.sh
script to update the TOC, per instructions within the KEP:
Ensure the TOC is wrapped with
<!-- toc --&rt;<!-- /toc --&rt;
tags, and then generate withhack/update-toc.sh
.
Wow! I hadn't seen that. Cool! We can delay the generation TOC for when we're done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks for working on this 👍.
Initial structure. Filled in the Proposal, Caveats and Risks. Put in the CSI spec in the Details section.
No description provided.