Skip to content

Conversation

carlbraganza
Copy link
Collaborator

No description provided.

Filled in the Proposal, Caveats and Risks.
Put in the CSI spec in the Details section.
Comment on lines 86 to 100
- [User Stories)](#user-stories)
- [Full snapshot backup](#full-snapshot-backup)
- [Incremental snapshot backup](#incremental-snapshot-backup)
- [Notes/Constraints/Caveats](#notesconstraintscaveats)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [The SnapshotMetadata Service API](#the-snapshotmetadata-service-api)
- [Kubernetes Components](#kubernetes-components)
- [Custom Resources](#custom-resources)
- [CSISnapshotSessionAccess](#csisnapshotsessionaccess)
- [CSISnapshotSessionService](#csisnapshotsessionservice)
- [CSISnapshotSessionData](#csisnapshotsessiondata)
- [The Snapshot Session Manager](#the-snapshot-session-manager)
- [The External Snapshot Session Sidecar](#the-external-snapshot-session-sidecar)
- [The SP Snapshot Session Service](#the-sp-snapshot-session-service)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put some sections in so I could create cross references at the start, and also updated the TOC here.

> The volume could be attached to a pod with either `Block` or `Filesystem`
[volume modes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#volume-mode).
* Provide an API to retrieve the data blocks of a snapshot.
> The **snapshot access session** mechanism proposed here could conceivably
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using the phrase snapshot access session consistently to describe the mechanism, and not "session" or "snapshot session".

However I have used just "Snapshot Session Manager" and "snapshot session sidecar".
This is a bit inconsistent. Should I just be using "snapshot session"?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"snapshot session" sounds good to me

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@TODO Prasad

### Notes/Constraints/Caveats (Optional)
### Notes/Constraints/Caveats
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could think of the caveats below. Please add/suggest additional.

Consider including folks who also work outside the SIG or subproject.
-->
The main vulnerabilities of this proposal are:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've come to realize that this is a very important section and the security model is explained here as the mitigation mechanism.

- [Story 1](#story-1)
- [Story 2](#story-2)
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [User Stories)](#user-stories)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- [User Stories)](#user-stories)
- [User Stories](#user-stories)

[nit] typo

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the hack/update-toc.sh script to update the TOC, per instructions within the KEP:

Ensure the TOC is wrapped with
<!-- toc --&rt;<!-- /toc --&rt;
tags, and then generate with hack/update-toc.sh.

> The volume could be attached to a pod with either `Block` or `Filesystem`
[volume modes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#volume-mode).
* Provide an API to retrieve the data blocks of a snapshot.
> The **snapshot access session** mechanism proposed here could conceivably
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"snapshot session" sounds good to me

or the sizes of the volumes and snapshots involved.

A Kubernetes backup application establishes a snapshot access session by
creating an instance of a [CSISnapshotSessionAccess](#csisnapshotsessionaccess)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
creating an instance of a [CSISnapshotSessionAccess](#csisnapshotsessionaccess)
creating an instance of a [SnapshotSessionAccess](#snapshotsessionaccess)

We debated whether to include CSI* prefix to the CRs. To keep is consistent with VolumeSnapshots resources, I think we should not include the prefix in the CRs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As previously, I recommend naming these CRDs following existing patterns that the community is already familiar with, to help clarify the intention. E.g.,

CsiSnapshotSessionAccess -> SnapshotSessionRequest - both K8s and CSI has similar resources named TokenRequest
CsiSnapshotSessionService -> SnapshotServiceConfiguration - there was a comment on the previous PR to use the Configuration suffix similar to MutatingWebhookConfiguration, ValidatingWebhookConfiguration etc.

Any reasons why we want to deviate from these? So far, I haven't heard any agreement or disagreement on this idea.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

Comment on lines 420 to 421
ensure that the invoker has access rights to the VolumeSnapshot
objects specified in the payload.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ensure that the invoker has access rights to the VolumeSnapshot
objects specified in the payload.
ensure that the invoker has access rights to the VolumeSnapshot
resource.

We cannot check the access permission on specific objects. It can be done at the resource level only.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing to "access rights to the VolumeSnapshot objects in the Namespace".

- The **CSISnapshotSessionServiceClusterRole** should be used in a
RoleBinding to grant the ServiceAccount used by the
[external-snapshot-session-sidecar](#the-external-snapshot-session-sidecar)
of the CSI driver all access in the CSI driver namespace only.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also mention about access to which resources? Also, since we decided to make SnapshotSessionData CR is cluster scoped resource, we need to create ClusterRoleBinding instead of RoleBinding

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SnapshotSessionData CR has to remain namespaced - the SnapshotSessionService CR can become global.

Would we need a new role for the SnapshotSessionService? Maybe we can leave it namespaced to avoid adding another role?

The RPC calls each return a gRPC stream through which the metadata can be recovered.

The CSI driver is not involved in the setup or management of the snapshot access session.
The TCP endpoint returned is actually directed to a community provided
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

community provided

To me, "community" means both the maintainers and users. If this is part of CSI, I'd rephrase it as "directed to the CSI external-snapshot-session sidecar".

Copy link
Collaborator Author

@carlbraganza carlbraganza Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to "directed to the external-snapshot-session sidecar". Not sure if "CSI" is applicable here as it is a K8s implementation artifact.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a K8s "core" artifact. AIUI, all existing external-* sidecars are part of CSI because they all lived in the kubernetes-csi GH org.

The CSI driver provided service only focuses on the generation of the metadata
requested.

The [CSISnapshotSessionAccess](#csisnapshotsessionaccess) CR is animated
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "animated" mean in this context?

Copy link
Collaborator Author

@carlbraganza carlbraganza Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think puppet to puppet-master! :-)

Comment on lines 373 to 374
service permits metadata to be returned in either an ***extent-based***
format or a ***block*** based format, at the discretion of the CSI driver.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at the discretion of the CSI driver.

Does this imply built-in intelligence on the CSI driver to determine which format is most appropriate? Is this format also pre-specified by the user? Feels like a vendor-specific implementation.

@PrasadG193 what do you think of updating the user stories section with the "extent-based" vs. "block-based" scenario i.e. how do they benefit the users?

Copy link
Collaborator Author

@carlbraganza carlbraganza Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No intelligence implied. I think I've made it clear that it is vendor-specific: at the discretion of the CSI driver.

This came from existing implementations: VMware uses extents, AWS uses blocks. Previous iterations of the KEP had the same approach.

Comment on lines 383 to 389
- The CSI driver's [Snapshot Session Service](#the-sp-snapshot-session-service)
must be capable of serving metadata on a VolumeSnapshot
concurrently with the backup application's use of a PersistentVolume
created on that same VolumeSnapshot.
This is because the backup application has to mount the PersistentVolume
in Block mode in a Pod in order to read and archive the raw snapshot data blocks,
and this read/archive loop will be driven by the stream of snapshot block metadata.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this one. I thought this design shouldn't be influenced by how and when the backup application's data mover pod mounts, reads and access the PV?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I softened the language slightly: "... the backup application would likely mount ...".

This point is necessary, IMO, because we need to ensure that such concurrency is possible. I agree we don't need to mandate it.

to implement the necessary security as illustrated in the following figure:

<!-- Need to update the figure to show non-namespaced CSISnapshotSessionService CR -->
![CSI Snapshot Session Roles](./cbt-kep-multi-csi-roles.drawio.svg)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using a white background. The transparent background doesn't work well with GH dark theme.

image

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay - I'll add that to the comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

of VolumeSnapshots. It will then
search for a [SnapshotMetadata](#the-snapshotmetadata-service-api) service
in the CSI driver for these VolumeSnapshots.
On success, the TCP endpoint and CA certificate of the
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the CA cert be stored in a cluster-scoped resource that both the backup application and service have access to? This ensures that during cert rotation, there is only one source of truth for the CA cert, instead of multiple ones embedded in CRs floating around in the cluster.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as the CA cert is rotated before the session expiry time then there is no problem. The manager does not/should not cache data.

Comment on lines 441 to 443
Mutual authentication, however, is not performed, as to do so would require the
server to trust the certificate authorities used by its clients and no
obvious mechanism exists for this purpose.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The client can present a TLS cert signed with the server's CA trust found in the CR. The server will only trust clients that can present client-side certs signed with its CA trust. However, this can get complicated because the client will need access to an cert signer/issuer somewhere.

Instead of saying "no obvious mechanism exists", consider re-phrasing it to something like "MA can be done by requiring the client to produce a cert signed by the server's CA. But a bit complicated due to dependency on an cert signer/issuer - to be re-evaluated for future versions".

Copy link
Collaborator Author

@carlbraganza carlbraganza Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the first para: nothing gained by the suggestion as encryption is already present and the sever knows its own certificate.

On the second para, I agree it needs some rewording. I want to convey that there is no "blessed"/"prescribed"/"standardized" K8s way of getting a client certificate that services can trust. Maybe the SIG security team will provide feedback on this.

I will change to use "no standardized mechanism" for now.

Copy link
Owner

@ihcsim ihcsim Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea behind the 1st paragraph is that the server will not accept any request from client that either can't present a client-side TLS cert, or that its TLS cert is signed by an unrecognized CA trust. It's used more for identity verification. My point is that MA is possible, but not easy and will make our design more complicated.

I will change to use "no standardized mechanism" for now.

That works.

the permissions it needs to access all the
[custom resources defined by this proposal](#custom-resources).

It is recommended that the security design be reviewed by SIG Security.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this will need to happen.

### Kubernetes Components
The following Kubernetes components are involved at runtime:

- A community provided
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, to me "community" means both maintainers and users. If it's maintained by CSI, then it should be just "The CSI Snapshot Session Manager".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not maintained by "CSI": it is part of the K8s implementation of CSI.

I think your interpretation of community is spot-on: all K8s stuff is community provided at this point! This phrase was used by Ben and there were no objections to it in the WG or CSI community meeting.

Copy link
Owner

@ihcsim ihcsim Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean - thanks. I just assumed that the k8s implementation of CSI is also part of CSI since they all lived in the kubernetes-csi GH repo.

- [Story 1](#story-1)
- [Story 2](#story-2)
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [User Stories)](#user-stories)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the hack/update-toc.sh script to update the TOC, per instructions within the KEP:

Ensure the TOC is wrapped with
<!-- toc --&rt;<!-- /toc --&rt;
tags, and then generate with hack/update-toc.sh.

@carlbraganza
Copy link
Collaborator Author

Use the hack/update-toc.sh script to update the TOC, per instructions within the KEP:

Wow! I hadn't seen that. Cool! We can delay the generation TOC for when we're done.

Copy link
Owner

@ihcsim ihcsim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks for working on this 👍.

@carlbraganza carlbraganza merged commit ffe8562 into csi-cbt-kep Jun 29, 2023
carlbraganza added a commit that referenced this pull request Jul 3, 2023
Initial structure. Filled in the Proposal, Caveats and Risks. Put in the CSI spec in the Details section.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants