
Conversation

@georgeharley

This relates to #239

Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed
Note to contributors: remember to re-generate client set if there are any API changes

Summary Of Changes

We've built a version of the topology operator from these draft changes and carried out some spike testing to verify that it is providing the desired behaviour. A quick inspection of the code will reveal that some tactical decisions were required in order to pull things together.

  • for the operator to get the Vault server address the VAULT_ADDR environment variable is set on the container spec
  • the operator's Vault role is currently hard coded
  • without a secret binding it was necessary to find another way to determine the name of the cluster's K8s service. We have gone with using the same name as the cluster, as the two appear to match. Not sure if that is a safe assumption to make
  • as per earlier issue comments, the Vault path is coming from the cluster's secretBackend.vault.defaultUserPath
  • the Vault auth path and namespace used by the operator are coming from annotations on the cluster's secretBackend.vault (vault.hashicorp.com/auth-path and vault.hashicorp.com/namespace respectively)
  • currently the operator connects to Vault and fetches the required credentials on every run of the reconcile logic. We weren't sure if there was an optimisation opportunity here: creating the Vault connection once on operator startup and then re-using the same connection as necessary. Not that there's anything particularly slow about the performance we are seeing
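The caching idea in the last bullet can be sketched with a `sync.Once` guard, so the client is built on the first reconcile and reused afterwards. This is a minimal illustration only; `secretStoreClient` and `getClient` are hypothetical names, not the operator's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// secretStoreClient stands in for a Vault-backed client; the type and
// function names here are illustrative only.
type secretStoreClient struct {
	addr string
}

var (
	clientOnce   sync.Once
	cachedClient *secretStoreClient
)

// getClient builds the client exactly once (on the first reconcile) and
// returns the same cached instance on every subsequent call, instead of
// dialling Vault on each run of the reconcile logic.
func getClient(addr string) *secretStoreClient {
	clientOnce.Do(func() {
		cachedClient = &secretStoreClient{addr: addr}
	})
	return cachedClient
}

func main() {
	a := getClient("https://vault.example:8200")
	b := getClient("ignored-after-first-call")
	fmt.Println(a == b) // prints true: the same instance is reused
}
```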


@MarcialRosales MarcialRosales left a comment


The installation instructions would need to be very explicit with regard to setting up the VAULT_ADDR environment variable in the operator's manifest. This is the kind of installation step people miss. However, if someone missed it, the messaging-topology-operator's logs will clearly indicate that the operator is trying to access Vault on the loopback URL.

We discussed internally whether there could be a way of figuring out Vault's address without having to manually configure it via an environment variable. It was certainly not trivial, and for now we consider your approach good enough.

// if role == "" {
// return nil, errors.New("no role value set in Vault secret backend")
// }
role := "messaging-topology-operator"


We discussed internally the pros and cons of using a dedicated role for this operator versus using the role defined in the cluster definition (rabbitmqv1beta1.VaultSpec.Role), and the conclusion is that what you did, i.e. a dedicated role, is the best solution.

Just for reference here was our analysis:

messaging topology operator having its own role:

Pros:

  • If we ever needed to prevent the messaging-topology-operator from operating on any cluster we would only need to remove its role from Vault
  • It allows us to grant different permissions to the cluster-operator and messaging-topology-operator. For instance, the cluster-operator needs to request private key and certificate in addition to looking up secrets whereas the messaging-topology-operator only needs to lookup secrets.

Cons:

  • If we wanted fine-grained control over which clusters the messaging-topology-operator can operate on, we would have to modify the policy associated with the role so that it includes/excludes the secret paths the operator can access

messaging topology operator using the cluster's role defined in `secretBackend.vault.role`

Pros:

  • We can easily configure which clusters the messaging-topology-operator can operate on when we declare the role. For instance, in the example below, we allow the messaging-topology-operator to operate on the cluster rabbitmq-1 simply by adding messaging-topology-operator to the bound_service_account_names list.
vault_exec "vault write auth/kubernetes/role/rabbitmq-1-role \
 bound_service_account_names=rabbitmq-1,messaging-topology-operator \
 bound_service_account_namespaces=$RABBITMQ_NAMESPACE \
 policies=rabbitmq-policy ttl=24h"

Cons:

  • If we ever wanted to prevent the messaging-topology-operator from operating on any cluster, we would have to go over all the roles and remove its SA from bound_service_account_names. Whereas if we used a dedicated role for the operator, it is just a matter of removing the role, as I mentioned earlier.

// use the Vault token for making all future calls to Vault
vaultClient.SetToken(vaultToken)

return VaultClient{Reader: &VaultSecretReader{client: vaultClient}}, nil


Have you considered caching the VaultClient or the vaultToken? I would imagine authentication is more expensive compared to looking up a secret's path. I guess we could cache the vaultToken, or the actual initialized VaultClient, and reuse it over and over again until the token expires.

I am concerned with the messaging-topology-operator overloading Vault if it has to reconcile way too many resources.

This does not need to be done now. But it is worth considering for future releases.

Contributor


I would second what @MarcialRosales wrote: since applications will typically create hundreds to thousands of queues and bindings, it is well worth caching the Vault session.

Author


As a strawman, I've made a change to vault_reader.go so that after the initial login to Vault the returned token is periodically renewed in a background thread. The maximum TTL of the returned token is also respected, so that when it is reached a new Vault login is triggered to retrieve a fresh token for subsequent renewals. Would you be happy with this kind of approach?
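The renewal schedule described above can be modelled as follows. This is a toy sketch with purely illustrative TTL numbers; the real code renews based on the lease returned by Vault:

```go
package main

import "fmt"

// renewUntilMaxTTL models the background renewal loop: the token is renewed
// once per lease period until the token's maximum TTL is reached, at which
// point a fresh login is needed. TTLs are in seconds and purely illustrative.
func renewUntilMaxTTL(leaseTTL, maxTTL int) (renewals int) {
	for elapsed := leaseTTL; elapsed < maxTTL; elapsed += leaseTTL {
		renewals++ // renew the existing token in the background thread
	}
	return renewals // after this, a new Vault login retrieves a fresh token
}

func main() {
	// A token with a 60s lease and a 300s max TTL is renewed 4 times
	// before a fresh login is required.
	fmt.Println(renewUntilMaxTTL(60, 300)) // prints 4
}
```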


@georgeharley Sorry I did not reply directly to your question. Yes, I am happy with your approach, and in fact it adheres to the Vault Go client library's recommendation. I am happy to leave the TTL up to what the server says rather than providing one.

@georgeharley georgeharley force-pushed the 239-vault-cluster-authentication branch from a4bef88 to e60c9eb Compare November 14, 2021 15:03
return
}

_, err = login(vaultClient, vaultSpec)


This is a minor issue: when the client is first initialized, it looks like we are logging in twice. Once from the InitializeClient() function, so that we can fail fast if the credentials are wrong before scheduling the credentials-renewal goroutine. And then again, within the renewal goroutine.

If we wanted to fix it, we would have to pass the first vaultLoginResp on to the renewToken function and keep an index in the for loop that tells us whether this is the first iteration and hence whether to use the vaultLoginResp from the parameter list, or instead ignore it and log in again. The code would certainly not be as clean as the current one.

Author


Well spotted, and apologies for not mentioning it earlier as it was something that caused me some concern. You are correct that on startup there are two Vault logins in very quick succession, so we end up with a log that contains something like the following:

{"level":"info","ts":1637150459.7343986,"msg":"Authenticating to Vault"}
{"level":"info","ts":1637150459.746867,"msg":"Authenticating to Vault"}

To avoid this I also looked at keeping the login() call out of InitializeClient() altogether and only carrying it out in the renewal thread, but that turned out worse IMHO since it created a tiny window of time where the Vault client was trying to carry out its work without a valid Vault token, and the log output contained a smattering of "unable to retrieve credentials" messages, which looked worse. In the interests of getting a strawman in front of you sooner rather than later I opted for the former approach, since it is not as messy in the logs and only happens on the first reconciliation after startup. I will see if I can take another pass at this.

Author


@MarcialRosales I've just pushed a small change to the code which stops the double logins occurring on startup. The InitializeClient() function no longer attempts a Vault login but instead waits for the renewToken() function to notify it of the outcome of its login attempt. The notification comes via a buffered channel passed to renewToken().

renewToken() only writes to the channel on the first login attempt. If that first attempt fails then it seemed like a good opportunity to fail fast and bail out of proceeding with the whole token renewal loop.

Have tried this on a couple of K8s clusters and now only see the one "Authenticating to Vault" log message on startup. In the unhappy scenario where the initial Vault login attempt fails, the "Lifecycle management of Vault token will not be carried out" message gets logged and all subsequent reconcile attempts fail with the Vault error.


@MarcialRosales MarcialRosales left a comment


It would have been much nicer if vault.Client were an interface rather than a struct. That way we could have mocked it rather than mocking/replacing implementation methods such as ReadVaultClientSecretFunc or LoginToVaultFunc.
But I am afraid it is what we have.

@georgeharley georgeharley force-pushed the 239-vault-cluster-authentication branch from 09d63f9 to 65ba095 Compare November 24, 2021 14:30
}

fakeSecretStoreClient = &internalfakes.FakeSecretStoreClient{}
fakeSecretStoreClient.ReadCredentialsReturns(fakeCredentialsProvider, nil)


@georgeharley When I run make unit-tests this test fails because fakeCredentialsProvider is nil at this line, which means the fakeSecretStoreClient will return a nil CredentialsProvider and the test fails. This BeforeEach block is invoked before the JustBeforeEach block of the Describe.
The behaviour of JustBeforeEach is explained here. It has also trapped me a few times.

One easy way to fix it is by moving the initialization of the fakeCredentialsProvider to the BeforeEach of this test case.

Author


When make unit-tests is run in my environment there don't appear to be any failing tests. Could be something weird in my setup, or else the randomising of the test specs is somehow always falling in my favour? FWIW I'm using Go version 1.17.2, Ginkgo version 1.16.5, and GNU Make version 3.81, all running on macOS Monterey 12.0.1.

I'll make the suggested fix to ensure the outcome is predictable.

Author


Test updated.

}

// Reduce load on Vault server in a problem situation where repeated login attempts may be made
time.Sleep(2 * time.Second)

@MarcialRosales MarcialRosales Nov 25, 2021


In the future, this could be replaced by an exponential backoff delay. I'd release it just as it is and, if necessary, change it to an exponential backoff delay later.

@MarcialRosales MarcialRosales self-requested a review November 26, 2021 09:23

@MarcialRosales MarcialRosales left a comment


@georgeharley thanks for implementing this feature in the messaging-topology-operator. We appreciate it a lot.
You have implemented all the changes very quickly.

I would say we are ready to merge it and we can make further improvements when necessary.

@georgeharley
Author

Thank you for all of your careful feedback on this @MarcialRosales, it is greatly appreciated and has really improved things.

I have just pushed a further small change which relates to the Vault role that the operator uses when logging in to the remote Vault server. The change is to look for the role name in an environment variable called OPERATOR_VAULT_ROLE set in the operator container specification. This proposed environment variable would be a peer of the existing OPERATOR_NAMESPACE and VAULT_ADDR variables (the change sticks with the OPERATOR_* prefix). Currently, if OPERATOR_VAULT_ROLE is not set, the code falls back to using a default role of "messaging-topology-operator", with the log messages clarifying the actual role being used for each login attempt.

@georgeharley
Author

If the latest proposal is deemed acceptable then I would be very happy to contribute to the operator installation documentation regarding the Vault support.

@MarcialRosales MarcialRosales self-requested a review November 30, 2021 13:18
return result
}

func login(vaultClient *vault.Client, vaultSpec *rabbitmqv1beta1.VaultSpec) (*vault.Secret, error) {


@georgeharley I think I might have found a problem.

We do not want to log in to the Vault server per reconciliation request, hence we log in once and reuse Vault's authentication token. So far so good, but the problem is that to log in to Vault we first initialise the Vault client with two settings that come from the VaultSpec of the first cluster the topology operator has to interact with.

I am not sure whether, in a production environment, all RabbitMQ clusters are set up with the same vault.hashicorp.com/namespace and vault.hashicorp.com/auth-path annotations. If they all shared the same annotations, we could use the same Vault client for all clusters. Otherwise, we cannot.

Sorry for spotting it too late.

I detected this problem while doing a refactoring of your vault_reader. I was going to suggest a change, but I refrained from asking you to make refactoring changes after all the work you have put into it. Anyway, the refactorings I was doing were these two:

  • Avoid mocking implementation methods. Instead, mock up collaborator(s) such as Vault client
  • Reduce the amount of global variable/methods and singleton. While removing the singleton I found out the issue I described above.



@MarcialRosales @georgeharley
Interesting nuance... all part of the journey. If I may, I'll try to reiterate the problem and make a suggestion. Apologies in advance for the over-explanation, just confirming my understanding.

  1. We have two steps in play for Vault integration. Vault Auth vs Vault credential access.
  2. Vault Auth, utilizing the k8s backend, uses the following components:
    • Vault Address: Established via topology operator environment variable.
    • SA Token: Currently pulled from the topology-operator container.
    • Auth Role: Currently defaults to messaging-topology-operator.
    • Auth Path: Currently pulled from rabbit cluster CR.
    • Vault Namespace: Currently pulled from rabbit cluster CR.
  3. Once auth'ed, the client will request and be granted access to credentials (path sourced from the CR) based on policies setup on the Auth Role.
  4. The Vault Namespace and Auth Path on the CR is related to the Auth Role & SA used for the Rabbit Cluster rather than the Topology Operator.

MVP Suggestion:

It would seem as if we're close to a configuration where the Topology Operator would be able to cache the Vault token. The changes needed would be to transition the Auth Path & Vault Namespace components from being CR-sourced to being specific to the Auth Role, i.e. environment variables such as OPERATOR_AUTH_PATH and VAULT_NAMESPACE.

MVP Assumptions:

  • Same Vault endpoint is used by all clusters.
  • The same Vault Namespace, or no namespace, is used by all clusters.
  • The Topology Operator role will have access to all of the credential paths.

Future Considerations

Not sure if this would be appropriate, however...

Have the Topology Operator leverage all of the components of the Rabbit Cluster to login to Vault. This would require the Topology Operator to pull the token from the SA secret and use it along with the components in the CR. This would be more of an "assume role" type of pattern and remove the need to manage the Topology Operator Role. This would get a bit more involved as we may then find the need to cache the Vault Token per Rabbit cluster?


@MarcialRosales MarcialRosales Dec 2, 2021


@twhite0, I really appreciate your clear explanation!

Your MVP suggestion and assumptions make a lot of sense, especially for the MVP. The change we have to make is minimal. As you summarized it, we only need to source the auth path and Vault namespace from environment variables, alongside the role's environment variable. It makes a lot of sense and I think it is more correct.

Using a dedicated role+auth_path+namespace settings for the topology operator has advantages and disadvantages.
The advantage is that you have a separate vault access control policy for the cluster operator and for the messaging topology operator. The disadvantage is the extra administrative burden in Vault.

Your future consideration is really what I had in mind from the beginning. It seemed very logical: a Vault administrator sets the appropriate role, auth path, and namespace for a cluster, and this configuration is used by both the cluster and topology operators.
The disadvantage is that the access control is all-or-nothing per cluster, i.e. either both operators can operate on a cluster or neither can. And, as you pointed out, the messaging topology operator would have to cache a Vault token per cluster. The topology operator would have to use a bounded cache with max-size and TTL capabilities to limit the number of tokens cached.

If both strategies made sense, we could configure the topology operator with an additional setting, vault_auth_strategy, with two values: assume_role, where the topology operator assumes all the auth settings from the cluster, or dedicated_role, where the topology operator uses its own dedicated auth settings.

@georgeharley
Author

@MarcialRosales @twhite0 For your consideration I've just pushed a small change to this PR branch so that the Vault login step looks for the Vault namespace and auth path values in environment variables (OPERATOR_VAULT_NAMESPACE and OPERATOR_VAULT_AUTH_PATH respectively) rather than in annotations on the current cluster spec. Effectively, this provides the MVP / dedicated_role strategy described above.

A version of the operator built with these changes in place has behaved as expected when run alongside two separate RabbitMQ clusters. The Vault configuration changes required when the second RabbitMQ cluster was added amounted to the following steps:

  • vault kv put of the new credentials
  • vault policy write creation of a new policy for read access to the credentials
  • vault write update the existing operator role to bind it with the new Vault policy as well as the new cluster's k8s service account and namespace

I would be happy to provide more detailed configuration information or maybe examples as part of the operator documentation if this is the agreed direction.

@twhite0

twhite0 commented Jan 3, 2022

Happy New Year all!

@yaronp68 & @MarcialRosales: Would like to gently bump this Draft PR to identify next steps for acceptance.

@MarcialRosales

Happy New Year all !
@twhite0 Absolutely, we will do so very shortly. Just returned from a long vacation period.


@MarcialRosales MarcialRosales left a comment


Thanks a lot @georgeharley for making this last code change. We can add the second strategy assume_role when necessary.

The only thing left and given that you have kindly volunteered is to add an example.
I have barely drafted the README.md of the new example below. A suggestion for the example folder could be vault-support.
It is only a suggestion, please feel free to change the structure or even the scope of the test. The intention of the suggested test is to clearly demonstrate how to operate on a vault-enabled cluster in a secure manner (with granted access and without granted access). And also that the operator can operate indistinctly on vault-enabled and k8s-secret enabled clusters.

Here goes the suggested README.md:

HashiCorp Vault Support Example

The RabbitMQ Cluster Operator supports storing RabbitMQ default user (admin) credentials and RabbitMQ server certificates in
HashiCorp Vault as well as in K8s Secrets. Accordingly, the RabbitMQ Messaging Topology Operator can operate on clusters which are vault-enabled or k8s-secret-enabled.

As explained in this KubeCon talk, there are four different approaches in Kubernetes to consume external secrets:

  1. Direct API
  2. Controller that mirrors secrets into K8s
  3. Sidecar + MutatingWebhookConfiguration
  4. Secrets Store CSI Driver

The RabbitMQ Cluster Operator integrates with vault using the 3rd approach (Sidecar + MutatingWebhookConfiguration) whereas
the RabbitMQ Messaging Topology Operator uses the 1st approach (Direct API).

Vault-related configuration required

-> Minimum settings required (operator configuration and/or vault configuration) in order for the
messaging topology operator to be able to operate on any vault-enabled RabbitMQ cluster

-> Optional settings we can set in the messaging topology operator (OPERATOR_VAULT_NAMESPACE,
OPERATOR_VAULT_AUTH_PATH, OPERATOR_VAULT_ROLE) what they are for and their default value.

Usage

-> Here we explain how to run the test example and what it does

-> We need two scripts, similar to how we did it for the cluster operator, one for setting up the
prerequisites of the test example and another for the actual test case

-> Briefly explain that first we need to run the setup.sh script and what prerequisites it fulfils

-> Then explain the scope of the test.sh.
-> The proposed scope is the following:
-> create a queue in vault-enabled cluster where the operator has been granted access to the cluster in Vault
-> fail to create a queue in a vault-enabled cluster where the operator has not yet been granted access to the cluster
-> create a queue in a standard k8s-secret-enabled cluster. This demonstrates that the two security models can coexist
-> (The script has to deploy three clusters)
-> (The script terminates tearing everything down similar to the test.sh in the examples/vault-default-user in the
cluster operator)

George Harley added 12 commits January 5, 2022 17:25
* Vault client initialized at startup and passed to each
  controller
* Each controller passes Vault client in calls to ParseRabbitmqClusterReference()
  and receives a CredentialsProvider from which to pull username and
  password when accessing RabbitMQ client
* Basic fix up of broken tests
* More unit test code required
* Bumped cluster-operator dependency v1.8.3 -> v1.9.0
* WIP
…ator

* Intent is all controllers will be using the same Vault client when
  reading credentials during reconciliation phase
* WIP
* During each controller's reconcile phase the call to
  ParseRabbitmqClusterReference will be responsible for
  creating a Vault connection and reading credentials which
  then get returned to controller in a CredentialsProvider
* ParseRabbitmqClusterReference function will try and fetch
  credentials from Vault if cluster spec contains a Vault secret
  backend, otherwise will fallback to using K8s Secret in the
  cluster namespace
* More unit testing required
* WIP
* allow optional “vault.hashicorp.com/auth-path” to override
  auth path used when client logs into Vault
* allow optional “vault.hashicorp.com/namespace” to guide
  whether or not client should specify a target namespace
  in the remote enterprise Vault server
* speculative use of these annotations so this is all likely
  to change
* WIP
* Add some additional checks on Vault secret returned from
  authenticated read of Vault path
* To be undone when a suitable role injection mechanism is implemented
* A couple of unit tests have been made pending while the above is in place
* when K8s secrets not being used to supply cluster default user
  creds then cluster reader needs to get the cluster's K8s service
  name from somewhere else; based on observation that K8s service
  name appears to be the same as the cluster CR, going to just
  use the cluster CR name for now
* make existing Vault reader unit test less flaky
* WIP
* CredentialsProvider now has a single operation called Data() that
  returns a named credential attribute (e.g. username, password, etc)
  in a []byte
* Makes the interface more extensible
* WIP
* github.com/hashicorp/vault/api v1.1.1 -> v1.3.0
* github.com/onsi/gomega v1.16.0 -> v1.17.0
George Harley added 5 commits January 5, 2022 17:32
* SecretStoreClient created exactly once as part of first reconciliation
* Vault token owned by SecretStoreClient is continually renewed in a
  background thread that manages its lifecycle
* Lifecycle management thread is cognizant of Vault token's max TTL
  (defaults to 32 days) and will force a fresh Vault login when reached
* Vault logins are only made from the renewToken thread
* renewToken notifies main thread when first login attempt
  has been made by means of a buffered channel that is passed
  into it as an argument
* token lifecycle management only commences if the first Vault
  login attempt succeeds
* small delay added between Vault login attempts if there are
  problems renewing a token (e.g. if Vault server cannot be
  reached because of a temporary problem on the network)
* Ensure fakeCredentialsProvider is properly created before
  accessed in Ginkgo It block
* When authenticating to Vault the operator will use the role specified
  by the OPERATOR_VAULT_ROLE environment variable set on the operator
  container spec (a peer of the OPERATOR_NAMESPACE and VAULT_ADDR
  environment variables)
* If env var not set then default Vault role value of
  "messaging-topology-operator" will be used
* Vault login namespace now found from OPERATOR_VAULT_NAMESPACE env var
  in the operator container. If not set then use default Vault namespace
* Vault login auth path now found from OPERATOR_VAULT_AUTH_PATH env var
  in the operator container. If not set then default to using the
  auth path "auth/kubernetes"
* Above means that Vault login step no longer uses any information from
  the current RabbitMQ cluster spec
@MarcialRosales

@twhite0 I am currently adding the OSS docs for the Vault support in the messaging topology operator at https://github.com/rabbitmq/rabbitmq-website/tree/vault-docs-topology-operator

* setup.sh installs Vault to current cluster and configures it
  for the work carried out in the test.sh script
* test.sh goes through a series of steps to demonstrate how the
  operator can use Vault for accessing admin credentials for a
  RabbitMQ cluster
* README.md describes the basic flow of the example as well as
  stating that the RabbitMQ Cluster operator and Messaging Topology
  operators are pre-requisites for this example
@georgeharley georgeharley force-pushed the 239-vault-cluster-authentication branch from b3e8a95 to 09e6360 Compare January 11, 2022 18:54
* Replace sleeps of 30 seconds after the Message Topology operator
  is re-deployed with a function that waits until the operator
  deployment log contains an indication that its initialisation is
  complete and it is ready to handle reconciliations
@georgeharley
Author

@MarcialRosales I have added a strawman example under docs/examples/vault-support as suggested. I think it covers the main objectives you were looking for although only two RabbitMQ clusters are involved. Very happy to follow any recommendations or advice for improving this, including changing the whole test scenario if you think it fails to put across the necessary points.

@MarcialRosales MarcialRosales marked this pull request as ready for review January 13, 2022 07:50
necessary for the operator container to have one or more environment variables
set.

- `VAULT_ADDR` should be set to the URL of the Vault server API


I believe this is the only mandatory setting, isn't it? The other settings are only required if the default value is not applicable.

Author


Yes, that is correct. I will amend the wording in that README to make that clear. Thanks.

Run the [setup.sh](./setup.sh) script to install Vault to the current
Kubernetes cluster. This will also create the necessary credentials,
policy, and configuration in preparation for running the example.


@MarcialRosales MarcialRosales Jan 13, 2022


Suggestion:

  • it creates the credentials for the cluster cluster-vault-a
  • It creates a policy that grants read access to the credentials of the cluster-vault-a
  • It creates the role messaging-topology-operator, which allows the service accounts of the RabbitMQ cluster cluster-vault-a and of the messaging-topology-operator to read the cluster credentials.

# bound_service_account_names values follow the pattern "<RabbitmqCluster name>-server”.
# bound_service_account_namespaces values must include rabbitmq-system (where the messaging topology operator is deployed) and the namespaces where the RabbitmqClusters are deployed.
#vault_exec "vault write auth/kubernetes/role/messaging-topology-operator bound_service_account_names=cluster-vault-a-server,cluster-vault-b-server,vault-tls-server,messaging-topology-operator bound_service_account_namespaces=$RABBITMQ_NAMESPACE,rabbitmq-system policies=cluster-vault-a-policy,cluster-vault-b-policy ttl=24h"
vault_exec "vault write auth/kubernetes/role/messaging-topology-operator bound_service_account_names=cluster-vault-a-server,vault-tls-server,messaging-topology-operator bound_service_account_namespaces=$RABBITMQ_NAMESPACE,rabbitmq-system policies=cluster-vault-a-policy ttl=24h"


I would personally prefer to have different roles: one for the RabbitMQ clusters to access their credentials and a different one for the messaging topology operator. It is especially relevant when we enable TLS in the cluster. In that particular case, the cluster needs write access to the Vault path whereas the messaging topology operator will only ever need read access.

It is more work because we need an extra policy and role defined in Vault and some adjustments to the cluster definitions. But I think it is worth the effort.

Also, I think the vault-tls-server service account is not necessary for this example when we define the role.
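
A minimal sketch of that split, following the vault_exec convention used by setup.sh (the policy names and the secret path below are assumptions for illustration, not taken from the example):

```shell
# Hypothetical read/write policy for the cluster itself (needed e.g. when TLS material is written)
vault_exec "vault policy write cluster-vault-a-rw-policy - <<EOF
path \"secret/data/rabbitmq/cluster-vault-a/*\" {
  capabilities = [\"create\", \"update\", \"read\"]
}
EOF"

# Hypothetical read-only policy for the topology operator
vault_exec "vault policy write cluster-vault-a-ro-policy - <<EOF
path \"secret/data/rabbitmq/cluster-vault-a/*\" {
  capabilities = [\"read\"]
}
EOF"

# Separate roles bind each policy to only the service accounts that need it
vault_exec "vault write auth/kubernetes/role/cluster-vault-a bound_service_account_names=cluster-vault-a-server bound_service_account_namespaces=\$RABBITMQ_NAMESPACE policies=cluster-vault-a-rw-policy ttl=24h"
vault_exec "vault write auth/kubernetes/role/messaging-topology-operator bound_service_account_names=messaging-topology-operator bound_service_account_namespaces=rabbitmq-system policies=cluster-vault-a-ro-policy ttl=24h"
```

With this shape the operator's role never carries write capabilities, so a compromised operator token cannot alter the stored credentials.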

    loopback_users = none
secretBackend:
  vault:
    role: messaging-topology-operator


see comment further down in relation to separating the role for the cluster operator from that of the messaging topology operator
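
For context, the snippet above sits inside the RabbitmqCluster spec. An illustrative fuller fragment might look as follows (the defaultUserPath value and annotation values are assumptions, not taken from the example; the PR summary states the operator reads the auth path and Vault namespace from these annotations):

```yaml
spec:
  rabbitmq:
    additionalConfig: |
      loopback_users = none
  secretBackend:
    vault:
      role: messaging-topology-operator               # Vault role used when authenticating
      defaultUserPath: secret/data/rabbitmq/cluster-vault-a/creds  # assumed path
      annotations:
        vault.hashicorp.com/auth-path: auth/kubernetes  # Vault auth mount path
        vault.hashicorp.com/namespace: ns1              # Vault (not Kubernetes) namespace
```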

# Create a policy that allows reading of the default user credentials for the RabbitMQ cluster cluster-vault-a
# The path must be referenced from the RabbitmqCluster cluster-vault-a CRD spec.secretBackend.vault.defaultUserPath
echo "Creating Vault policy named cluster-vault-a-policy for reading of cluster-vault-a credentials..."
vault_exec "vault policy write cluster-vault-a-policy - <<EOF

@MarcialRosales MarcialRosales Jan 13, 2022


I wonder whether it would be useful to add a section that discusses authorization strategies.
In this example, as it stands, we are suggesting a strategy whereby each cluster gets its own policy and role. It offers a great deal of security but I am not sure it is scalable.

Another approach is to use a single policy and role for a group of clusters, using wildcards in the policy's paths. For instance, have a policy which grants access to the path secret/data/rabbitmq/finance-group/*. A cluster whose name is "orders" would retrieve its credentials from the path secret/data/rabbitmq/finance-group/creds. All the clusters in that group must use the same role, e.g. finance-group-role. Any time we create a cluster, we have to include the cluster's service account in the role's bound_service_account_names attribute.
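
The group-based alternative could be sketched in the same vault_exec style as setup.sh (all names here are hypothetical illustrations of the idea, not part of the example):

```shell
# One shared read policy covering every cluster in the "finance-group" via a wildcard path
vault_exec "vault policy write finance-group-policy - <<EOF
path \"secret/data/rabbitmq/finance-group/*\" {
  capabilities = [\"read\"]
}
EOF"

# One shared role; each newly created cluster's service account must be
# appended to bound_service_account_names.
vault_exec "vault write auth/kubernetes/role/finance-group-role bound_service_account_names=orders-server,messaging-topology-operator bound_service_account_namespaces=\$RABBITMQ_NAMESPACE,rabbitmq-system policies=finance-group-policy ttl=24h"
```

The trade-off is coarser isolation: any cluster in the group can read any other group member's secrets.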


# Update the Messaging Topology operator deployment, providing it with the Vault related environment variables
echo "Updating Messaging Topology operator to work with Vault..."
kubectl set env deployment/messaging-topology-operator -n rabbitmq-system --containers='manager' OPERATOR_VAULT_ROLE=messaging-topology-operator VAULT_ADDR=http://vault.default.svc.cluster.local:8200


Minor issue: VAULT_ADDR assumes we ran setup.sh from the default namespace.
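
One way to avoid the hard-coded namespace would be to parameterise the command, along these lines (the VAULT_K8S_NAMESPACE variable is a hypothetical name introduced for this sketch):

```shell
# Derive VAULT_ADDR from the namespace Vault was deployed in, defaulting to "default"
VAULT_K8S_NAMESPACE=${VAULT_K8S_NAMESPACE:-default}
kubectl set env deployment/messaging-topology-operator -n rabbitmq-system --containers='manager' \
  OPERATOR_VAULT_ROLE=messaging-topology-operator \
  VAULT_ADDR="http://vault.${VAULT_K8S_NAMESPACE}.svc.cluster.local:8200"
```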


- `VAULT_ADDR` should be set to the URL of the Vault server API
- `OPERATOR_VAULT_ROLE` should be set to the name of the Vault role that is used when accessing credentials. This defaults to "messaging-topology-operator"
- `OPERATOR_VAULT_NAMESPACE` may be set to the Vault namespace to use when the Messaging Topology operator is authenticating. If not set then the default Vault namespace is assumed
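
After deploying, the effective values can be checked directly on the running operator container, for example:

```shell
# Print the Vault-related environment variables the operator actually sees
kubectl exec -n rabbitmq-system deploy/messaging-topology-operator -- \
  printenv | grep -E 'VAULT_ADDR|OPERATOR_VAULT'
```

If VAULT_ADDR is missing here, the operator falls back to the loopback URL, which shows up clearly in its logs.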


Minor suggestion: add a link to the Vault namespace documentation so that it is not confused with Kubernetes namespaces.

set.

- `VAULT_ADDR` should be set to the URL of the Vault server API
- `OPERATOR_VAULT_ROLE` should be set to the name of the Vault role that is used when accessing credentials. This defaults to "messaging-topology-operator"


Suggestion: The setup.sh script creates this role with the name messaging-topology-operator


@MarcialRosales MarcialRosales left a comment


Thanks a lot @georgeharley for your contribution! Greatly appreciated.
I have only small suggestions to make on your latest changes.
Let me know if you are OK with them and we can proceed with the merge.

* Improve description of setup and test in README
* Have the RabbitMQ cluster use a separate Vault role and policy
  to that used by the Messaging Topology operator
* Cater for the setup and scripts being run in Kubernetes namespaces
  other than default
@georgeharley
Author

Thank you for the detailed feedback @MarcialRosales which definitely improves things. I have just pushed a further commit now that I hope addresses your suggestions 🤞

Contributor

@ChunyiLyu ChunyiLyu left a comment


LGTM Thank you for contributing!

@MarcialRosales

MarcialRosales commented Jan 19, 2022

@georgeharley We noticed that most of the integration tests are failing because the default-user's k8s secret has no username and/or password; however, we are validating them here (https://github.com/georgeharley/messaging-topology-operator/blob/239-vault-cluster-authentication/internal/cluster_reference.go#L115). Before this PR we only validated these fields in the RabbitMqClient implementation. The controller test cases, however, only use a fake implementation of the RabbitmqClient, hence they are not affected by the fact that the default-user's k8s secret has no username/password in it.

We prefer to keep the validation in one place, in the RabbitmqClient, rather than in two places. Of course, an alternative would have been to modify suite_test.go to fill the default-user's k8s secret with a username and password, but we prefer the former.

Is that ok? It is just a matter of removing the username and password validation if-statements from here.

We prefer to merge after the whole pipeline is happy.

* Instead of throwing errors if the cluster K8s secret does not contain
  username and/or password values just log the findings
@georgeharley
Author

@MarcialRosales that sounds fair enough. I've just pushed an update that logs when the credentials are not found in the secret rather than raising errors. Hope that approach is OK?

@MarcialRosales MarcialRosales merged commit e8b60a7 into rabbitmq:main Jan 20, 2022
@georgeharley georgeharley deleted the 239-vault-cluster-authentication branch January 20, 2022 10:34