
Conversation

@georgeharley

This relates to #239

Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed
Note to contributors: remember to re-generate client set if there are any API changes

Summary Of Changes

We've built a version of the topology operator from these draft changes and carried out some spike testing to verify that it is providing the desired behaviour. A quick inspection of the code will reveal that some tactical decisions were required in order to pull things together.

  • for the operator to get the Vault server address the VAULT_ADDR environment variable is set on the container spec
  • the operator's Vault role is currently hard coded
  • without a secret binding it was necessary to find another way to determine the name of the cluster's K8s service. We have gone with using the same name as the cluster, as the two appear to match. Not sure if that is a safe assumption to make
  • as per earlier issue comments, the Vault path is coming from the cluster's secretBackend.vault.defaultUserPath
  • the Vault auth path and namespace used by the operator are coming from annotations on the cluster's secretBackend.vault (vault.hashicorp.com/auth-path and vault.hashicorp.com/namespace respectively)
  • currently the operator connects to Vault and fetches the required credentials on every run of the reconcile logic. We weren't sure if there was an optimisation opportunity here: creating the Vault connection once on operator startup and then re-using the same connection as necessary. Not that there's anything particularly slow about the performance we are seeing
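The caching idea in the last bullet can be sketched with a `sync.Once` guard, so the client is built on the first reconcile and reused afterwards. This is a minimal illustration only; `secretStoreClient` and `getClient` are hypothetical names, not the operator's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// secretStoreClient stands in for a Vault-backed client; the type and
// function names here are illustrative only.
type secretStoreClient struct {
	addr string
}

var (
	clientOnce   sync.Once
	cachedClient *secretStoreClient
)

// getClient builds the client exactly once (on the first reconcile) and
// returns the same cached instance on every subsequent call, instead of
// dialling Vault on each run of the reconcile logic.
func getClient(addr string) *secretStoreClient {
	clientOnce.Do(func() {
		cachedClient = &secretStoreClient{addr: addr}
	})
	return cachedClient
}

func main() {
	a := getClient("https://vault.example:8200")
	b := getClient("ignored-after-first-call")
	fmt.Println(a == b) // prints true: the same instance is reused
}
```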


@MarcialRosales MarcialRosales left a comment


The installation instructions would need to be very explicit with regard to setting up the VAULT_ADDR environment variable in the operator's manifest. This is the kind of installation step people miss. However, if someone missed it, the messaging-topology-operator's logs will clearly indicate that the operator is trying to access Vault on the loopback URL.

We discussed internally whether there could be a way of figuring out Vault's address without having to manually configure it via an environment variable. It was certainly not trivial, and for now we consider your approach good enough.

// if role == "" {
// return nil, errors.New("no role value set in Vault secret backend")
// }
role := "messaging-topology-operator"


We discussed internally the pros and cons of using a dedicated role for this operator versus using the role defined in the cluster definition (rabbitmqv1beta1.VaultSpec.Role), and the conclusion is that what you did, i.e. a dedicated role, is the best solution.

Just for reference here was our analysis:

messaging topology operator having its own role:

Pros:

  • If we ever needed to prevent the messaging-topology-operator from operating on any cluster we would only need to remove its role from Vault
  • It allows us to grant different permissions to the cluster-operator and messaging-topology-operator. For instance, the cluster-operator needs to request private key and certificate in addition to looking up secrets whereas the messaging-topology-operator only needs to lookup secrets.

Cons:

  • If we wanted fine-grained control over which clusters the messaging-topology-operator can operate on, we would have to modify the policy associated with the role so that it includes/excludes the secret paths the operator can access

messaging topology operator using the cluster's role defined in `secretBackend.vault.role`

Pros:

  • We can easily configure which clusters the messaging-topology-operator can operate on when we declare the role. For instance, in the example below, we allow the messaging-topology-operator to operate on the cluster rabbitmq-1 simply by adding messaging-topology-operator to the bound_service_account_names list.
vault_exec "vault write auth/kubernetes/role/rabbitmq-1-role \
 bound_service_account_names=rabbitmq-1,messaging-topology-operator \
 bound_service_account_namespaces=$RABBITMQ_NAMESPACE \
 policies=rabbitmq-policy ttl=24h"

Cons:

  • If we ever wanted to prevent the messaging-topology-operator from operating on any cluster, we would have to go over all the roles and remove its SA from bound_service_account_names. Whereas if we used a dedicated role for the operator, it is just a matter of removing the role, as I mentioned earlier.

// use the Vault token for making all future calls to Vault
vaultClient.SetToken(vaultToken)

return VaultClient{Reader: &VaultSecretReader{client: vaultClient}}, nil


Have you considered caching the VaultClient or the vaultToken? I would imagine authentication is more expensive compared to looking up a secret's path. I guess we could cache the vaultToken, or the actual initialized VaultClient, and reuse it over and over again until the token expires.

I am concerned with the messaging-topology-operator overloading Vault if it has to reconcile way too many resources.

This does not need to be done now. But it is worth considering for future releases.

Contributor


I would second what @MarcialRosales wrote: since applications will typically create hundreds to thousands of queues and bindings, it is well worth caching the Vault session.

Author


As a strawman, I've made a change to vault_reader.go so that after the initial login to Vault the returned token is periodically renewed in a background thread. The maximum TTL of the returned token is also respected, so that when it is reached a new Vault login is triggered to retrieve a fresh token for subsequent renewals. Would you be happy with this kind of approach?
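The renewal schedule described above can be modelled as follows. This is a toy sketch with purely illustrative TTL numbers; the real code renews based on the lease returned by Vault:

```go
package main

import "fmt"

// renewUntilMaxTTL models the background renewal loop: the token is renewed
// once per lease period until the token's maximum TTL is reached, at which
// point a fresh login is needed. TTLs are in seconds and purely illustrative.
func renewUntilMaxTTL(leaseTTL, maxTTL int) (renewals int) {
	for elapsed := leaseTTL; elapsed < maxTTL; elapsed += leaseTTL {
		renewals++ // renew the existing token in the background thread
	}
	return renewals // after this, a new Vault login retrieves a fresh token
}

func main() {
	// A token with a 60s lease and a 300s max TTL is renewed 4 times
	// before a fresh login is required.
	fmt.Println(renewUntilMaxTTL(60, 300)) // prints 4
}
```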


@georgeharley Sorry I did not reply directly to your question. Yes, I am happy with your approach, and in fact it adheres to the Vault Go client library's recommendation. I am happy to leave the TTL up to what the server says rather than providing one.

@georgeharley georgeharley force-pushed the 239-vault-cluster-authentication branch from a4bef88 to e60c9eb Compare November 14, 2021 15:03
return
}

_, err = login(vaultClient, vaultSpec)


This is a minor issue: when the client is first initialized, it looks like we are logging in twice. Once from the InitializeClient() function, so that we can fail fast if the credentials are wrong before scheduling the credentials-renewal goroutine. And then again, within the renewal goroutine.

If we wanted to fix it, we would have to pass the first vaultLoginResp on to the renewToken function and keep an index in the for loop that tells us whether this is the first iteration and hence whether to use the vaultLoginResp from the parameter list, or instead ignore it and log in again. The code would certainly not be as clean as the current one.

Author


Well spotted, and apologies for not mentioning it earlier as it was something that caused me some concern. You are correct that on startup there are two Vault logins in very quick succession, so we end up with a log that contains something like the following:

{"level":"info","ts":1637150459.7343986,"msg":"Authenticating to Vault"}
{"level":"info","ts":1637150459.746867,"msg":"Authenticating to Vault"}

To avoid this I also looked at keeping the login() call out of InitializeClient() altogether and only carrying it out in the renewal thread, but that turned out worse IMHO since it created a tiny window of time where the Vault client was trying to carry out its work without a valid Vault token, and the log output contained a smattering of "unable to retrieve credentials" messages, which looked worse. In the interests of getting a strawman in front of you sooner rather than later I opted for the former approach, since it is not as messy in the logs and only happens on the first reconciliation after startup. I will see if I can take another pass at this.

Author


@MarcialRosales I've just pushed a small change to the code which stops the double logins occurring on startup. The InitializeClient() function no longer attempts a Vault login but instead waits for the renewToken() function to notify it of the outcome of its login attempt. The notification comes via a buffered channel passed to renewToken().

renewToken() only writes to the channel on the first login attempt. If that first attempt fails then it seemed like a good opportunity to fail fast and bail out of proceeding with the whole token renewal loop.

Have tried this on a couple of K8s clusters and now only see the one "Authenticating to Vault" log message on startup. In the unhappy scenario where the initial Vault login attempt fails, the "Lifecycle management of Vault token will not be carried out" message gets logged and all subsequent reconcile attempts fail with the Vault error.


@MarcialRosales MarcialRosales left a comment


It would have been much nicer if vault.Client were an interface rather than a struct. That way we could have mocked it rather than mocking/replacing implementation methods such as ReadVaultClientSecretFunc or LoginToVaultFunc.
But I am afraid it is what we have.

@georgeharley georgeharley force-pushed the 239-vault-cluster-authentication branch from 09d63f9 to 65ba095 Compare November 24, 2021 14:30
}

fakeSecretStoreClient = &internalfakes.FakeSecretStoreClient{}
fakeSecretStoreClient.ReadCredentialsReturns(fakeCredentialsProvider, nil)


@georgeharley When I run make unit-tests this test fails because fakeCredentialsProvider is nil at this line, which means the fakeSecretStoreClient will return a nil CredentialsProvider and the test fails. This BeforeEach block is invoked before the JustBeforeEach block of the Describe.
The behaviour of JustBeforeEach is explained here. It has also trapped me a few times.

One easy way to fix it is by moving the initialization of the fakeCredentialsProvider to the BeforeEach of this test case.

Author


When make unit-tests is run in my environment there don't appear to be any failing tests. Could be something weird in my setup, or else the randomising of the test specs is somehow always falling in my favour? FWIW I'm using Go version 1.17.2, Ginkgo version 1.16.5, and GNU Make version 3.81, all running on macOS Monterey 12.0.1.

I'll make the suggested fix to ensure the outcome is predictable.

Author


Test updated.

}

// Reduce load on Vault server in a problem situation where repeated login attempts may be made
time.Sleep(2 * time.Second)

@MarcialRosales MarcialRosales Nov 25, 2021


In the future, this could be replaced by an exponential backoff delay. I'd release it just as it is and, if necessary, change it to an exponential backoff delay later.

@MarcialRosales MarcialRosales self-requested a review November 26, 2021 09:23

@MarcialRosales MarcialRosales left a comment


@georgeharley thanks for implementing this feature in the messaging-topology-operator. We appreciate it a lot.
You have implemented all the changes very quickly.

I would say we are ready to merge it and we can make further improvements when necessary.

@georgeharley
Author

Thank you for all of your careful feedback on this @MarcialRosales, it is greatly appreciated and has really improved things.

I have just pushed a further small change which relates to the Vault role that the operator uses when logging in to the remote Vault server. The change is to look for the role name in an environment variable called OPERATOR_VAULT_ROLE set in the operator container specification. This proposed environment variable would be a peer of the existing OPERATOR_NAMESPACE and VAULT_ADDR variables (the change sticks with the OPERATOR_* prefix). Currently, if OPERATOR_VAULT_ROLE is not set, the code falls back to using a default role of "messaging-topology-operator", with the log messages clarifying the actual role being used for each login attempt.

@georgeharley
Author

If the latest proposal is deemed acceptable then I would be very happy to contribute to the operator installation documentation regarding the Vault support.

@MarcialRosales MarcialRosales self-requested a review November 30, 2021 13:18
return result
}

func login(vaultClient *vault.Client, vaultSpec *rabbitmqv1beta1.VaultSpec) (*vault.Secret, error) {


@georgeharley I think I might have found a problem.

We do not want to log in to the Vault server per reconciliation request, hence we log in once and reuse Vault's authentication token. So far so good, but the problem is that to log in to Vault we first initialise the Vault client with two settings that come from the VaultSpec of the first cluster the topology operator has to interact with.

I am not sure whether, in a production environment, all RabbitMQ clusters are set up with the same vault.hashicorp.com/namespace and vault.hashicorp.com/auth-path annotations. If they all shared the same annotations, we could use the same Vault client for all clusters. Otherwise, we cannot.

Sorry for spotting it too late.

I detected this problem while doing a refactoring of your vault_reader. I was going to suggest a change, but I refrained from asking you to make refactoring changes after all the work you have put into it. Anyway, the refactorings I was doing were these two:

  • Avoid mocking implementation methods. Instead, mock up collaborator(s) such as Vault client
  • Reduce the amount of global variable/methods and singleton. While removing the singleton I found out the issue I described above.



@MarcialRosales @georgeharley
Interesting nuance... all part of the journey. If I may, I'll try to reiterate the problem and make a suggestion. Apologies in advance for the over-explanation, just confirming my understanding.

  1. We have two steps in play for Vault integration. Vault Auth vs Vault credential access.
  2. Vault Auth, utilizing the k8s backend, uses the following components:
    • Vault Address: Established via topology operator environment variable.
    • SA Token: Currently pulled from the topology-operator container.
    • Auth Role: Currently defaults to messaging-topology-operator.
    • Auth Path: Currently pulled from rabbit cluster CR.
    • Vault Namespace: Currently pulled from rabbit cluster CR.
  3. Once auth'ed, the client will request and be granted access to credentials (path sourced from the CR) based on policies setup on the Auth Role.
  4. The Vault Namespace and Auth Path on the CR is related to the Auth Role & SA used for the Rabbit Cluster rather than the Topology Operator.

MVP Suggestion:

It would seem as if we're close to a configuration where the Topology Operator would be able to cache the Vault token. The changes needed would be to transition the Auth Path & Vault Namespace components from being CR-sourced to being specific to the Auth Role, i.e. environment variables such as OPERATOR_AUTH_PATH and VAULT_NAMESPACE.

MVP Assumptions:

  • Same Vault endpoint is used by all clusters.
  • The same Vault Namespace, or no namespace, is used by all clusters.
  • The Topology Operator role will have access to all of the credential paths.

Future Considerations

Not sure if this would be appropriate, however...

Have the Topology Operator leverage all of the components of the Rabbit Cluster to login to Vault. This would require the Topology Operator to pull the token from the SA secret and use it along with the components in the CR. This would be more of an "assume role" type of pattern and remove the need to manage the Topology Operator Role. This would get a bit more involved as we may then find the need to cache the Vault Token per Rabbit cluster?


@MarcialRosales MarcialRosales Dec 2, 2021


@twhite0, I really appreciate your clear explanation!

Your MVP suggestion and assumptions make a lot of sense, especially for the MVP. The change we have to make is minimal. As you summarized it, we only need to source the auth path and Vault namespace from environment variables, alongside the role's environment variable. It makes a lot of sense and I think it is more correct.

Using a dedicated role+auth_path+namespace settings for the topology operator has advantages and disadvantages.
The advantage is that you have a separate vault access control policy for the cluster operator and for the messaging topology operator. The disadvantage is the extra administrative burden in Vault.

Your future consideration is really what I had in mind from the beginning. It seemed very logical: a Vault administrator sets the appropriate role, auth path, and namespace for a cluster, and this configuration is used by both the cluster and topology operators.
The disadvantage is that the access control is all-or-nothing per cluster, i.e. either both operators can operate on a cluster or neither can. And, as you pointed out, the messaging topology operator would have to cache a Vault token per cluster. The topology operator would have to use a bounded cache with max-size and TTL capabilities to limit the number of tokens cached.

If both strategies made sense, we could configure the topology operator with an additional setting, vault_auth_strategy, with two values: assume_role, where the topology operator assumes all the auth settings from the cluster, or dedicated_role, where the topology operator uses its own dedicated auth settings.

@georgeharley
Author

@MarcialRosales @twhite0 For your consideration I've just pushed a small change to this PR branch so that the Vault login step looks for the Vault namespace and auth path values in environment variables (OPERATOR_VAULT_NAMESPACE and OPERATOR_VAULT_AUTH_PATH respectively) rather than in annotations on the current cluster spec. Effectively, this provides the MVP / dedicated_role strategy described above.

A version of the operator built with these changes in place has behaved as expected when run alongside two separate RabbitMQ clusters. The Vault configuration changes required when the second RabbitMQ cluster was added amounted to the following steps:

  • vault kv put of the new credentials
  • vault policy write creation of a new policy for read access to the credentials
  • vault write update the existing operator role to bind it with the new Vault policy as well as the new cluster's k8s service account and namespace

I would be happy to provide more detailed configuration information or maybe examples as part of the operator documentation if this is the agreed direction.

@twhite0

twhite0 commented Jan 3, 2022

Happy New Year all!

@yaronp68 & @MarcialRosales: Would like to gently bump this Draft PR to identify next steps for acceptance.

@MarcialRosales

Happy New Year all !
@twhite0 Absolutely, we will do so very shortly. Just returned from a long vacation period.


@MarcialRosales MarcialRosales left a comment


Thanks a lot @georgeharley for making this last code change. We can add the second strategy assume_role when necessary.

The only thing left and given that you have kindly volunteered is to add an example.
I have barely drafted the README.md of the new example below. A suggestion for the example folder could be vault-support.
It is only a suggestion, please feel free to change the structure or even the scope of the test. The intention of the suggested test is to clearly demonstrate how to operate on a vault-enabled cluster in a secure manner (with granted access and without granted access). And also that the operator can operate indistinctly on vault-enabled and k8s-secret enabled clusters.

Here goes the suggested README.md:

HashiCorp Vault Support Example

The RabbitMQ Cluster Operator supports storing RabbitMQ default user (admin) credentials and RabbitMQ server certificates in
HashiCorp Vault as well as in K8s Secrets. Accordingly, the RabbitMQ Messaging Topology Operator can operate on clusters which are vault-enabled or k8s-secret-enabled.

As explained in this KubeCon talk, there are four different approaches in Kubernetes to consume external secrets:

  1. Direct API
  2. Controller that mirrors secrets into K8s
  3. Sidecar + MutatingWebhookConfiguration
  4. Secrets Store CSI Driver

The RabbitMQ Cluster Operator integrates with vault using the 3rd approach (Sidecar + MutatingWebhookConfiguration) whereas
the RabbitMQ Messaging Topology Operator uses the 1st approach (Direct API).

Vault-related configuration required

-> Minimum settings required (operator configuration and/or vault configuration) in order for the
messaging topology operator to be able to operate on any vault-enabled RabbitMQ cluster

-> Optional settings we can set in the messaging topology operator (OPERATOR_VAULT_NAMESPACE,
OPERATOR_VAULT_AUTH_PATH, OPERATOR_VAULT_ROLE) what they are for and their default value.

Usage

-> Here we explain how to run the test example and what it does

-> We need two scripts, similar to how we did it for the cluster operator, one for setting up the
prerequisites of the test example and another for the actual test case

-> Briefly explain that first we need to run the setup.sh script and what prerequisites it fulfils

-> Then explain the scope of the test.sh.
-> The proposed scope is the following:
-> create a queue in vault-enabled cluster where the operator has been granted access to the cluster in Vault
-> fail to create a queue in a vault-enabled cluster where the operator has not yet been granted access to the cluster
-> create a queue in a standard k8s-secret-enabled cluster. This demonstrates that the two security models can coexist
-> (The script has to deploy three clusters)
-> (The script terminates tearing everything down similar to the test.sh in the examples/vault-default-user in the
cluster operator)

George Harley added 12 commits January 5, 2022 17:25
* Vault client initialized at startup and passed to each
  controller
* Each controller passes Vault client in calls to ParseRabbitmqClusterReference()
  and receives a CredentialsProvider from which to pull username and
  password when accessing RabbitMQ client
* Basic fix up of broken tests
* More unit test code required
* Bumped cluster-operator dependency v1.8.3 -> v1.9.0
* WIP
…ator

* Intent is all controllers will be using the same Vault client when
  reading credentials during reconciliation phase
* WIP
* During each controller's reconcile phase the call to
  ParseRabbitmqClusterReference will be responsible for
  creating a Vault connection and reading credentials which
  then get returned to controller in a CredentialsProvider
* ParseRabbitmqClusterReference function will try and fetch
  credentials from Vault if cluster spec contains a Vault secret
  backend, otherwise will fallback to using K8s Secret in the
  cluster namespace
* More unit testing required
* WIP
* allow optional “vault.hashicorp.com/auth-path” to override
  auth path used when client logs into Vault
* allow optional “vault.hashicorp.com/namespace” to guide
  whether or not client should specify a target namespace
  in the remote enterprise Vault server
* speculative use of these annotations so this is all likely
  to change
* WIP
* Add some additional checks on Vault secret returned from
  authenticated read of Vault path
* To be undone when a suitable role injection mechanism is implemented
* A couple of unit tests have been made pending while the above is in place
* when K8s secrets not being used to supply cluster default user
  creds then cluster reader needs to get the cluster's K8s service
  name from somewhere else; based on observation that K8s service
  name appears to be the same as the cluster CR, going to just
  use the cluster CR name for now
* make existing Vault reader unit test less flaky
* WIP
* CredentialsProvider now has a single operation called Data() that
  returns a named credential attribute (e.g. username, password, etc)
  in a []byte
* Makes the interface more extensible
* WIP
* github.com/hashicorp/vault/api v1.1.1 -> v1.3.0
* github.com/onsi/gomega v1.16.0 -> v1.17.0
George Harley added 5 commits January 5, 2022 17:32
* SecretStoreClient created exactly once as part of first reconciliation
* Vault token owned by SecretStoreClient is continually renewed in a
  background thread that manages its lifecycle
* Lifecycle management thread is cognizant of Vault token's max TTL
  (defaults to 32 days) and will force a fresh Vault login when reached
* Vault logins are only made from the renewToken thread
* renewToken notifies main thread when first login attempt
  has been made by means of a buffered channel that is passed
  into it as an argument
* token lifecycle management only commences if the first Vault
  login attempt succeeds
* small delay added between Vault login attempts if there are
  problems renewing a token (e.g. if Vault server cannot be
  reached because of a temporary problem on the network)
* Ensure fakeCredentialsProvider is properly created before
  accessed in Ginkgo It block
* When authenticating to Vault the operator will use the role specified
  by the OPERATOR_VAULT_ROLE environment variable set on the operator
  container spec (a peer of the OPERATOR_NAMESPACE and VAULT_ADDR
  environment variables)
* If env var not set then default Vault role value of
  "messaging-topology-operator" will be used
* Vault login namespace now found from OPERATOR_VAULT_NAMESPACE env var
  in the operator container. If not set then use default Vault namespace
* Vault login auth path now found from OPERATOR_VAULT_AUTH_PATH env var
  in the operator container. If not set then default to using the
  auth path "auth/kubernetes"
* Above means that Vault login step no longer uses any information from
  the current RabbitMQ cluster spec
@MarcialRosales

@twhite0 I am currently adding the OSS docs for the Vault support in the messaging topology operator at https://github.com/rabbitmq/rabbitmq-website/tree/vault-docs-topology-operator

* setup.sh installs Vault to current cluster and configures it
  for the work carried out in the test.sh script
* test.sh goes through a series of steps to demonstrate how the
  operator can use Vault for accessing admin credentials for a
  RabbitMQ cluster
* README.md describes the basic flow of the example as well as
  stating that the RabbitMQ Cluster operator and Messaging Topology
  operators are pre-requisites for this example
@georgeharley georgeharley force-pushed the 239-vault-cluster-authentication branch from b3e8a95 to 09e6360 Compare January 11, 2022 18:54
* Replace sleeps of 30 seconds after the Message Topology operator
  is re-deployed with a function that waits until the operator
  deployment log contains an indication that its initialisation is
  complete and it is ready to handle reconciliations
@georgeharley
Author

@MarcialRosales I have added a strawman example under docs/examples/vault-support as suggested. I think it covers the main objectives you were looking for although only two RabbitMQ clusters are involved. Very happy to follow any recommendations or advice for improving this, including changing the whole test scenario if you think it fails to put across the necessary points.

@MarcialRosales MarcialRosales marked this pull request as ready for review January 13, 2022 07:50
necessary for the operator container to have one or more environment variables
set.

- `VAULT_ADDR` should be set to the URL of the Vault server API


I believe this is the only mandatory setting, isn't it? The other settings are only required if the default value is not applicable.

Author


Yes, that is correct. I will amend the wording in that README to make that clear. Thanks.

Run the [setup.sh](./setup.sh) script to install Vault to the current
Kubernetes cluster. This will also create the necessary credentials,
policy, and configuration in preparation for running the example.


@MarcialRosales MarcialRosales Jan 13, 2022


Suggestion:

  • it creates the credentials for the cluster cluster-vault-a
  • It creates a policy that grants read access to the credentials of the cluster-vault-a
  • It creates the role messaging-topology-operator, which allows the service accounts of the RabbitMQ cluster cluster-vault-a and of the messaging-topology-operator to read the cluster credentials.

# bound_service_account_names values follow the pattern "<RabbitmqCluster name>-server”.
# bound_service_account_namespaces values must include rabbitmq-system (where the messaging topology operator is deployed) and the namespaces where the RabbitmqClusters are deployed.
#vault_exec "vault write auth/kubernetes/role/messaging-topology-operator bound_service_account_names=cluster-vault-a-server,cluster-vault-b-server,vault-tls-server,messaging-topology-operator bound_service_account_namespaces=$RABBITMQ_NAMESPACE,rabbitmq-system policies=cluster-vault-a-policy,cluster-vault-b-policy ttl=24h"
vault_exec "vault write auth/kubernetes/role/messaging-topology-operator bound_service_account_names=cluster-vault-a-server,vault-tls-server,messaging-topology-operator bound_service_account_namespaces=$RABBITMQ_NAMESPACE,rabbitmq-system policies=cluster-vault-a-policy ttl=24h"


I would personally prefer to have different roles: one for the RabbitMQ clusters to access their credentials and a different one for the messaging topology operator. It is especially relevant when we enable TLS in the cluster. In that particular case, the cluster needs write access to the Vault path whereas the messaging topology operator will only ever need read access.

It is more work because we need an extra policy and role defined in Vault and some adjustments to the cluster definitions. But I think it is worth the effort.

Also, I think the vault-tls-server service account is not necessary for this example when we define the role.
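
A minimal sketch of that split, following the vault_exec convention used by setup.sh (the policy names and the secret path below are assumptions for illustration, not taken from the example):

```shell
# Hypothetical read/write policy for the cluster itself (needed e.g. when TLS material is written)
vault_exec "vault policy write cluster-vault-a-rw-policy - <<EOF
path \"secret/data/rabbitmq/cluster-vault-a/*\" {
  capabilities = [\"create\", \"update\", \"read\"]
}
EOF"

# Hypothetical read-only policy for the topology operator
vault_exec "vault policy write cluster-vault-a-ro-policy - <<EOF
path \"secret/data/rabbitmq/cluster-vault-a/*\" {
  capabilities = [\"read\"]
}
EOF"

# Separate roles bind each policy to only the service accounts that need it
vault_exec "vault write auth/kubernetes/role/cluster-vault-a bound_service_account_names=cluster-vault-a-server bound_service_account_namespaces=\$RABBITMQ_NAMESPACE policies=cluster-vault-a-rw-policy ttl=24h"
vault_exec "vault write auth/kubernetes/role/messaging-topology-operator bound_service_account_names=messaging-topology-operator bound_service_account_namespaces=rabbitmq-system policies=cluster-vault-a-ro-policy ttl=24h"
```

With this shape the operator's role never carries write capabilities, so a compromised operator token cannot alter the stored credentials.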

    loopback_users = none
secretBackend:
  vault:
    role: messaging-topology-operator


see comment further down in relation to separating the role for the cluster operator from that of the messaging topology operator
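
For context, the snippet above sits inside the RabbitmqCluster spec. An illustrative fuller fragment might look as follows (the defaultUserPath value and annotation values are assumptions, not taken from the example; the PR summary states the operator reads the auth path and Vault namespace from these annotations):

```yaml
spec:
  rabbitmq:
    additionalConfig: |
      loopback_users = none
  secretBackend:
    vault:
      role: messaging-topology-operator               # Vault role used when authenticating
      defaultUserPath: secret/data/rabbitmq/cluster-vault-a/creds  # assumed path
      annotations:
        vault.hashicorp.com/auth-path: auth/kubernetes  # Vault auth mount path
        vault.hashicorp.com/namespace: ns1              # Vault (not Kubernetes) namespace
```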

# Create a policy that allows reading of the default user credentials for the RabbitMQ cluster cluster-vault-a
# The path must be referenced from the RabbitmqCluster cluster-vault-a CRD spec.secretBackend.vault.defaultUserPath
echo "Creating Vault policy named cluster-vault-a-policy for reading of cluster-vault-a credentials..."
vault_exec "vault policy write cluster-vault-a-policy - <<EOF

@MarcialRosales MarcialRosales Jan 13, 2022


I wonder whether it would be useful to add a section that discusses authorization strategies.
In this example, as it stands, we are suggesting a strategy whereby each cluster gets its own policy and role. It offers a great deal of security but I am not sure it is scalable.

Another approach is to use a single policy and role for a group of clusters, using wildcards in the policy's paths. For instance, have a policy which grants access to the path secret/data/rabbitmq/finance-group/*. A cluster whose name is "orders" would retrieve its credentials from the path secret/data/rabbitmq/finance-group/creds. All the clusters in that group must use the same role, e.g. finance-group-role. Any time we create a cluster, we have to include the cluster's service account in the role's bound_service_account_names attribute.
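
The group-based alternative could be sketched in the same vault_exec style as setup.sh (all names here are hypothetical illustrations of the idea, not part of the example):

```shell
# One shared read policy covering every cluster in the "finance-group" via a wildcard path
vault_exec "vault policy write finance-group-policy - <<EOF
path \"secret/data/rabbitmq/finance-group/*\" {
  capabilities = [\"read\"]
}
EOF"

# One shared role; each newly created cluster's service account must be
# appended to bound_service_account_names.
vault_exec "vault write auth/kubernetes/role/finance-group-role bound_service_account_names=orders-server,messaging-topology-operator bound_service_account_namespaces=\$RABBITMQ_NAMESPACE,rabbitmq-system policies=finance-group-policy ttl=24h"
```

The trade-off is coarser isolation: any cluster in the group can read any other group member's secrets.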


# Update the Messaging Topology operator deployment, providing it with the Vault related environment variables
echo "Updating Messaging Topology operator to work with Vault..."
kubectl set env deployment/messaging-topology-operator -n rabbitmq-system --containers='manager' OPERATOR_VAULT_ROLE=messaging-topology-operator VAULT_ADDR=http://vault.default.svc.cluster.local:8200


Minor issue: VAULT_ADDR assumes we ran setup.sh from the default namespace.
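
One way to avoid the hard-coded namespace would be to parameterise the command, along these lines (the VAULT_K8S_NAMESPACE variable is a hypothetical name introduced for this sketch):

```shell
# Derive VAULT_ADDR from the namespace Vault was deployed in, defaulting to "default"
VAULT_K8S_NAMESPACE=${VAULT_K8S_NAMESPACE:-default}
kubectl set env deployment/messaging-topology-operator -n rabbitmq-system --containers='manager' \
  OPERATOR_VAULT_ROLE=messaging-topology-operator \
  VAULT_ADDR="http://vault.${VAULT_K8S_NAMESPACE}.svc.cluster.local:8200"
```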


- `VAULT_ADDR` should be set to the URL of the Vault server API
- `OPERATOR_VAULT_ROLE` should be set to the name of the Vault role that is used when accessing credentials. This defaults to "messaging-topology-operator"
- `OPERATOR_VAULT_NAMESPACE` may be set to the Vault namespace to use when the Messaging Topology operator is authenticating. If not set then the default Vault namespace is assumed
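
After deploying, the effective values can be checked directly on the running operator container, for example:

```shell
# Print the Vault-related environment variables the operator actually sees
kubectl exec -n rabbitmq-system deploy/messaging-topology-operator -- \
  printenv | grep -E 'VAULT_ADDR|OPERATOR_VAULT'
```

If VAULT_ADDR is missing here, the operator falls back to the loopback URL, which shows up clearly in its logs.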


Minor suggestion: add a link to the Vault namespace documentation so that it is not confused with Kubernetes namespaces.

set.

- `VAULT_ADDR` should be set to the URL of the Vault server API
- `OPERATOR_VAULT_ROLE` should be set to the name of the Vault role that is used when accessing credentials. This defaults to "messaging-topology-operator"


Suggestion: The setup.sh script creates this role with the name messaging-topology-operator


@MarcialRosales MarcialRosales left a comment


Thanks a lot @georgeharley for your contribution! Greatly appreciated.
I have only small suggestions to make on your latest changes.
Let me know if you are OK with them and we can proceed with the merge.

* Improve description of setup and test in README
* Have the RabbitMQ cluster use a separate Vault role and policy
  to that used by the Messaging Topology operator
* Cater for the setup and scripts being run in Kubernetes namespaces
  other than default
@georgeharley
Author

Thank you for the detailed feedback @MarcialRosales which definitely improves things. I have just pushed a further commit now that I hope addresses your suggestions 🤞

Contributor

@ChunyiLyu ChunyiLyu left a comment


LGTM Thank you for contributing!

@MarcialRosales

MarcialRosales commented Jan 19, 2022

@georgeharley We noticed that most of the integration tests are failing because the default-user's k8s secret has no username and/or password; however, we are validating them here (https://github.com/georgeharley/messaging-topology-operator/blob/239-vault-cluster-authentication/internal/cluster_reference.go#L115). Before this PR we only validated these fields in the RabbitMqClient implementation. The controller test cases, however, only use a fake implementation of the RabbitmqClient, hence they are not affected by the fact that the default-user's k8s secret has no username/password in it.

We prefer to keep the validation in one place, in the RabbitmqClient, rather than in two places. Of course, an alternative would have been to modify suite_test.go to fill the default-user's k8s secret with a username and password, but we prefer the former.

Is that ok? It is just a matter of removing the username and password validation if-statements from here.

We prefer to merge after the whole pipeline is happy.

* Instead of throwing errors if the cluster K8s secret does not contain
  username and/or password values just log the findings
@georgeharley
Author

@MarcialRosales that sounds fair enough. I've just pushed an update that logs when the credentials are not found in the secret rather than raising errors. Hope that approach is OK?

@MarcialRosales MarcialRosales merged commit e8b60a7 into rabbitmq:main Jan 20, 2022
@georgeharley georgeharley deleted the 239-vault-cluster-authentication branch January 20, 2022 10:34