This blog post is a condensed version of my presentation at the 21st meetup of the Cloud Native technology community. You can find the slides here
If you are developing and maintaining applications, at some point you have to rebuild infrastructure, change the passwords used between services, and share those new credentials. You also have to revoke access after a compromise. Common sensitive material that engineers handle daily includes:
- API keys or tokens
- SSH credentials
- Encryption keys (SSH, TLS/SSL)
We usually call this sensitive material secrets. A secret is any confidential information used for authentication, authorization or encryption. Some examples from real-world deployments:
- A web server may use a TLS certificate and the corresponding private key, database credentials, or API keys to contact other remote services.
- A RADIUS server may use a shared secret with its clients, the network devices that connect to it.
- A database may use master credentials (username and password), as well as a symmetric key for data encryption at rest.
If you are using a GitOps operational framework, secrets are usually loaded by the pipeline that deploys the application or infrastructure. Storing secrets in a git repository, even encrypted with PGP, git-crypt, ansible-vault or alternatives, as is often the case for small teams, does not scale as more people work on a project. It’s hard to audit who accessed the secrets or to revoke access for people who are no longer on the project, so you end up rotating secrets all the time.
The sane approach used in the industry is to keep secrets tightly controlled in a single source of truth and load them over the network from a secret management solution.
Once this workflow is in place, every deployment can be easily on-boarded to it.
When deciding on the appropriate solution for our use case, let’s write down the requirements for a secret store:
- Single source of truth in the form of a highly available database
- API interface for programmatic access
- Encryption at rest
- Detailed auditing - who accessed what & when
- Ability to revoke access
- Multiple authentication methods that could be mapped on the existing authentication workflow
To name a few secret stores in public use:
We at Deutsche Telekom Pan-Net are using HashiCorp Vault Open Source internally, so the rest of this blog is mostly about this solution.
I often use Vault with a Gitlab pipeline, where authentication to Vault takes place and either secrets are loaded from the Vault key/value back-end to be used in the pipeline directly, or a Vault token is exposed as an environment variable, which is later picked up by Ansible. Ansible roles then expose the secrets as variables for later use.
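As a rough sketch of that flow, a GitLab job can log in to Vault with the job’s JWT and export the loaded secrets; the role name, auth mount, secret path and variable names below are assumptions, not taken from our actual pipelines:

```yaml
# Hypothetical GitLab CI job: authenticate to Vault with the job's JWT,
# then load a secret into an environment variable for the deployment step.
load_secrets:
  image: hashicorp/vault:latest
  variables:
    VAULT_ADDR: "https://vault.example.com:8200"
  script:
    - export VAULT_TOKEN="$(vault write -field=token auth/jwt/login role=my-project jwt=$CI_JOB_JWT)"
    - export DB_PASSWORD="$(vault kv get -field=db_pass secret/myapp)"
    - ./deploy.sh   # or hand VAULT_TOKEN over to Ansible instead
```

The JWT login means the pipeline never stores a long-lived Vault credential; the short-lived token it receives is scoped by the role’s policies.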
Basic usage of the Vault key/value back-end is covered in the HashiCorp documentation.
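For a quick flavour of it, the CLI round trip looks roughly like this; it assumes a running Vault (for example a dev server started with `vault server -dev`) with `VAULT_ADDR` and `VAULT_TOKEN` exported, and the path and keys are made up:

```shell
# Store a secret with two fields under the KV v2 mount "secret/"
vault kv put secret/myapp db_user=app db_pass=s3cret

# Read back a single field (useful in scripts and pipelines)
vault kv get -field=db_pass secret/myapp

# KV v2 keeps versions, so older values can still be retrieved
vault kv get -version=1 secret/myapp
```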
Ansible roles can be found on the Pan-Net GitHub page:
After four years of operating the secret store, I can say that the biggest pain point for our use case was the missing self-service. Most of the basic functionality Vault provides is quite easy to use; reading or storing data is just a few API calls. But enrolling new users is not trivial. Even though Vault supports various identity providers and we were able to integrate it with our identity management solution, users have to authenticate against specific roles that must be provisioned in Vault up front. Otherwise, even after a successful authentication, they would either be able to access all the secrets across projects, or none at all.
The Vault operators must create various types of access control policies and map them to those roles.
This does not scale automagically, so we had to develop a self-service layer on top of Vault that takes care of policy and role provisioning.
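To illustrate what such provisioning automates, here is a minimal sketch of onboarding one team: a policy scoping them to their own key/value subtree, mapped to their identity-provider group. The team name, paths and the LDAP auth mount are made-up examples:

```shell
# Write a per-team policy file limiting access to the team's KV subtree
cat > team-alpha.hcl <<'EOF'
path "secret/data/team-alpha/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}
EOF

# The following steps require a running Vault with an LDAP auth back-end:
# vault policy write team-alpha team-alpha.hcl
# vault write auth/ldap/groups/team-alpha policies=team-alpha
```

A self-service layer essentially repeats these two steps for every new team or project, which is exactly the part that does not scale by hand.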
More information about the self-service we built at Pan-Net could be found in the Pan-Net Collection of CI/CD security best practices.
Limitations of Vault Open Source
You will run into the limitations once you want to bulletproof your production deployment. Some of the bottlenecks I came across are:
Georedundancy is an enterprise feature, which kind of makes sense from the business perspective. In the open-source version you can either automate cluster snapshot restores to a disaster recovery site, or stretch a single cluster across multiple regions.
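With integrated (Raft) storage, the snapshot-based approach boils down to something like the following; the backup path and the scheduling around it are assumptions, and the commands need a running Vault with appropriate permissions:

```shell
# Periodically capture a consistent snapshot of the whole cluster state
vault operator raft snapshot save /backups/vault-$(date +%F).snap

# On the disaster recovery site, after standing up a fresh cluster:
# vault operator raft snapshot restore /backups/vault-2023-01-01.snap
```

Note that restoring a snapshot brings back the data, but the new cluster still has to be unsealed with the original unseal keys.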
There is no PKCS #11 support in any encryption back-end, which means that you cannot use a hardware security module (HSM) for the encryption operations. The most affected back-ends are the transit back-end, which is used to encrypt/decrypt/sign/verify data via an API, and the PKI back-end, which is a full-blown API-based certificate authority. I have opened a feature request for the PKI one, but if I were HashiCorp, I would ship it as an enterprise feature…
Unsealing can’t be easily automated due to the previously mentioned limitation. Not everybody uses HSMs, as they are usually expensive, but they are a killer feature for auto-unsealing. You can use key management systems from popular cloud providers instead, even if you are on-premise, but if you don’t want to integrate with a third party, you are on your own. The transit back-end is supported for auto-unsealing, but that is a chicken-and-egg situation if we are talking about the first cluster. I have seen solutions that automated unsealing by distributing key shares between multiple virtual machines that monitored the cluster health.
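For completeness, transit auto-unseal is configured with a stanza like the one below in the Vault server config; the address, token, key and mount names are made-up examples. The stanza also makes the chicken-and-egg problem visible: the “central” Vault holding the unseal key must already be up and unsealed.

```hcl
# Server config stanza: this Vault asks another Vault's transit back-end
# to decrypt its master key on startup instead of prompting for key shares.
seal "transit" {
  address    = "https://central-vault.example.com:8200"
  token      = "s.xxxxxxxx"      # token with access to the unseal key
  key_name   = "autounseal"
  mount_path = "transit/"
}
```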
Audit settings are not propagated in HA mode. When you set up an audit back-end in a highly available cluster, you have to do so on each node separately. To my knowledge, audit back-end settings are not propagated within a cluster in the open-source version of Vault.
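In practice that means looping over the nodes’ local API addresses rather than going through the load balancer; the hostnames below are examples and the loop assumes a token valid on every node:

```shell
# Enable the file audit device on each cluster member individually
for node in vault-1.example.com:8200 vault-2.example.com:8200 vault-3.example.com:8200; do
  VAULT_ADDR="https://$node" vault audit enable file file_path=/var/log/vault_audit.log
done
```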
Advice if you are deploying your own Vault solution
Secrets management usually reflects the organizational structure, so I recommend starting with the authentication flow. If you are using Gitlab, Github, a popular cloud provider, or your own identity provider like Keycloak, you can map the identity lifecycle to this solution, and half of your work is done. Creating roles for authorization is a different story, but it can be prototyped quite easily.
Use dynamic secrets engines for new infrastructure. If you are managing database users, move the user creation to Vault. If you are using SSH access to virtual machines, let Vault be your SSH certificate authority or a one-time SSH password issuer.
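The SSH certificate authority setup, for example, is only a few calls; the role name, allowed user and TTL below are assumptions, the commands need a running Vault, and the generated CA public key must be trusted by the target hosts’ sshd (TrustedUserCAKeys):

```shell
# Turn Vault into an SSH certificate authority
vault secrets enable ssh
vault write ssh/config/ca generate_signing_key=true

# A role describing what certificates ops engineers may request
vault write ssh/roles/ops key_type=ca allow_user_certificates=true \
    allowed_users="ubuntu" ttl=30m

# Sign a user's public key to obtain a short-lived certificate:
# vault write -field=signed_key ssh/sign/ops public_key=@$HOME/.ssh/id_rsa.pub
```

Short-lived certificates sidestep the whole problem of distributing and revoking static SSH keys.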
Use policy templates so you don’t have to deal with many near-identical policies. Just define them properly with identity or group metadata.
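As a sketch, one templated policy can replace a policy per team, assuming entities are tagged with a metadata key; the key name “team” and the paths here are made-up examples:

```shell
# A single ACL policy whose path is resolved per-entity at request time
cat > team-kv.hcl <<'EOF'
path "secret/data/{{identity.entity.metadata.team}}/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}
EOF

# vault policy write team-kv team-kv.hcl   # requires a running Vault
```

Every entity then gets access only to its own team’s subtree, with exactly one policy to maintain.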
Use programming libraries like Python hvac for Vault setup automation.
Despite the limitations I mentioned earlier, I think HashiCorp Vault is a robust and stable solution and I can definitely recommend it.
Update - corrected the statement about using cloud-based auto-unseal on-premise. Thanks to the commenter.