Automating Internal Certificate Issuance With an ACME-based Certificate Authority

Posted Dec 30, 2021

This post is about the certificate authority (CA) that we brought to life at Deutsche Telekom Pan-Net and ran for a few years as a virtual machine (VM) deployment. We have recently moved to Kubernetes, and I cleaned up the old deployment of around 40 VMs that still lived in OpenStack. I took this cleanup as an opportunity to reflect on what we actually did there.
To lower the subsequent operational burden, we decided to use a certificate authority that leverages the Automated Certificate Management Environment (ACME) protocol for automated certificate issuance and renewal. The best-known CA software that implements the ACME protocol is Let’s Encrypt Boulder, and it was the only one with accessible source code when we started doing proofs of concept in 2018. That is no surprise, as Let’s Encrypt is behind the ACME standard itself.
I am in no way an expert on the Boulder code base, so take the description of the software in this post with a grain of salt. Also note that the Boulder version I am familiar with is an old one from 2019.

Architecture

The Boulder certificate authority has a microservice architecture consisting of multiple services, each handling a critical part of the job. The services communicate with each other over mutually authenticated, TLS-secured gRPC connections. That means the gRPC layer needs its own certificate authority that can sign certificates for each Boulder component. For testing, Let’s Encrypt engineers use minica, an easy-to-use private certificate authority for scenarios where all the services that need a TLS certificate are under your operation.
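
To illustrate what mutual authentication means in practice, here is a minimal sketch using Python's standard ssl module. It is not Boulder's code (Boulder is written in Go and uses gRPC); the function names are my own, and any file paths you pass in are assumed to point at certificates issued by a private CA such as minica.

```python
import ssl
from typing import Optional


def server_context(ca_file: Optional[str] = None,
                   cert_file: Optional[str] = None,
                   key_file: Optional[str] = None) -> ssl.SSLContext:
    """TLS context for a service that only accepts clients presenting
    a certificate signed by the internal CA."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # A server context does not ask for client certificates by default;
    # requiring one is what makes the connection mutually authenticated.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


def client_context(ca_file: Optional[str] = None,
                   cert_file: Optional[str] = None,
                   key_file: Optional[str] = None) -> ssl.SSLContext:
    """TLS context for a service calling another service. It verifies the
    server certificate and presents its own certificate when asked."""
    # A client context verifies the peer and checks the hostname by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

Both sides trust the same internal CA, and each side must hold its own certificate and key signed by it. That is why every Boulder component needs a certificate of its own.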

Only two Boulder components need inbound connections: the Web Front End API and the Online Certificate Status Protocol (OCSP) endpoint. The Publisher component needs an outgoing connection to publish data to certificate transparency (CT) logs, but this depends on your actual needs. To run a CA you must also have an SQL database, some kind of hardware security module, and an external CT log API available, so you can publish signed certificates for external audits. Of course, if you are running an internal CA you don’t need a real CT log API, but the software stack expects one to be available; without it, issuance is not possible unless you change the source code.

Components:

Web Front End
Exposes the ACME API to clients such as certbot or its alternatives. Two WFE components can coexist, each supporting a different ACME API version (version 1 or 2).
Hardware Security Module
A Hardware Security Module (HSM) is the device that protects the cryptographic material and exposes its functionality via the PKCS#11 interface.
SoftHSM is used as a flexible replacement for a physical HSM during Boulder software development.
Certificate Authority
The fundamental component that verifies that the public key matches the security requirements, then signs X.509 precertificates and certificates. The certificate authority component also signs all the OCSP responses.
The CA component communicates with the HSM via the PKCS#11 interface.
Registration Authority
The Registration Authority processes user account registrations and subsequent account updates. It also creates authorization objects and challenges, and calls the CA component to fulfill signing requests.
Validation Authority
Performs the actual validation of domain control and notifies the Registration Authority about the result. The VA component uses HTTP or a DNS TXT lookup to verify that the challenge delivered to the certificate requester was fulfilled.
Storage Authority
A data storage service for all Boulder components that have storage requirements. It talks directly to a database.
Database
Relational SQL database. As a backend, we used a MariaDB Galera Cluster.
OCSP Responder
Exposes a web server that responds to Online Certificate Status Protocol (OCSP) queries from clients.
OCSP Updater
Updates the revocation status of the certificates stored in the database.
Publisher
Publishes issued certificates to the Certificate Transparency Log.
Certificate Transparency Log
This is not a Boulder component; however, to make Boulder work you need some API that talks like an actual certificate transparency log. If you want to deploy Boulder internally, you can fake the log with the dummy CT log service ct-test-srv, which is included in the GitHub repository.
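
To make the Validation Authority's job more concrete, the sketch below computes what a client has to publish for the two challenge types mentioned above, following my reading of RFC 8555 (ACME) and RFC 7638 (JWK thumbprint). It is an illustration and not Boulder code; the function names and the example domain are made up.

```python
import base64
import hashlib
import json


def b64url(data: bytes) -> str:
    """Base64url encoding without padding, as used throughout ACME."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """RFC 7638 thumbprint: SHA-256 over the required JWK members,
    serialized in lexicographic order without whitespace."""
    required = {"RSA": ("e", "kty", "n"), "EC": ("crv", "kty", "x", "y")}[jwk["kty"]]
    canonical = json.dumps({k: jwk[k] for k in required},
                           sort_keys=True, separators=(",", ":"))
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def key_authorization(token: str, account_jwk: dict) -> str:
    """Ties the challenge token to the ACME account key."""
    return f"{token}.{jwk_thumbprint(account_jwk)}"


def http01_url(domain: str, token: str) -> str:
    """URL the validator fetches; the body must be the key authorization."""
    return f"http://{domain}/.well-known/acme-challenge/{token}"


def dns01_record(domain: str, token: str, account_jwk: dict) -> tuple:
    """Name and value of the TXT record the validator looks up."""
    digest = hashlib.sha256(key_authorization(token, account_jwk).encode("ascii")).digest()
    return f"_acme-challenge.{domain}", b64url(digest)
```

For a hypothetical domain service.example.internal, the validator would either fetch the URL returned by http01_url and compare the response body with the key authorization, or query the TXT record name returned by dns01_record and compare its value.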

Deployment and Operation

The main operations are operating system (OS) upgrades and Boulder binary upgrades. Another important operation is key management, where you need to rotate the certificate of the certificate authority itself.
We used the Terraform OpenStack Provider to deploy our VMs, with a blue/green deployment strategy, so OS and Boulder upgrades became a routine procedure. Provisioning would create a new blue or green VM deployment with a completely new OS image and software version, and then the pipeline would test component availability and actual certificate issuance. If everything worked, we would switch the active color on the load balancers and subsequently destroy the old infrastructure. The Boulder code base is developed in a professional way, where each new software version supports the config options of the previous one, so each upgrade can easily be rolled back if a problem occurs. Database migrations are released in the same spirit: a migration only extends tables, so the previous Boulder version can still use the new SQL schema. How the Let’s Encrypt (LE) engineers handle code changes can be learned by reading the contribution guide.
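
The blue/green procedure can be sketched as a small piece of control logic. This is a simplified illustration, not our actual pipeline: the four callables are hypothetical stand-ins for the Terraform run, the availability and issuance tests, the load balancer change, and the teardown.

```python
from typing import Callable


def other_color(color: str) -> str:
    """Return the idle color for the given active color."""
    return "green" if color == "blue" else "blue"


def upgrade(active: str,
            provision: Callable[[str], None],
            smoke_test: Callable[[str], bool],
            switch: Callable[[str], None],
            destroy: Callable[[str], None]) -> str:
    """Run one blue/green upgrade and return the color that is active afterwards."""
    candidate = other_color(active)
    provision(candidate)            # new VMs with a new OS image and Boulder version
    if not smoke_test(candidate):   # component availability and a test issuance
        destroy(candidate)          # the active deployment was never touched
        return active
    switch(candidate)               # point the load balancers at the new color
    destroy(active)                 # remove the old infrastructure
    return candidate
```

The important property is that the old color keeps serving traffic until the new one has passed its tests, so a failed upgrade costs nothing but the discarded candidate.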

The CA key lifecycle requires more planning. During the change, you need both the old key and the new key available, as the OCSP endpoints still need to issue valid statements about the leaf certificates issued under the previous CA key. The easiest method is to run a separate OCSP service, with its own DNS address, for each CA key. You also need to maintain the required components, like the CA, for each key.
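
A rough way to reason about how long the old OCSP service has to stay around is sketched below. The host names and the 90-day leaf lifetime are assumptions made for the example, and the model is deliberately simple: it only says that a responder is needed until the last leaf certificate signed by its key has expired.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class IssuerKey:
    name: str
    ocsp_url: str                               # hypothetical per-key OCSP address
    stopped_issuing: Optional[datetime] = None  # None while the key still signs


def ocsp_needed_until(key: IssuerKey, leaf_lifetime: timedelta) -> Optional[datetime]:
    """Moment when the last leaf certificate of this key expires,
    or None if the key is still issuing."""
    if key.stopped_issuing is None:
        return None
    return key.stopped_issuing + leaf_lifetime


def responders_to_run(keys: List[IssuerKey], now: datetime,
                      leaf_lifetime: timedelta) -> List[str]:
    """OCSP addresses that still have unexpired certificates to answer for."""
    active = []
    for key in keys:
        until = ocsp_needed_until(key, leaf_lifetime)
        if until is None or now < until:
            active.append(key.ocsp_url)
    return active


old = IssuerKey("ca-old", "http://ocsp-old.ca.example.internal", datetime(2021, 1, 1))
new = IssuerKey("ca-new", "http://ocsp-new.ca.example.internal")
print(responders_to_run([old, new], datetime(2021, 2, 1), timedelta(days=90)))
```

In this example both responders are still required on 1 February 2021, because certificates signed by the old key remain valid until 1 April 2021.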

Closing Thoughts

Although Let’s Encrypt Boulder is production software, I do not recommend using it for an internal deployment unless you are seriously prepared to allocate full-time engineers to deploy and maintain it. The reason is that Boulder is primarily the code base of the Let’s Encrypt public certificate authority, and as such its development moves towards targets defined by the Internet Security Research Group. You can’t expect fast support, except in cases where you find a critical bug.
These days, there are alternatives that are better documented and commercially supported, like the Smallstep ACME Registration Authority, Primekey EJBCA, and maybe others, where you do not need to watch for code changes and read commit messages.