Requirement: bind ten thousand domain names to the website and automatically issue HTTPS certificates for them.

Dealing with the clickbait title first#

The actual requirement differs a bit from the title, but it is hard to describe the full requirement in a short headline, so let's deal with the problem as the title states it first.

Suppose there are 10,000 known domain names and you want to issue certificates for them in bulk: say you already have a list of the domains (which sounds a bit shady). Then what needs to be done is:

  1. Use a script to point their DNS records at the site in bulk
    • For example, point them all at the IP w.x.y.z
  2. Run a service on w.x.y.z that can answer the HTTP challenge
    • Run cert-manager, feed it the 10,000 domain names, and let it issue the certificates
  3. Have the gateway listen on port 443 and serve the certificates and private keys produced by cert-manager
    • Depending on the gateway's capabilities, if it requires every domain name to be listed explicitly, use a script to generate the 10,000 routing configurations automatically

It may not be elegant, but "it's not like it can't be used.jpg"; after all, this is a one-off task, and it is safer to do it offline if possible (a rough sketch of step 2 follows).
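
As a rough sketch of step 2, here is one way to feed the list into cert-manager, creating one Certificate resource per domain. It assumes cert-manager is already installed, a ClusterIssuer named letsencrypt-http01 is configured for the HTTP challenge, and domains.txt holds one domain per line (all of these names are placeholders for this sketch):

#!/usr/bin/env bash
# Minimal sketch: one cert-manager Certificate per line in domains.txt.
# Assumes a ClusterIssuer named "letsencrypt-http01" already exists and is
# configured for the HTTP-01 challenge.
while read -r domain; do
  kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: cert-${domain//./-}
spec:
  secretName: tls-${domain//./-}
  issuerRef:
    name: letsencrypt-http01
    kind: ClusterIssuer
  dnsNames:
    - ${domain}
EOF
done < domains.txt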

Real Needs#

The real needs are:

  • xlog is a writing platform based on the crossbell chain
  • Users can generate custom subdomains such as jeff.xlog.app
  • Users can also bind their own domain names to their homepages
    • For example, x.jeff.wtf
  • After a user has bound a domain and pointed its DNS at xlog, an HTTPS certificate should be issued automatically, and the whole flow should be served over HTTPS.

cert-manager❎#

First, let's take a look at the old way of doing this: cert-manager.

Following the above train of thought, a straightforward idea is:

  1. When the user completes the binding operation on the xlog.app panel, send the domain name to cert-manager to start issuing the certificate.
  2. After obtaining the certificate, modify the gateway configuration and reload it (depending on the gateway used).
  3. When the user visits, the gateway already has the signed certificate, so it can directly establish an HTTPS connection.

Apart from a small amount of coupling between business (xlog) and infrastructure (cert-manager), there don't seem to be any major issues.

What's wrong?#

The problem with this solution is the timing of issuance:

  • If the user has not yet set up the DNS record, cert-manager cannot pass the HTTP challenge.
  • And how would cert-manager know that the domain has been pointed at us?
    • The simplest answer: if a request arrives at xlog's address with Host set to the domain to be bound, we treat it as the user's first visit and assume the DNS record is in place.
      • (This is easy to fake, but we lose nothing if it is.)

Therefore, issuance can only be triggered when the user actually makes a request. Once a request arrives, we can do the following:

  • Send the domain name to cert-manager for certificate issuance
  • Once the certificate is issued, update the configuration and reload the gateway
    • If there are multiple gateways for high availability, wait for all of them to finish reloading

This means that the user's first few (or more) requests are guaranteed to fail. Even with small optimizations such as retrying issuance on a timer, the delay is hard to control.

Why can issuance only be triggered by the gateway?#

As a supplement to the above, consider the following scenario:

  1. The user binds their own domain name on xlog
  2. But they haven't updated their DNS records
  3. Until one day they suddenly remember and add the record
  4. Once the record takes effect, they visit the site, and at that point an HTTPS connection should be established normally

A solution like "periodically attempt to issue certificates" wastes a lot of resources. Worse, it is an easy attack vector: as long as I keep binding domain names without ever pointing them at the server, I can make it waste resources without bound.

Therefore, certificate issuance must be triggered when the gateway receives a request with that domain name as the Host for the first time.

Traefik#

Additional explanation:

Although Caddy is usually the first web server that comes to mind for automatic HTTPS certificates, we use Traefik as our gateway, for the following reasons:

  1. Traefik natively comes with a Kubernetes Ingress Controller, which naturally supports k8s.
  2. Traefik Mesh makes it easy to run a service mesh inside a k8s cluster.
  3. From the day we started researching k8s gateways to the day this article was written, Caddy's Ingress Controller has remained a work in progress. To use Caddy in a k8s cluster, you either run that work-in-progress version or write your own Ingress Controller.

A note for time travelers: at the time of writing, Traefik is at v2.8.3.

As it happens, Traefik also has an automatic certificate issuance feature, so let's first see whether it meets this requirement.

After it is enabled, Traefik's automatic certificate issuance works like this:

  1. You write routing rules (IngressRoute), and Traefik reads either the tls.domains configuration or the hosts in the rules it matches.
  2. Traefik then automatically attempts to issue and renew certificates based on those IngressRoutes.
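
For reference, such an IngressRoute looks roughly like the sketch below; the entry point, service name, and certResolver name are placeholders and must match whatever is in your Traefik static configuration:

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: user-x-jeff-wtf
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`x.jeff.wtf`)
      kind: Rule
      services:
        - name: xlog
          port: 3000
  tls:
    # asks Traefik's ACME resolver to obtain a certificate for the matched Host
    certResolver: letsencrypt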

According to the above setting, what we need to do is:

  1. After the user binds the domain name on xlog, create an IngressRoute.
  2. Wait a minute, something seems wrong! Isn't this logic the same as sending it to cert-manager?

So let's try another approach:

  • We write a Traefik Middleware that automatically creates an IngressRoute the first time a newly resolved domain name reaches Traefik, so that a certificate gets issued.

Although it is a bit awkward and convoluted, it seems to be a perfect solution in terms of functionality, with only a few (perhaps) tolerable drawbacks:

  1. We need to write a Middleware and run a service.
    • The logic is roughly: check whether the domain name has been bound -> create an IngressRoute (see the sketch after this list).
    • If we want to use the gateway to securely issue certificates automatically, this kind of logic is unavoidable.
  2. At least the first visit still either fails or falls back to plain HTTP.
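
To make the middleware idea concrete, here is a very rough Go sketch of the "create an IngressRoute" half, using client-go's dynamic client. The namespace, service name, entry point, and resolver name are placeholders, and the trigger logic (first request with an unknown Host, plus the "has this domain been bound" check) is left out:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// The IngressRoute CRD as served by Traefik v2.x.
var ingressRouteGVR = schema.GroupVersionResource{
	Group:    "traefik.containo.us",
	Version:  "v1alpha1",
	Resource: "ingressroutes",
}

// createIngressRoute creates a per-domain IngressRoute; Traefik's certResolver
// then picks it up and requests a certificate for that domain.
func createIngressRoute(ctx context.Context, client dynamic.Interface, domain string) error {
	route := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "traefik.containo.us/v1alpha1",
		"kind":       "IngressRoute",
		"metadata":   map[string]interface{}{"name": "user-" + domain},
		"spec": map[string]interface{}{
			"entryPoints": []interface{}{"websecure"},
			"routes": []interface{}{map[string]interface{}{
				"match": fmt.Sprintf("Host(`%s`)", domain),
				"kind":  "Rule",
				"services": []interface{}{map[string]interface{}{
					"name": "xlog",
					"port": int64(3000),
				}},
			}},
			"tls": map[string]interface{}{"certResolver": "letsencrypt"},
		},
	}}
	_, err := client.Resource(ingressRouteGVR).Namespace("default").Create(ctx, route, metav1.CreateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// In the real service this would run when a bound but not-yet-routed
	// Host shows up for the first time; here we just create a single route.
	if err := createIngressRoute(context.Background(), client, "x.jeff.wtf"); err != nil {
		panic(err)
	}
}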

What's wrong?#

The problem is that Traefik's automatic certificate issuance feature is basically a toy:

  • The documentation (which you will only find if you read it line by line) states that Traefik 2.0 is designed as a completely stateless service, and multiple Traefik instances share nothing. So if you want Traefik to manage certificates, it has to be a single instance (or you can pay for the enterprise edition).
    • That paragraph sits not under the TLS, Let's Encrypt, or "Kubernetes and Let's Encrypt" headings, but in the documentation that introduces IngressRoute.
    • This is also why Traefik's rate limiting is hard to reason about: you have to know how many Traefik instances are running in the cluster and estimate the effective limit probabilistically.

Why would we pay for a solution this awkward?

Caddy✅️#

Finally, it has to be Caddy.

It has two modes of automatic certificate issuance:

  1. The first mode is the common one, where the domain name is specified explicitly (a minimal Caddyfile for this mode is sketched after the list):
    1. The domain name must already point at the server.
    2. Write the domain name to serve into the configuration file and start Caddy.
    3. On startup, Caddy immediately issues the certificate, and users can access the site right away.
  2. The second mode is an on-demand mode:
    1. Once it is enabled, Caddy issues a certificate for whatever domain name comes in.
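
For comparison, the first mode is just the everyday Caddyfile, something like this minimal sketch (the host name and upstream port are placeholders):

jeff.xlog.app {
        # Caddy obtains and renews a certificate for this exact host at startup
        reverse_proxy 127.0.0.1:3000
}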

We need the second mode, which has the following drawbacks:

  1. The first visit will be slower (because it needs to issue a certificate).
  2. There is a security risk; it can easily become an attack vector.
    • Therefore, Caddy requires that production deployments configure ask, an HTTP endpoint it queries to decide whether a certificate should be issued.
  3. By default, Caddy (currently v2.5.2) stores its data in a per-instance persistent data directory, which means that with the default storage it, too, has to be a single instance.
    • Fortunately, there are third-party storage plugins that allow multiple Caddy instances to read and write to the same storage.

To solve problem 2, we need to write a simple HTTP service that verifies the domain name (whether it has been bound, resolved, and so on; a sketch appears at the end of this section); to solve problem 3, we have to compile Caddy ourselves:

# This plugin, caddy-tlsredis, will store data in Redis, allowing multiple Caddy instances to share the same storage for high availability
# You can directly use the pre-built image here: https://github.com/sljeff/caddy-tlsredis-docker
xcaddy build --with github.com/gamalan/caddy-tlsredis

Then our Caddyfile will look like this:

{
        storage redis {
                # Switch the storage to Redis; this block can be left empty and overridden by environment variables
                # See https://github.com/gamalan/caddy-tlsredis for details
        }

        on_demand_tls {
                # This is our verification service, which can be deployed together with each copy of Caddy
                ask http://localhost:5000/
        }
}

:80, :443 {
        tls {
                # Automatically issue certificates on demand
                on_demand
        }

        # This is the actual upstream service
        reverse_proxy 127.0.0.1:3000
}

This way, a certificate is issued for each domain name that comes; the drawback is that the first visit will be slower.
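
The verification service behind ask can be tiny: Caddy calls the configured URL with the candidate hostname in the domain query parameter and proceeds only if it gets a 2xx response. A minimal Go sketch, where the in-memory map stands in for a lookup against xlog's database of bound domains:

package main

import (
	"log"
	"net/http"
)

// Stand-in for a lookup against the database of bound domains.
var bound = map[string]bool{
	"x.jeff.wtf": true,
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Caddy calls the ask URL with the candidate hostname in ?domain=
		domain := r.URL.Query().Get("domain")
		if bound[domain] {
			w.WriteHeader(http.StatusOK) // 2xx: go ahead and issue
			return
		}
		w.WriteHeader(http.StatusForbidden) // anything else: refuse
	})
	log.Fatal(http.ListenAndServe(":5000", nil))
}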

One last small optimization#

Let's review our requirements, which include:

  • Users can generate custom subdomains such as jeff.xlog.app

This means that every user gets a subdomain. If we issue a separate certificate for each subdomain, that seems a bit wasteful.

A more reasonable approach is to issue a wildcard certificate for *.xlog.app. However, a wildcard certificate requires a DNS challenge (to prove ownership of the whole domain), so we need to update the Caddy configuration:

# Need to add the DNS provider plugin for xlog.app to the compilation, so that we can issue and renew wildcard certificates
xcaddy build --with github.com/gamalan/caddy-tlsredis --with github.com/caddy-dns/cloudflare
# Then add the following block to the Caddyfile; hosts it does not match still fall through to the :80, :443 block shown earlier

xlog.app, *.xlog.app {
        tls {
                dns cloudflare {env.CF_API_TOKEN}
        }

        reverse_proxy 127.0.0.1:3000
}

Finally, the work is done.
