Troubleshooting the cert-manager service for Kubernetes
This is a blog post on how to troubleshoot cert-manager, the Kubernetes add-on that automates and manages TLS certificates.
What is cert-manager?
Before we can start troubleshooting issues, we first need to discuss the software that we’re using. Cert-manager is the successor to the kube-lego project, which handles provisioning of TLS certificates for Kubernetes. Basically, it takes away the manual work of requesting a cert, configuring the cert, and installing the cert. Instead of working directly with Nginx, we can describe what we want configured. Then, the rest is taken care of automatically with ingress resources and the ingress controller. Cert-manager adds new Kubernetes resource types that can be used to configure certificates: Certificates and Issuers. There are two kinds of issuers, ClusterIssuer and Issuer, which have different scopes. A ClusterIssuer manages certificates for the entire cluster. However, in this example we are using an Issuer, which controls only a single namespace.
For a more detailed overview of cert-manager, check out their GitHub project page: https://github.com/jetstack/cert-manager
Where are the examples running?
For this blog’s troubleshooting demo, I’m using the IBM Cloud Kubernetes Service (IKS), IBM’s managed Kubernetes offering. It works well for testing deployments and new features or projects that extend K8s. For the purposes of this blog, I am using a paid-tier version of IKS simply because the free tier doesn’t allow us to use ports 80/443. The free tier is also limiting in that we can’t use load balancers. (For full details and pricing plans, visit https://cloud.ibm.com/kubernetes/catalog/cluster.)
The new cert-manager project supports more ingress controllers than Kube-Lego, which was limited in this respect. The biggest difference that I can see between Kube-Lego and cert-manager is how the ingress resources are configured. In Kube-Lego there would be at least two ingress resources per domain, which would break certain ingress controllers because they did not expect more than one resource per DNS record.
The application was deployed by using HTTP validation. Example services and applications are available if you want to try this out yourself. The troubleshooting steps below assume that the setup steps in the documentation have already been followed.
Most of the common issues that we see with Kubernetes seem to come from slow DNS resolution. If you are configuring an A record for your domain around the same time as deployment, then you might run into issues when Let’s Encrypt attempts to verify the domain. If the domain is not resolving yet, then we can assume that the challenge file is not reachable. Verify that you can resolve the DNS record before you attempt to set this up.
Great, but what does that mean and why do I care? We need the resolution to work because Let’s Encrypt is going to issue a challenge to make sure that the domain actually exists and that whoever is requesting the certificate controls it. Basically, there’s a challenge file that needs to exist in a specific location and be served on port 80. If it exists, then Let’s Encrypt will proceed. If the DNS is not configured correctly, or the record hasn’t propagated yet, then Let’s Encrypt is unable to resolve the domain and will also fail to find the challenge file.
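To make the "specific location" concrete, here is a minimal sketch that builds the URL where an HTTP challenge file would be served. The `/.well-known/acme-challenge/` path comes from the ACME specification rather than from this post, and the `challenge_url` helper and the `TOKEN` value are hypothetical names used only for illustration.

```shell
# Build the URL where an ACME HTTP-01 challenge file is expected to be served.
# Assumption: the standard ACME path /.well-known/acme-challenge/<token>.
# $1 = domain, $2 = challenge token (a placeholder here, not a real token).
challenge_url() {
    domain=$1
    token=$2
    printf 'http://%s/.well-known/acme-challenge/%s\n' "$domain" "$token"
}

# Usage (TOKEN is hypothetical; it would come from the pending challenge):
#   curl -i "$(challenge_url lp.mpetason.com TOKEN)"
```

If the curl request cannot resolve the host or does not return the file, Let’s Encrypt will most likely hit the same problem during validation.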
Since we are using IKS, we’ll already be set up with an ingress controller and ingress resource by default. When setting up DNS, we want to use the IP address that is associated with the ingress controller and load balancer service that was configured. Where can we find this valuable information? It’s going to be in the kube-system namespace.
kubectl get svc -n kube-system | grep -i "public"
We see output similar to:
public-crf3df42c3c8a142c8a3e0ee73ed4e58e2-alb1 LoadBalancer 172.21.39.18 22.214.171.124 80:31337/TCP,443:31615/TCP 106d
We need to pull out the public IP (the external IP, which is the fourth column in the output above) and use that to set up an A record for the hostname we are using. The great part about having an ingress controller that’s already configured on the cluster is that we can manage multiple domains. In this demo, I set up multiple domains to point to the same IP address and then used ingress resources, cert-manager, and the ingress controller to route traffic based on the hostname. When the DNS record finally resolves, you can move along and attempt a deployment.
lp.mpetason.com has address 188.8.131.52
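The check above can be scripted. The sketch below assumes the output formats shown in this post: the external IP is the fourth column of `kubectl get svc`, and `host` prints lines of the form "NAME has address IP". The function names are my own and are not part of kubectl or cert-manager.

```shell
# Print the external IP (4th column) of `kubectl get svc` output read on stdin,
# for services whose name contains "public".
public_external_ips() {
    awk 'tolower($1) ~ /public/ { print $4 }'
}

# Print the addresses from `host` output read on stdin
# (lines of the form "NAME has address IP").
resolved_addresses() {
    awk '/ has address / { print $NF }'
}

# Succeed if the expected IP ($1) is one of the addresses on stdin (one per line).
contains_ip() {
    grep -qxF -- "$1"
}

# Usage:
#   alb_ip=$(kubectl get svc -n kube-system | public_external_ips | head -n 1)
#   host lp.mpetason.com | resolved_addresses | contains_ip "$alb_ip" && echo "DNS is ready"
```

Running this before a deployment is a cheap way to confirm that the A record already points at the load balancer.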
First, we need to see which ingress resources were created with the command below.
kubectl get ingress
Note: If we are checking in a different namespace, then we need to append -n NAMESPACE_NAME.
Find the name of the resource that was recently created and then describe it.
kubectl describe ingress INGRESS_NAME
Check for valuable information in Events. Normally, we’ll see something like “failed to apply ingress resource” in the message field, and if we check the “Reason” field we’ll actually get a useful error message. This is great for sysadmins and developers since it means that they get useful information without having to look at log files on an actual server.
Events:
  Type     Reason             Age  From                                                             Message
  ----     ------             ---- ----                                                             -------
  Warning  TLSSecretNotFound  3s   public-cr0ba8157fd1a6454ca7ba3125b9b44ff6-alb1-5895555f68-bl976  Failed to apply ingress resource.
  Warning  TLSSecretNotFound  3s   public-cr0ba8157fd1a6454ca7ba3125b9b44ff6-alb1-5895555f68-25nhq  Failed to apply ingress resource.
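When an ingress has many events, it can help to summarize the warnings by reason. This is a small sketch that assumes the event layout shown above, with one event per line, the type in the first column, and the reason in the second. The `warning_reasons` name is hypothetical.

```shell
# Count Warning events by reason from `kubectl describe` output read on stdin.
# Assumes one event per line with Type in column 1 and Reason in column 2.
warning_reasons() {
    awk '$1 == "Warning" { count[$2]++ } END { for (r in count) print count[r], r }'
}

# Usage:
#   kubectl describe ingress INGRESS_NAME | warning_reasons
```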
After we figure out that the TLS Secret might be missing, we need to see what the expected resource is named. We name the secret in our ingress resource, so let’s check there first.
kubectl get ingress INGRESS_NAME -o yaml
We use the output option to specify yaml so we can read the configuration file. You can also use describe instead of “get” with “-o yaml”; the same information is just shown in a different format. In our case, our secret name is lp-mpetason-com-tls1.
tls:
- hosts:
  - lp.mpetason.com
  secretName: lp-mpetason-com-tls1
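If we only want the secret name, we can pull it out of the YAML instead of reading the whole file. The sketch below assumes the `secretName:` key appears on its own line as in the snippet above. The `tls_secret_names` helper is a hypothetical name.

```shell
# Print every secretName value found in ingress YAML read on stdin.
# Scans each line for the "secretName:" key and prints the word after it.
tls_secret_names() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "secretName:") print $(i + 1) }'
}

# Usage:
#   kubectl get ingress INGRESS_NAME -o yaml | tls_secret_names
# An alternative that lets kubectl do the extraction:
#   kubectl get ingress INGRESS_NAME -o jsonpath='{.spec.tls[*].secretName}'
```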
Check the configured secrets to see whether the secret exists, or whether it has a different name for some reason. If we are having trouble with our deployment, then it may not have been created. For the secret to be created, the Issuer and the Cert both need to finish getting configured, so we check the Issuer first.
kubectl get issuer
kubectl describe issuer ISSUER_NAME
We should be able to find the error message in Events. Most of the error messages about the Issuer are related to the ACME endpoint. Other issues can come up, but I haven’t seen them often enough to help troubleshoot them yet. For the most part, you can try to resolve the issues you see in the Events or Status sections.
If our issuer is working without issues, we will see something like the following:
Status:
  Acme:
    Uri:  https://acme-v01.api.letsencrypt.org/acme/reg/<NUMBERS>
  Conditions:
    Last Transition Time:  2018-06-14T18:12:24Z
    Message:               The ACME account was registered with the ACME server
    Reason:                ACMEAccountRegistered
    Status:                True
    Type:                  Ready
Events:                    <none>
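To check the Ready condition without reading the whole output, we could parse it. This sketch assumes the layout shown above, where the condition’s `Status:` line comes before its `Type:` line. The `ready_status` helper is hypothetical.

```shell
# Print the Status value of the Ready condition from `kubectl describe issuer`
# output read on stdin. Assumes "Status: <value>" precedes "Type: Ready"
# within the condition, as in the sample output in this post.
ready_status() {
    awk '
        $1 == "Status:" && NF == 2 { status = $2 }             # remember the latest condition status
        $1 == "Type:" && $2 == "Ready" { print status; exit }  # report it for the Ready condition
    '
}

# Usage:
#   kubectl describe issuer ISSUER_NAME | ready_status
```

A value of True matches the healthy output above. Anything else means the Message and Reason fields are worth reading.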
As of this post, we should probably use the acme-v02 endpoint instead. If you run into errors about the ACME version, go ahead and change the endpoint in your Issuer.
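One way to make that change is to rewrite the endpoint hostname in the Issuer manifest. This is only a sketch: it assumes the manifest references the production Let’s Encrypt v01 hostname, and both the `use_acme_v02` helper and the `issuer.yaml` file name are hypothetical.

```shell
# Rewrite the Let's Encrypt v01 hostname to v02 in a manifest read on stdin.
# Only the production hostname is handled; staging endpoints are left untouched.
use_acme_v02() {
    sed 's#acme-v01\.api\.letsencrypt\.org#acme-v02.api.letsencrypt.org#g'
}

# Usage (issuer.yaml is a hypothetical file name):
#   use_acme_v02 < issuer.yaml | kubectl apply -f -
```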
Next, we need to take a look at the cert and see what the status is.
kubectl get cert
kubectl describe cert CERT_NAME
Here we can run into a few other issues, such as rate limiting, if we requested a lot of certificates in a short period.
Normally, if the issuer is working and DNS is resolving, we should be able to get a cert. After we confirm that we have a cert by describing it, we’ll need to take a look at the secrets to verify that the secret was created.
kubectl get secret
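To check for one specific secret in that list, a small filter is enough. The sketch below assumes the secret name is in the first column of the output. The `has_secret` helper is a hypothetical name.

```shell
# Succeed if a secret named $1 appears in `kubectl get secret` output read on stdin.
# Assumes the secret name is the first column of each line.
has_secret() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Usage:
#   kubectl get secret | has_secret lp-mpetason-com-tls1 && echo "secret exists"
```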
If the secret exists we can go back over to the ingress resource to see if the ingress controller was able to load our cert.
  Warning  TLSSecretNotFound  26m  public-cr0ba8157fd1a6454ca7ba3125b9b44ff6-alb1-5895555f68-bl976  Failed to apply ingress resource.
  Warning  TLSSecretNotFound  26m  public-cr0ba8157fd1a6454ca7ba3125b9b44ff6-alb1-5895555f68-25nhq  Failed to apply ingress resource.
  Normal   Success            11s  public-cr0ba8157fd1a6454ca7ba3125b9b44ff6-alb1-5895555f68-25nhq  Successfully applied ingress resource.
  Normal   Success            11s  public-cr0ba8157fd1a6454ca7ba3125b9b44ff6-alb1-5895555f68-bl976  Successfully applied ingress resource.
Success! Now we can hit the site and verify that