Unable to provision new TLS certificates across 2i2c hubs

Field | Value
Impact Time | Mar 31 at 16:00 to Mar 31 at 17:16
Duration | 1h 15m 39s

Overview

Issuance of TLS certificates was broken on at least two clusters following a recent migration to nginx-ingress. The cause was a change in behaviour when multiple ingresses share the same host.

What Happened

An attempt to launch a new hub failed because its ingress could not obtain new TLS certificates. The same failure had recently been seen on another hub, so an incident was declared.

Resolution

We annotated the Ingress objects that need TLS so that cert-manager is permitted to mutate them via edit-in-place.
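
As a rough illustration of the annotation step, the sketch below adds an edit-in-place annotation to an Ingress manifest held as a Python dictionary. The report does not name the annotation, so the key `acme.cert-manager.io/http01-edit-in-place` (taken from the cert-manager documentation) is an assumption about what was applied. The function name `allow_edit_in_place` and the example manifest are hypothetical.

```python
import copy

# Assumption: the report does not name the annotation. This is the
# edit-in-place annotation from the cert-manager documentation.
EDIT_IN_PLACE = "acme.cert-manager.io/http01-edit-in-place"


def allow_edit_in_place(ingress: dict) -> dict:
    """Return a copy of an Ingress manifest annotated for edit-in-place."""
    patched = copy.deepcopy(ingress)
    # Only ingresses that request TLS need the annotation.
    if not patched.get("spec", {}).get("tls"):
        return patched
    metadata = patched.setdefault("metadata", {})
    annotations = metadata.get("annotations") or {}
    # Kubernetes annotation values are strings, hence "true" rather than True.
    annotations[EDIT_IN_PLACE] = "true"
    metadata["annotations"] = annotations
    return patched


# Hypothetical hub ingress, for illustration only.
hub_ingress = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "Ingress",
    "metadata": {"name": "jupyterhub"},
    "spec": {
        "tls": [{"hosts": ["hub.example.org"], "secretName": "hub-tls"}],
        "rules": [{"host": "hub.example.org"}],
    },
}

patched = allow_edit_in_place(hub_ingress)
print(patched["metadata"]["annotations"])
```

Ingresses without a `tls` section are returned unchanged, matching the statement that only ingresses needing TLS were annotated.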

Where We Got Lucky

We had already learned about this possible cause while testing an internal cluster.

What Went Well

Fixing the cert-manager configuration was quick!

Action Items

Timeline

Time | Event
4:04 PM | Engineer adds context for manually triggering an incident. Angus Hollands (#jupyterhub_u_inc_2060): Yesterday, I recreated the support deployment on 2i2c-aws-us. There, I saw that TLS certificate orders were stuck pending. I saw 301s in the nginx-ingress logs, which I suspected were http → https upgrades. My first query was whether this behaviour was a new default with e.g. SSL redirect. My investigation of the various documentation for nginx-ingress and ingress-nginx suggested that this is not the case. Then, I looked deeper at the mechanics. To satisfy the ACME issuer flow, cert-manager has to create a temporary ingress. With further investigation, I saw evidence that the “proper” ingress (e.g. the Hub ingress) was taking priority over the transient cert-manager ingress, likely leading to the 301 we see.
4:05 PM | Angus Hollands (#jupyterhub_u_inc_2060): I am now seeing this in another cluster, awi-ciroh, where I’ve just created a new deployment (redeploying a decommissioned workshop hub) that has the need for new TLS certificates.
4:06 PM | Angus Hollands (#jupyterhub_u_inc_2060): I suspect this will happen across many clusters. I note that we recently redeployed the JS2 cluster with this new ingress, and I did _not_ see evidence of this. But I’ve opened an incident for now to cover at least 2i2c-aws-us and awi-ciroh.
4:43 PM | Engineer explores using the master-minion ingress model. Angus Hollands (#jupyterhub_u_inc_2060): I tried using cert-manager annotations for editable ingresses last night. I think that didn’t quite work either. The alternative is master-minion ingresses, and given that both approaches require us to annotate the primary ingresses, I think we should use the latter in preference to mutating ingress objects.
4:53 PM | Engineer reports that master-minion is more invasive than desired. Tries editable ingresses instead. Angus Hollands (#jupyterhub_u_inc_2060): OK, that was a no-go. Although this provisioned the certs properly, it looks like the master-minion ingress mechanism requires that the paths are set in the minion ingress. I.e., we’d need to actually partition our ingresses into two. I don’t think that’s very easy to do in an agnostic way, so I’ll try editable ingress objects.
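
To make the partitioning described in the 4:53 PM entry concrete, the sketch below splits a single Ingress manifest into a master and a minion. This approach was tried and rejected during the incident, and the report shows no manifests, so this only illustrates why it was considered invasive. It assumes the `nginx.org/mergeable-ingress-type` annotation from the NGINX Ingress Controller documentation, where the master holds the host and TLS settings and the minion holds the paths. The function name, the `-master` and `-minion` name suffixes, and the example manifest are hypothetical.

```python
import copy
from typing import Tuple

# Assumption: mergeable-ingress annotation from the NGINX Ingress Controller docs.
MERGEABLE_TYPE = "nginx.org/mergeable-ingress-type"


def _annotate(manifest: dict, value: str) -> None:
    """Set the mergeable-ingress type on a manifest in place."""
    metadata = manifest["metadata"]
    annotations = metadata.get("annotations") or {}
    annotations[MERGEABLE_TYPE] = value
    metadata["annotations"] = annotations


def split_master_minion(ingress: dict) -> Tuple[dict, dict]:
    """Partition one Ingress manifest into a (master, minion) pair."""
    name = ingress["metadata"]["name"]

    # The master keeps hosts and TLS settings but must not define paths.
    master = copy.deepcopy(ingress)
    master["metadata"]["name"] = f"{name}-master"
    _annotate(master, "master")
    master["spec"]["rules"] = [
        {key: value for key, value in rule.items() if key != "http"}
        for rule in master["spec"]["rules"]
    ]

    # The minion carries the paths but must not define TLS.
    minion = copy.deepcopy(ingress)
    minion["metadata"]["name"] = f"{name}-minion"
    _annotate(minion, "minion")
    minion["spec"].pop("tls", None)

    return master, minion


# Hypothetical hub ingress, for illustration only.
hub_ingress = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "Ingress",
    "metadata": {"name": "jupyterhub"},
    "spec": {
        "tls": [{"hosts": ["hub.example.org"], "secretName": "hub-tls"}],
        "rules": [
            {
                "host": "hub.example.org",
                "http": {
                    "paths": [
                        {
                            "path": "/",
                            "pathType": "Prefix",
                            "backend": {
                                "service": {"name": "proxy-public", "port": {"name": "http"}}
                            },
                        }
                    ]
                },
            }
        ],
    },
}

master, minion = split_master_minion(hub_ingress)
```

Every ingress that shares a host would need this split, which is consistent with the engineer's remark that it is hard to do in an agnostic way.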