Project Pythia BinderHub cluster unable to build images
| Field | Value |
|---|---|
| Impact Time | Mar 3 at 18:01 to Mar 5 at 22:21 |
| Duration | 2d 4h 19m 45s |
Overview
The Project Pythia BinderHub on JetStream2 was unable to build new images and was subsequently recreated.
What Happened
Certificates expired on the OpenStack cluster, leaving the BinderHub unable to schedule new build pods.
An attempt was made to upgrade the Kubernetes template, which is the recommended way to renew expired certificates. The upgrade failed and left the Kubernetes cluster entirely inoperable.
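An expiry of this kind can be caught ahead of time by comparing a certificate's `notAfter` timestamp against a warning window. The following is a minimal sketch, not the tooling used in this incident: the function names and the 30-day threshold are assumptions, and the timestamp is expected in the text form that `openssl x509 -enddate -noout` prints after `notAfter=`.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and a certificate's notAfter.

    `not_after` uses the OpenSSL text form, e.g. "Jan 31 00:00:00 2026 GMT".
    `now` must be a timezone-aware datetime.
    """
    # ssl.cert_time_to_seconds converts the OpenSSL text form to epoch seconds.
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime, warn_days: float = 30.0) -> bool:
    """True when the certificate expires within `warn_days` (assumed threshold)."""
    return days_until_expiry(not_after, now) <= warn_days
```

Run on a schedule against each cluster certificate, a check like this would raise a warning while an upgrade can still be planned rather than after the API server becomes unreachable.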
Resolution
The cluster was recreated, and the hubs redeployed, in order to re-establish working credentials and a working BinderHub.
What Went Well
We had context from Julius (Jetstream2) about the soon-to-expire certificates, so we didn’t have to work that part out ourselves.
What Didn’t Go So Well
We did not upgrade the cluster early enough to prevent the CAPI certificates from expiring.
The dependency between the Helm chart version for capi-helmcharts and the Kubernetes template was not identified before the upgrade.
The engineering team took longer than usual to respond to this incident.
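The missed dependency between chart version and Kubernetes template could be made explicit with a pre-upgrade check. The sketch below is illustrative only: the compatibility table, its chart labels, and its version ranges are invented placeholders, not the real capi-helmcharts support matrix, which would need to be taken from upstream documentation.

```python
from typing import Dict, Tuple

Version = Tuple[int, int]

# HYPOTHETICAL table: chart label -> (oldest, newest) supported Kubernetes
# major.minor versions. These values are placeholders, not real data.
COMPAT: Dict[str, Tuple[Version, Version]] = {
    "chart-old": ((1, 27), (1, 29)),
    "chart-new": ((1, 30), (1, 32)),
}


def parse_minor(version: str) -> Version:
    """Parse 'v1.28.3' or '1.28' into (1, 28), ignoring the patch level."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def is_compatible(
    chart: str, k8s_version: str, table: Dict[str, Tuple[Version, Version]]
) -> bool:
    """True only if `chart` is listed and supports the template's version."""
    if chart not in table:
        return False  # unknown charts are treated as incompatible
    oldest, newest = table[chart]
    return oldest <= parse_minor(k8s_version) <= newest
```

Treating an unknown chart as incompatible is a deliberate choice here: it forces the table to be updated before an upgrade is attempted instead of letting an unverified combination through.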
Action Items
Timeline
Mar 5, 2026
| Time | Event |
|---|---|
| 3:05 PM | First cluster recreation attempt; engineer waits. Angus Hollands (#server_start_inc_1924): "Create still in progress." |
| 7:02 PM | Engineer attempts another cluster creation after determining the proper CAPI version. Angus Hollands (#server_start_inc_1924): "OK, trying a new TF deployment. I’m expecting this to take several hours based upon the docs." |
| 7:05 PM | Engineer reports failure to upgrade in an attempt to fix credentials. Angus updated the status of the incident through the website. Message: "The authentication certificates for this cluster expired. Once we caught this, we went to upgrade the cluster which is the suggested mechanism for renewing them. This upgrade failed, requiring us to re-create the cluster. This is a slow process, and we’ll update once there’s more news." Alert: [FIRING:1] At least two servers failed to start in the last 30m projectpythia-binder binderhub 3 (kubeconfig immediate action needed) |
| 10:21 PM | Engineer reports success of cluster recreation. Note added by Angus. Resolution Note: "We recreated the cluster, and have confirmed that we can spawn Binder instances that trigger scale up." Alert: [FIRING:1] At least two servers failed to start in the last 30m projectpythia-binder binderhub 3 (kubeconfig immediate action needed) |
| | Engineer in UK timezone starts to investigate error. Acknowledged by Angus through the website. Alert: [FIRING:1] At least two servers failed to start in the last 30m projectpythia-binder binderhub 3 (kubeconfig immediate action needed) |
| 5:19 PM | Engineer cannot access Kubernetes API server. Angus Hollands (#server_start_inc_1924): "I cannot connect to the JS2 cluster due to a certificate error, and the same is true of spawning jobs in the cluster itself." |
| 5:39 PM | Engineer observes previous update. Angus Hollands (#server_start_inc_1924): "Looks like it was updated yesterday and is not healthy:" 10:21 PM [FIRING:1] At least two servers failed to start in the last 30m projectpythia-binder binderhub 3 (kubeconfig immediate action needed) |