Project Pythia BinderHub cluster unable to build images

Field        Value
Impact Time  Mar 3 at 18:01 to Mar 5 at 22:21
Duration     2d 4h 19m 45s

Overview

The Project Pythia BinderHub on JetStream2 was unable to build new images. The cluster was subsequently recreated.

What Happened

Certificates expired on the OpenStack cluster, rendering the BinderHub unable to schedule new build pods.

The recommended mitigation for expired certificates, upgrading the cluster's Kubernetes template, was attempted. The upgrade failed, leaving the Kubernetes cluster entirely inoperable.
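Expired certificates of this kind can, in principle, be caught before they cause an outage. The following is a minimal sketch, not the tooling used in this incident: it assumes the certificate's expiry string has already been obtained in the `notAfter` format (for example from `openssl x509 -noout -enddate`), and the helper names `days_until_expiry` and `needs_renewal`, the 30-day threshold, and the example date are all hypothetical.

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the number of days until `not_after`; negative if already expired.

    `not_after` uses the certificate time format, e.g. "Jan  5 09:34:43 2018 GMT".
    """
    # ssl.cert_time_to_seconds converts the notAfter string to seconds since the epoch.
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / SECONDS_PER_DAY


def needs_renewal(not_after: str, warn_days: float = 30.0,
                  now: Optional[float] = None) -> bool:
    """True if the certificate expires within `warn_days` (hypothetical threshold)."""
    return days_until_expiry(not_after, now) <= warn_days


if __name__ == "__main__":
    # Hypothetical expiry date, for illustration only.
    example = "Jan  5 09:34:43 2018 GMT"
    print(f"days remaining: {days_until_expiry(example):.1f}")
    print(f"needs renewal: {needs_renewal(example)}")
```

Run on a schedule, a check like this could warn ahead of the expiry date, leaving time to plan the upgrade rather than attempting it during an outage.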

Resolution

The cluster was recreated and the hubs redeployed to re-establish working credentials and a working BinderHub.

What Went Well

What Didn’t Go So Well

Action Items

Timeline

Mar 5, 2026

Time       Event
3:05 PM    First cluster recreation attempt; engineer waits.
           Angus Hollands, #server_start_inc_1924: "Create still in progress."
7:02 PM    Engineer attempts another cluster creation after determining the proper CAPI version.
           Angus Hollands, #server_start_inc_1924: "OK, trying a new TF deployment. I’m expecting this to take several hours based upon the docs."
7:05 PM    Engineer reports failure to upgrade in attempt to fix credentials.
           Angus updated the status of the incident through the website. Message: "The authentication certificates for this cluster expired. Once we caught this, we went to upgrade the cluster which is the suggested mechanism for renewing them. This upgrade failed, requiring us to re-create the cluster. This is a slow process, and we’ll update once there’s more news."
           Alert: [FIRING:1] At least two servers failed to start in the last 30m projectpythia-binder binderhub 3 (kubeconfig immediate action needed)
10:21 PM   Engineer reports success of cluster recreation.
           Note added by Angus. Resolution Note: "We recreated the cluster, and have confirmed that we can spawn Binder instances that trigger scale up."
(no time)  Engineer in UK timezone starts to investigate error.
           Acknowledged by Angus through the website.
5:19 PM    Engineer cannot access Kubernetes API server.
           Angus Hollands, #server_start_inc_1924: "I cannot connect to the JS2 cluster due to a certificate error, and the same is true of spawning jobs in the cluster itself."
5:39 PM    Engineer observes previous update.
           Angus Hollands, #server_start_inc_1924: "Looks like it was updated yesterday and is not healthy:"
10:21 PM   [FIRING:1] At least two servers failed to start in the last 30m projectpythia-binder binderhub 3 (kubeconfig immediate action needed)
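The alert title in the timeline ("At least two servers failed to start in the last 30m") describes a sliding-window threshold. The sketch below only illustrates that condition in plain Python. The real alert is presumably defined in the monitoring system, whose rule is not shown in this report, so the function name `alert_firing` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable


def alert_firing(failure_times: Iterable[datetime], now: datetime,
                 window: timedelta = timedelta(minutes=30),
                 threshold: int = 2) -> bool:
    """Return True if at least `threshold` failures fall within `window` before `now`."""
    cutoff = now - window
    # Count only failures inside the half-open interval (cutoff, now].
    recent = [t for t in failure_times if cutoff < t <= now]
    return len(recent) >= threshold


if __name__ == "__main__":
    # Hypothetical failure timestamps, for illustration only.
    now = datetime(2026, 1, 1, 12, 0)
    failures = [now - timedelta(minutes=45), now - timedelta(minutes=20),
                now - timedelta(minutes=5)]
    print(alert_firing(failures, now))
```

A condition like this fires on symptoms (failed server starts) rather than on the cause, so it reports an expired certificate only once users are already affected.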