K8s-autoscaler version incompatibility on berkeley-geojupyter cluster
| Field | Value |
|---|---|
| Impact Time | Mar 31 at 15:10 to Mar 31 at 17:00 |
| Duration | 1h 49m 10s |
Overview
The cluster was running an older k8s version than the majority of our clusters. When the cluster autoscaler version was bumped the day before the incident, the autoscaler and the cluster's k8s version became incompatible.
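As an illustration of the kind of mismatch described here, the following minimal sketch flags a cluster whose k8s minor version differs from the autoscaler's. It assumes the rule of thumb that the two minor versions should match, which this report does not state. The function names and the autoscaler version in the usage lines are hypothetical.

```python
def minor_version(version: str) -> tuple[int, int]:
    """Parse a version such as '1.33.2' or 'v1.33' into (major, minor)."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def is_compatible(k8s_version: str, autoscaler_version: str) -> bool:
    """Assumed rule: the autoscaler's minor version must equal the cluster's."""
    return minor_version(k8s_version) == minor_version(autoscaler_version)


# Hypothetical autoscaler version, used only to exercise the check.
print(is_compatible("1.33", "1.34.0"))
print(is_compatible("1.34", "1.34.0"))
```

A check like this could run before an autoscaler bump is rolled out, so that a cluster on an older k8s version is flagged ahead of time.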
What Happened
New nodes were not being spawned because the cluster autoscaler wasn’t triggering scale-up events.
Resolution
Upgrading the cluster's k8s version from 1.33 to 1.34 and downgrading the cluster autoscaler by one patch version fixed it.
Where We Got Lucky
The problem was surfaced by an automated health check run, not by actual users being unable to spawn servers.
What Went Well
Context about the cluster autoscaler version bump was still fresh, so we were on the correct path from the beginning.
What Didn’t Go So Well
When we designed the upgrade batches, we only took into account clusters running 1.32, so this 1.33 cluster was missed.
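The gap described above can be checked mechanically. The sketch below is a hypothetical helper that lists clusters whose k8s minor version is not targeted by any upgrade batch. The cluster names, versions, and input shapes are assumptions made for illustration and are not taken from our actual configuration.

```python
def minor_of(version: str) -> str:
    """Reduce a version such as '1.33.2' to its 'major.minor' form, '1.33'."""
    return ".".join(version.lstrip("v").split(".")[:2])


def missed_clusters(cluster_versions: dict[str, str], batch_minors: set[str]) -> list[str]:
    """Return the names of clusters whose minor version no batch covers."""
    return sorted(
        name
        for name, version in cluster_versions.items()
        if minor_of(version) not in batch_minors
    )


# Hypothetical inventory: batches were planned only for clusters on 1.32.
clusters = {"cluster-a": "1.32.4", "cluster-b": "1.32.1", "cluster-c": "1.33.0"}
print(missed_clusters(clusters, {"1.32"}))  # prints ['cluster-c']
```

Running a check of this kind when the batches are designed would list any cluster that falls outside the planned versions.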
Action Items
upgrades.yaml should help standardize versions across clusters