[opensci big-binder] Dask nodegroups are not available
| Field | Value |
|---|---|
| Impact Time | Jan 12 at 15:15 to Jan 12 at 17:06 |
| Duration | 1h 50m 10s |
What Happened
During a routine k8s upgrade, we reached the maximum number of managed node groups per cluster, so the dask nodegroups didn’t get created. The CLI gave no obvious indication of this when we created the nodegroups, so we were unaware of the problem.
This meant that no dask sessions could be started.
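A pre-flight check before creating nodegroups could have surfaced the quota problem that the CLI did not. The following is a minimal sketch, not the tooling we actually run: the function names are hypothetical, and `count_managed_nodegroups` assumes `boto3` is installed and AWS credentials are configured.

```python
def nodegroup_headroom(existing: int, planned: int, quota: int) -> int:
    """Return how many nodegroup slots remain after creating `planned` more.

    A negative result means the creation would exceed the quota.
    """
    if min(existing, planned, quota) < 0:
        raise ValueError("counts and quota must be non-negative")
    return quota - (existing + planned)


def count_managed_nodegroups(cluster_name: str, region: str) -> int:
    """Count managed nodegroups through the EKS API.

    Assumption: boto3 is available and credentials are configured.
    """
    import boto3  # imported lazily so the pure check above has no dependencies

    eks = boto3.client("eks", region_name=region)
    paginator = eks.get_paginator("list_nodegroups")
    return sum(
        len(page["nodegroups"])
        for page in paginator.paginate(clusterName=cluster_name)
    )
```

With made-up numbers, `nodegroup_headroom(existing=28, planned=4, quota=30)` is negative, which an upgrade script could treat as a reason to stop before creating anything.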
Resolution
Deleting the nodegroups also deleted their CloudFormation stacks, which allowed us to recreate the dask nodegroups.
We also requested an increase for the managed node groups per cluster quota from 30 to 100.
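Since nothing told us the nodegroups were missing, a post-creation check that compares the nodegroups we expect against what the cluster reports might catch this earlier. This is a sketch under assumptions: the function name is hypothetical, the input is assumed to be the JSON list printed by `eksctl get nodegroups` with JSON output enabled, and the `Name` key is an assumption about that output that should be verified against the eksctl version in use.

```python
import json


def missing_nodegroups(expected, eksctl_json, name_key="Name"):
    """Return the expected nodegroup names absent from eksctl's JSON output.

    `eksctl_json` is the raw text of the listing; empty text is treated
    as "no nodegroups". `name_key` is an assumed field name.
    """
    items = json.loads(eksctl_json) if eksctl_json.strip() else []
    actual = {item[name_key] for item in items}
    return sorted(set(expected) - actual)
```

An upgrade script could fail loudly whenever this returns a non-empty list, instead of relying on someone reading the CLI output.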
Where We Got Lucky
Deleting and re-creating the nodegroups was a simple and handy fix that we could deliver quickly.
What Went Well
We had engineers on call when this happened, so we could investigate the problem pretty quickly.
What Didn’t Go So Well
A similar outage had happened a month before, but we didn’t get to the root cause of it and thought it was an isolated event. Turns out it wasn’t.
We did not get alerts for pods stuck in a Pending state for too long, so we only found the issue when two consecutive server startups failed.
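The missing alert could be approximated by a periodic check over the pod list. The sketch below is illustrative only: the function name and the 10-minute threshold are assumptions, the input is assumed to be the `items` list from `kubectl get pods -o json`, and pod creation time is used as a rough stand-in for how long the pod has been Pending.

```python
from datetime import datetime, timedelta, timezone


def pods_pending_too_long(pods, now, max_pending=timedelta(minutes=10)):
    """Return names of pods in phase Pending that are older than `max_pending`.

    `now` must be a timezone-aware datetime. The threshold is an assumed default.
    """
    stuck = []
    for pod in pods:
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        # Kubernetes timestamps look like 2026-01-12T15:15:00Z; make the
        # trailing Z explicit so fromisoformat works on older Pythons too.
        created = datetime.fromisoformat(
            pod["metadata"]["creationTimestamp"].replace("Z", "+00:00")
        )
        if now - created > max_pending:
            stuck.append(pod["metadata"]["name"])
    return stuck
```

In practice this kind of rule would more likely live in the monitoring stack than in a script; the point is only that the signal (phase plus age) is readily available.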
Action Items
Timeline
| Time | Event |
|---|---|
| 4:28 PM | Outage is confirmed. Priority set to ‘P1’ by Georgiana. INCIDENT #1781 [FIRING:1] Two servers failed to start in the last 30m opensci big-binder (immediate action needed) |
| 4:29 PM | Engineer observes that no nodegroups are present in eksctl for the big-binder. Note added by Georgiana: No nodegroups for the big-binder are actually present in the output of eksctl get nodegroups. |
| 4:30 PM | Engineer recreates nodegroups. Note added by Georgiana: I am deleting and recreating them. |
| 5:42 PM | Engineer declares incident resolved. Georgiana Dolocan in #opensci-big-binder-managed-nodegroup-quota-jan-2026: both incidents are closed now and should not be impacting their users anymore. |