[opensci big-binder] Dask nodegroups are not available

| Field | Value |
| --- | --- |
| Impact Time | Jan 12 at 15:15 to Jan 12 at 17:06 |
| Duration | 1h 50m 10s |

What Happened

During a routine k8s upgrade, we reached the maximum number of managed node groups per cluster, so the dask nodegroups were not created. The CLI gave no obvious indication of this when we created the nodegroups, so we were unaware of the problem.

This meant that no dask sessions could be started.
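Because the CLI did not surface the failure, a post-upgrade check that compares the nodegroups we expect against the ones that actually exist could have caught this earlier. The following is a minimal sketch, not the tooling we use: it assumes the listing comes from `aws eks list-nodegroups`, whose JSON output carries a `nodegroups` array, and the function name and nodegroup names are hypothetical.

```python
import json


def missing_nodegroups(expected, list_output):
    """Return the expected nodegroup names that are absent from the listing.

    list_output is assumed to be the JSON text printed by
    `aws eks list-nodegroups --cluster-name <cluster>`.
    """
    present = set(json.loads(list_output).get("nodegroups", []))
    return sorted(set(expected) - present)


# Hypothetical nodegroup names and listing, for illustration only.
expected = ["core-a", "dask-r5-xlarge", "dask-r5-4xlarge"]
listing = '{"nodegroups": ["core-a"]}'

missing = missing_nodegroups(expected, listing)
if missing:
    print("Missing nodegroups:", ", ".join(missing))
```

Running a check like this at the end of an upgrade would turn a silent creation failure into an explicit list of missing nodegroups.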

Resolution

Deleting the nodegroups also deleted their CloudFormation stacks, which allowed us to recreate the dask nodegroups.

We also requested an increase for the managed node groups per cluster quota from 30 to 100.
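A pre-flight check before creating nodegroups could compare the planned total against the quota. The sketch below is only illustrative: the quota values 30 and 100 are taken from this report, while the existing and planned counts and the function name are hypothetical.

```python
def nodegroup_headroom(existing, planned, quota):
    """Return how many more nodegroups fit after creating the planned ones.

    A negative result means the plan would exceed the quota.
    """
    return quota - (existing + planned)


# Quotas 30 and 100 come from this report; the counts are hypothetical.
existing, planned = 28, 4
for quota in (30, 100):
    headroom = nodegroup_headroom(existing, planned, quota)
    status = "ok" if headroom >= 0 else "would exceed quota"
    print(f"quota={quota}: headroom={headroom} ({status})")
```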

Where We Got Lucky

Deleting and re-creating the nodegroups was a simple and handy fix that we could deliver quickly.

What Went Well

We had engineers on call when this happened, so we could investigate the problem pretty quickly.

What Didn’t Go So Well

A similar outage had happened a month before, but we didn’t get to the root cause of it and thought it was an isolated event. It turns out it wasn’t.

We did not get alerts for pods stuck in a Pending state for too long, so we only found the issue when two consecutive server startups failed.
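One way to detect this condition is to scan the pod list for pods that have been Pending longer than a threshold. The sketch below is an assumption about how such a check might look, not our alerting setup: it takes the JSON printed by `kubectl get pods --all-namespaces -o json`, uses each pod's `creationTimestamp` as an approximation of when it entered Pending, and the sample pods, threshold, and function name are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone


def long_pending_pods(pod_list_json, now, threshold):
    """Return (namespace, name) for pods that have been Pending too long.

    Pod age is measured from metadata.creationTimestamp, which is only an
    approximation of the time spent in the Pending phase.
    """
    stuck = []
    for pod in json.loads(pod_list_json).get("items", []):
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        meta = pod["metadata"]
        created = datetime.strptime(
            meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - created > threshold:
            stuck.append((meta["namespace"], meta["name"]))
    return stuck


def _pod(namespace, name, created, phase):
    """Build a minimal pod object with only the fields used above."""
    return {
        "metadata": {
            "namespace": namespace,
            "name": name,
            "creationTimestamp": created,
        },
        "status": {"phase": phase},
    }


# Hypothetical sample data, for illustration only.
sample = json.dumps({"items": [
    _pod("prod", "dask-worker-1", "2026-01-12T15:15:00Z", "Pending"),
    _pod("prod", "jupyter-user", "2026-01-12T15:40:00Z", "Pending"),
    _pod("prod", "hub", "2026-01-10T09:00:00Z", "Running"),
]})

now = datetime(2026, 1, 12, 15, 45, tzinfo=timezone.utc)
stuck = long_pending_pods(sample, now, timedelta(minutes=10))
for namespace, name in stuck:
    print(f"Pending too long: {namespace}/{name}")
```

A real alert would more likely live in the monitoring stack than in a script, but the same condition (phase is Pending and age exceeds a threshold) applies.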

Action Items

Timeline

| Time | Event |
| --- | --- |
| 4:28 PM | Outage is confirmed. Priority set to ‘P1’ by Georgiana. INCIDENT #1781 [FIRING:1] Two servers failed to start in the last 30m opensci big-binder (immediate action needed) |
| 4:29 PM | Engineer observes that no nodegroups are present in eksctl for the big-binder. Note added by Georgiana: No nodegroups for the big-binder are actually present in the output of eksctl get nodegroups |
| 4:30 PM | Engineer recreates nodepools. Note added by Georgiana: I am deleting and recreating them |
| 5:42 PM | Engineer declares incident resolved. Georgiana Dolocan, #opensci-big-binder-managed-nodegroup-quota-jan-2026: both incidents are closed now and should not be impacting their users anymore |