Students had trouble logging in to the hub when many of them tried to log in at the same time at the start of class.
| Field | Value |
|---|---|
| Impact Time | Unknown |
| Duration | 4d 1h 30m |
What Happened
Because a large number of users started their servers at the same time, the concurrent spawn limit of 64 was reached quickly. The autoscaler had to bring up new nodes, and since this took roughly 10 minutes from start to end, users who retried after 1 minute were not guaranteed to have their servers placed immediately.
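The interaction between the spawn limit and slow node startup can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not JupyterHub's actual implementation: the `SpawnThrottle` class and `simulate` function are hypothetical names, the 202 status merely stands for "accepted", and the model assumes every pending spawn is waiting on the new nodes, so no slot frees up until they are ready.

```python
from dataclasses import dataclass


@dataclass
class SpawnThrottle:
    """Toy hub that caps the number of simultaneously pending spawns."""

    limit: int = 64  # concurrent spawn limit mentioned in the report
    pending: int = 0

    def request_spawn(self) -> int:
        """Return 202 if the spawn is accepted, 429 if it is throttled."""
        if self.pending >= self.limit:
            return 429
        self.pending += 1
        return 202

    def finish_spawns(self, count: int) -> None:
        """Mark `count` pending spawns as started, freeing their slots."""
        self.pending = max(0, self.pending - count)


def simulate(retry_minutes, node_ready_minute=10, limit=64):
    """Statuses seen by one late user who retries at the given minutes.

    Assumption: all pending spawns wait for the new nodes, so nothing
    completes before `node_ready_minute`.
    """
    hub = SpawnThrottle(limit=limit)
    for _ in range(limit):  # the initial rush fills every slot
        hub.request_spawn()

    statuses = []
    nodes_ready = False
    for minute in retry_minutes:
        if minute >= node_ready_minute and not nodes_ready:
            hub.finish_spawns(limit)  # nodes are up, pending servers start
            nodes_ready = True
        statuses.append(hub.request_spawn())
    return statuses


print(simulate([1, 2, 5, 10]))  # prints [429, 429, 429, 202]
```

In this model, retries at minutes 1, 2, and 5 are all throttled because no pending spawn can finish before the nodes arrive; only the retry at minute 10 succeeds.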
Resolution
Where We Got Lucky
On GCP, we have extensive log persistence capabilities. This allowed us to look back at logs past Kubernetes’ default retention period, which helped us resolve the issue. We lack this on AWS, so we got lucky that this hub was on GCP.
What Went Well
Once we could see the 429 in the logs, we could put some mitigations in place easily.
What Didn’t Go So Well
We do not have an alert for this, so we found out about the issue from users rather than from automated alerts.
JupyterHub’s metrics don’t seem to expose multiple 429 status codes correctly.
Action Items
Increase the concurrent server limit from 64 https://github.com/2i2c-org/infrastructure/pull/6674 (done)
Investigate why 429 status responses weren’t showing up in Grafana https://github.com/2i2c-org/infrastructure/issues/6689
Investigate an alert for many user server startups being throttled https://github.com/2i2… issues/6690
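One possible shape for the throttling alert in the last action item is a sliding-window count of 429 responses. The sketch below is an assumption about how such a check might work, not the alert that was eventually built: `throttle_alerts`, the 5-minute window, the threshold of 20, and the `(timestamp, status)` event format are all hypothetical, and the events would have to be extracted from hub logs or metrics by some other means.

```python
from collections import deque
from datetime import datetime, timedelta


def throttle_alerts(events, window=timedelta(minutes=5), threshold=20):
    """Yield timestamps at which an alert would start firing.

    `events` is an iterable of (timestamp, status) pairs sorted by
    timestamp. An alert fires when the number of 429 responses in the
    trailing `window` first reaches `threshold`, and can fire again only
    after the count has dropped back below the threshold.
    """
    recent = deque()  # timestamps of 429 responses inside the window
    firing = False
    for timestamp, status in events:
        if status != 429:
            continue
        recent.append(timestamp)
        # Drop 429s that have fallen out of the trailing window.
        while timestamp - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold:
            if not firing:
                firing = True
                yield timestamp
        else:
            firing = False


# Hypothetical burst: five throttled requests, ten seconds apart.
start = datetime(2025, 1, 1, 9, 50)
burst = [(start + timedelta(seconds=10 * i), 429) for i in range(5)]
print(list(throttle_alerts(burst, threshold=3)))
```

With a threshold of 3, the example burst triggers a single alert at the third throttled request. Real thresholds would need tuning against normal start-of-class traffic.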
Pacific Time (U
Timeline
| Time | Event |
|---|---|
| 9:46AM | Due to an influx of users, the autoscaler goes from 2 to 7 user nodes. Request for |
| 9:50AM | 63 users are pending their servers starting up. JupyterHub’s concurrent pendi starts responding to users with a ‘429’ status code, asking them to try again i since the new nodes are not up yet, trying again after one minute (roughly) give |
| 9:58AM | 7 user nodes are up, and users are able to log in successfully when they try again |
| 3:33PM | The issue is reported to us via Freshdesk: https:// |
| 3:52PM | Triggered by Yuvi Panda through the website. Description: UCMerced Outage (View Message) INCIDENT #1317 UCMerced: Too Many Users Starting up at the same time |