# UCMerced: Too Many Users Starting Up at the Same Time
| Field | Value |
|---|---|
| Impact Time | Aug 29 at 09:00 to Sep 2 at 10:30 |
| Duration | 4d 1h 30m |
## Overview
Students experienced issues when a large number of them tried to log in to the hub at the same time at the start of class.
## What Happened
Due to a large number of users starting up at the same time, the concurrent spawn limit of 64 was reached quickly. New nodes had to be brought up by the autoscaler, and since this took roughly 10 minutes from start to finish, users who tried again after 1 minute were not guaranteed to get their servers placed immediately.
## Resolution
Increased the concurrent spawn limit from 64 to 100 (https://github.com/2i2c-org/infrastructure/pull/6674). This also needed 2i2c-org/infrastructure#6673.
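For reference, the limit involved is JupyterHub's `concurrent_spawn_limit` setting. Below is a minimal sketch of the change as it would look in a plain `jupyterhub_config.py`; on 2i2c hubs this value is managed through the infrastructure repo's Helm configuration, so the file and layout here are illustrative assumptions rather than the literal change in the PR.

```python
# jupyterhub_config.py -- illustrative sketch only; 2i2c hubs set this value
# through the Zero to JupyterHub Helm chart rather than by editing this file.
c = get_config()  # noqa: F821 -- `c` is provided by JupyterHub when it loads config

# concurrent_spawn_limit caps how many user servers may be pending (starting up)
# at once. Requests past the cap receive an HTTP 429 asking the user to retry.
c.JupyterHub.concurrent_spawn_limit = 100  # raised from 64 during this incident
```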
## Where We Got Lucky
On GCP, we have extensive log persistence capabilities. This allowed us to look back at logs past Kubernetes' default retention period, helping us resolve the issue. We lack this on AWS, so we got lucky that this hub was on GCP.
## What Went Well
Once we could see the 429 responses in the logs, we could easily put some mitigations in place.
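As a rough illustration of the kind of query this enables, the sketch below uses the google-cloud-logging client to pull hub container logs mentioning 429s from the persisted GKE logs. The project ID is a placeholder and the filter labels are assumptions based on a typical GKE / Zero to JupyterHub deployment, not the exact values for this cluster.

```python
# Hedged sketch: search persisted GKE container logs for hub responses
# mentioning "429" using the Google Cloud Logging API client.
from google.cloud import logging as gcp_logging

client = gcp_logging.Client(project="example-gcp-project")  # placeholder project ID

# Filter for the JupyterHub "hub" container; label values are assumptions.
log_filter = (
    'resource.type="k8s_container" '
    'AND resource.labels.container_name="hub" '
    'AND textPayload:"429"'
)

for entry in client.list_entries(
    filter_=log_filter, order_by=gcp_logging.DESCENDING, max_results=20
):
    print(entry.timestamp, entry.payload)
```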
## What Didn't Go So Well
- We do not have an alert for this, so we had to find out about the issue from users rather than from automated alerts.
- JupyterHub's metrics don't seem to expose multiple 429 status codes correctly (see the sketch below).
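To make the metrics gap concrete, here is a rough sketch of inspecting the hub's Prometheus endpoint for 429 responses directly. The metric name and `code` label are assumptions based on JupyterHub's standard request-duration histogram, and the hub URL and API token are placeholders.

```python
# Hedged sketch: scrape /hub/metrics and look for samples labelled code="429".
import requests
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "https://hub.example.org/hub/metrics"  # placeholder hub URL
HEADERS = {"Authorization": "token <placeholder-api-token>"}  # metrics may require auth

resp = requests.get(METRICS_URL, headers=HEADERS, timeout=10)
resp.raise_for_status()

for family in text_string_to_metric_families(resp.text):
    # Assumed metric: JupyterHub's request-duration histogram with a "code" label.
    if family.name != "jupyterhub_request_duration_seconds":
        continue
    for sample in family.samples:
        if sample.labels.get("code") == "429":
            print(sample.name, sample.labels, sample.value)
```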
## Action Items
- Collect pod logs and control plane logging for AWS too: https://github.com/2i2c-org/infrastructure/issues/6688 and https://github.com/2i2c-org/infrastructure/issues/6219
- Increase the concurrent server limit from 64: https://github.com/2i2c-org/infrastructure/pull/6674 (done)
- Investigate why 429 status responses weren't showing up in Grafana: 2i2c-org/infrastructure#6689
- Reduce the number of new nodes that need to come up to serve ucmerced: https://github.com/2i2c-org/infrastructure/pull/6673
- Investigate an alert for many user server startups being throttled: https://github.com/2i2c-org/infrastructure/issues/6690
## Timeline
### Aug 29, 2025
| Time | Event |
|---|---|
| 9:00 AM | A class of ~270 students starts using the hub for a live coding session. |
| 9:46 AM | Due to the influx of users, the autoscaler goes from 2 to 7 user nodes. A request for new nodes is initiated. |
| 9:50 AM | 63 users are waiting for their servers to start up. JupyterHub's concurrent pending spawn limit is 64, so it starts responding to users with a '429' status code, asking them to try again in a minute. However, since the new nodes are not up yet, trying again after one minute (roughly) gives them the same error. |
| 9:58 AM | 7 user nodes are up, and users who try to log in now are able to do so without issue. |
| 3:33 PM | The issue is reported to us via freshdesk: https:// |