
UCMerced: Too Many Users Starting up at the same time

Field | Value
Impact Time | Aug 29 at 09:00 to Sep 2 at 10:30
Duration | 1h 30m

Overview

Students experienced issues when many of them tried to log in to the hub at the same time at the start of class.

What Happened

Because a large number of users started up at the same time, the concurrent spawn limit of 64 was reached quickly. New nodes had to be brought up by the autoscaler, and since this took roughly 10 minutes from start to end, users who tried again after 1 minute were not guaranteed to have their servers placed immediately.
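The interaction between the spawn limit and the node startup delay can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not JupyterHub's actual implementation: `SpawnThrottle` and `simulate` are hypothetical names, every student is assumed to arrive at once and retry every 60 seconds, and no pending spawn is assumed to complete until the new nodes are up after 600 seconds.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple


@dataclass
class SpawnThrottle:
    """Toy model of a hub that caps the number of servers pending startup."""

    limit: int                 # maximum spawns allowed to be pending at once
    retry_after_s: int = 60    # retry hint returned alongside a 429
    pending: Set[str] = field(default_factory=set)

    def request_spawn(self, user: str) -> Tuple[int, Optional[int]]:
        """Return (status_code, retry_after_seconds)."""
        if user in self.pending:
            return 202, None   # already waiting; nothing new to admit
        if len(self.pending) >= self.limit:
            return 429, self.retry_after_s
        self.pending.add(user)
        return 202, None


def simulate(students=270, limit=64, node_ready_s=600, retry_s=60):
    """Count 429 responses seen before the new nodes are ready.

    Assumption: no pending spawn finishes before node_ready_s, so the
    pending set never drains during the simulated period.
    """
    throttle = SpawnThrottle(limit=limit, retry_after_s=retry_s)
    users = [f"user-{i}" for i in range(students)]
    rejected = 0
    for _ in range(0, node_ready_s, retry_s):  # one round per retry interval
        for user in users:
            status, _retry = throttle.request_spawn(user)
            if status == 429:
                rejected += 1
    return len(throttle.pending), rejected


if __name__ == "__main__":
    pending, rejected = simulate()
    print(f"pending={pending}, rejected_requests={rejected}")
```

In this model, retrying after the suggested minute returns the same 429 for as long as the pending set stays full. Raising `limit` admits more users into the pending queue, but it does not shorten the wait for new nodes.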

Resolution

  1. Increase the concurrent spawn limit from 64 to 100: https://github.com/2i2c-org/infrastructure/pull/6674. Also needed: 2i2c-org/infrastructure#6673

Where We Got Lucky

  1. On GCP, we have extensive log persistence capabilities. This allowed us to look back at logs past Kubernetes' default retention period, which let us resolve the issue. We lack this on AWS, so we got lucky that this hub was on GCP.

What Went Well

  1. Once we could see the 429 in the logs, we could put some mitigations in place easily.

What Didn’t Go So Well

  1. We do not have an alert for this, so we had to find out about the issue from users rather than automated alerts

  2. JupyterHub’s metrics don’t seem to expose multiple 429 status codes correctly

Action Items

  1. Collect pod logs and control plane logging for AWS too: https://github.com/2i2c-org/infrastructure/issues/6688 and https://github.com/2i2c-org/infrastructure/issues/6219

  2. Increase the concurrent spawn limit from 64 to 100: https://github.com/2i2c-org/infrastructure/pull/6674 (done)

  3. Investigate why 429 status responses weren’t showing up in Grafana 2i2c-org/infrastructure#6689

  4. Reduce the number of new nodes that need to come up to serve ucmerced: https://github.com/2i2c-org/infrastructure/pull/6673

  5. Investigate an alert for many user server startups being throttled: https://github.com/2i2c-org/infrastructure/issues/6690
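The logic behind an alert for throttled startups (action item 5) could look roughly like the sketch below. It is an illustration only: the function name, the `(timestamp, status_code)` event shape, the 5-minute window, and the threshold of 20 are all assumptions, and a real alert would more likely be defined in the monitoring stack than in standalone Python.

```python
from collections import deque
from datetime import datetime, timedelta


def throttling_alerts(events, window=timedelta(minutes=5), threshold=20):
    """Yield a timestamp each time the count of 429 responses in the
    trailing window rises to the threshold.

    events: iterable of (timestamp, status_code), sorted by timestamp.
    The window and threshold defaults are illustrative assumptions.
    """
    recent = deque()   # timestamps of 429s still inside the window
    firing = False     # avoid re-alerting while already over the threshold
    for ts, status in events:
        if status != 429:
            continue
        recent.append(ts)
        # Drop 429s that have aged out of the trailing window.
        while recent and ts - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold and not firing:
            firing = True
            yield ts
        elif len(recent) < threshold:
            firing = False


if __name__ == "__main__":
    start = datetime(2025, 8, 29, 9, 50)
    # Hypothetical burst: one 429 per second for 25 seconds.
    burst = [(start + timedelta(seconds=i), 429) for i in range(25)]
    for alert_time in throttling_alerts(burst):
        print("throttling alert at", alert_time)
```

The `firing` flag keeps a sustained burst from producing one alert per rejected request, which matters when hundreds of users retry at once.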

Timeline

Aug 29, 2025

Time | Event
9:00 AM | A class of ~270 students starts using the hub for a live coding session.
9:46 AM | Due to the influx of users, the autoscaler goes from 2 to 7 user nodes. A request for new nodes is initiated.
9:50 AM | 63 users are waiting for their servers to start. JupyterHub's concurrent pending spawn limit is 64, so it starts responding to users with a '429' status code, asking them to try again in a minute. However, since the new nodes are not up yet, trying again after (roughly) one minute gives them the same error.
9:58 AM | 7 user nodes are up, and users who try to log in now are able to do so.
3:33 PM | The issue is reported to us via Freshdesk: https://2i2c.freshdesk.com/a/tickets/3825