Students experienced issues when many of them tried to log in to the hub at the same time at the start of class.

Field | Value
Impact Time | Unknown
Duration | 4d 1h 30m

What Happened

Because a large number of users started their servers at the same time, the concurrent spawn limit of 64 was reached quickly. The autoscaler had to bring up new nodes, and since this took roughly 10 minutes from start to finish, users who retried after 1 minute were not guaranteed to have their servers placed immediately.
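The interaction between the spawn limit and slow node startup can be illustrated with a toy model. This is a minimal sketch, not JupyterHub's actual implementation: the function name, the one-step-per-minute clock, and the assumptions that pending spawns hold their slot until new nodes are ready and that every throttled user retries one minute later are all simplifications introduced here.

```python
def simulate_logins(new_users_per_minute, spawn_limit, node_ready_minute):
    """Toy model of spawn throttling while waiting for new nodes.

    Assumptions (not taken from JupyterHub itself):
    - pending spawns occupy a slot until `node_ready_minute`,
      after which they complete immediately;
    - every throttled user retries exactly one minute later.
    Returns a list of (minute, accepted, throttled) tuples.
    """
    pending = 0
    retrying = 0
    history = []
    for minute, new_users in enumerate(new_users_per_minute):
        if minute >= node_ready_minute:
            pending = 0  # nodes are up, so pending servers finish starting
        requests = new_users + retrying
        accepted = min(requests, spawn_limit - pending)
        throttled = requests - accepted  # these requests would get a 429
        pending += accepted
        retrying = throttled
        history.append((minute, accepted, throttled))
    return history


# Illustrative numbers only: 100 users arrive at once, nodes take 10 minutes.
history = simulate_logins([100] + [0] * 11, spawn_limit=64, node_ready_minute=10)
```

In this model, the 36 users over the limit are throttled on every retry until minute 10, which matches the observation that retrying after one minute does not help while nodes are still coming up.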

Resolution

  1. Increase the concurrent spawn limit from 64 to 100 2i2c-org/infrastructure#6674

  2. Put ucmerced users on larger nodes, so fewer node spinups are needed 2i2c-org/infrastructure#6673
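The reasoning behind the second change can be sketched as simple arithmetic: the more users fit on one node, the fewer new nodes the autoscaler has to bring up during a login burst. The function name and the per-node capacities below are hypothetical and chosen only for illustration; they are not the actual ucmerced node sizes.

```python
import math


def new_nodes_needed(users, users_per_node, existing_nodes):
    """Number of extra nodes the autoscaler must start for `users` servers.

    Assumes every user server is the same size, so a node holds a fixed
    number of them (a simplification for illustration).
    """
    total_nodes = math.ceil(users / users_per_node)
    return max(0, total_nodes - existing_nodes)


# Hypothetical capacities: 15 users per smaller node vs. 60 per larger node.
small = new_nodes_needed(users=100, users_per_node=15, existing_nodes=2)
large = new_nodes_needed(users=100, users_per_node=60, existing_nodes=2)
```

With these made-up numbers, the smaller nodes require 5 new nodes to start, while the larger nodes require none beyond the 2 already running.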

Where We Got Lucky

  1. On GCP, we have extensive log persistence capabilities. This allowed us to look back at logs past Kubernetes’ default retention period, which helped us resolve the issue. We lack this on AWS, so we got lucky that this hub was on GCP.

What Went Well

  1. Once we could see the 429 in the logs, we could put some mitigations in place easily.

What Didn’t Go So Well

  1. We do not have an alert for this, so we had to find out about the issue from users rather than automated alerts

  2. JupyterHub’s metrics don’t seem to expose multiple 429 status codes correctly

Action Items

  1. Collect pod logs and control plane logging for AWS too: 2i2c-org/infrastructure#6688 2i2c-org/infrastructure#6219

  2. Increase the concurrent server limit from 64 2i2c-org/infrastructure#6674 (done)

  3. Investigate why 429 status responses weren’t showing up in Grafana 2i2c-org/infrastructure#6689

  4. Reduce the number of new nodes that need to come up to serve ucmerced https://github.com/2i2c-org/infrastructure/pull/6673

  5. Investigate an alert for many user server startups being throttled https://github.com/2i2 issues/6690
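The logic of an alert for throttled server startups could look something like the sketch below. It only illustrates the idea of counting 429 responses in a sliding time window. The class name, threshold, and window length are hypothetical, and a real alert would more likely be defined in the monitoring stack than in standalone Python.

```python
from collections import deque


class ThrottleAlert:
    """Fire when at least `threshold` 429 responses occur within a window.

    Assumes `record` is called with non-decreasing timestamps (in seconds).
    """

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp, status_code):
        """Record one response; return True if the alert should fire."""
        if status_code == 429:
            self._timestamps.append(timestamp)
        # Drop 429s that have fallen out of the sliding window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


# Hypothetical settings: alert on 3 or more 429s within 60 seconds.
alert = ThrottleAlert(threshold=3, window_seconds=60)
results = [
    alert.record(0, 429),
    alert.record(10, 429),
    alert.record(20, 200),
    alert.record(30, 429),
    alert.record(100, 200),
]
```

The alert fires on the third 429 inside the window and clears once the throttled responses age out.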

Yuvi Panda

Pacific Time (U

Timeline

Time | Event
9:46 AM | Due to influx of users, the autoscaler goes from 2 to 7 user nodes. Request for
9:50 AM | 63 users are pending their servers starting up. JupyterHub’s concurrent pendi starts responding to users with a ‘429’ status code, asking them to try again i since the new nodes are not up yet, trying again after one minute (roughly) give
9:58 AM | 7 user nodes are up, and users are able to log in fine when they try now
3:33 PM | The issue is reported to us via Freshdesk: https://2i2c.freshdesk.com/a/ticket
3:52 PM | Triggered by Yuvi Panda through the website. Description: UCMerced Outage INCIDENT #1317 UCMerced: Too Many Users Starting up at the same time