UToronto: Users who have never logged in before can't start servers

| Field | Value |
| --- | --- |
| Impact Time | Oct 3 at 09:11 to Oct 3 at 09:34 |
| Duration | 23m 24s |

Overview

The University of Toronto hub, on Azure, uses Azure File as home directory storage and therefore needs the chowning initContainer. We had removed it earlier, which caused server startups to fail for users who had never logged in before. Restoring it just for utoronto fixed the problem.
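To make the role of the chowning step concrete, here is a minimal Python sketch of what such a step amounts to: make sure a home directory is owned by the user the server runs as before the server starts. This is an illustration only, not the initContainer from our infrastructure. The function name `ensure_owned` is hypothetical, and the demo chowns a temporary directory to the current user so that it runs without root.

```python
import os
import tempfile


def ensure_owned(path: str, uid: int, gid: int) -> bool:
    """Chown `path` to uid:gid if needed; return True if ownership changed."""
    st = os.stat(path)
    if st.st_uid == uid and st.st_gid == gid:
        # Already owned by the target user and group: nothing to do.
        return False
    os.chown(path, uid, gid)
    return True


if __name__ == "__main__":
    # Demo on a throwaway directory, targeting the current user so no root
    # privileges are needed. A real init step would typically run as root
    # and target the notebook user's uid/gid (an assumption, not shown here).
    with tempfile.TemporaryDirectory() as home:
        changed = ensure_owned(home, os.getuid(), os.getgid())
        print(f"ownership changed: {changed}")
```

A user who has never logged in has no home directory yet, so a freshly created one is the case where this ownership fix matters. That matches the failure pattern seen in this incident.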

Where We Got Lucky

  1. An engineer in a US time zone happened to check Freshdesk (otherwise this would’ve persisted for at least 3 full days)

What Went Well

  1. We were able to restore service pretty quickly once the report was acknowledged

What Didn’t Go So Well

  1. Our alerting didn’t catch this, so we had to wait for the community to notice and report it to us. This also slowed down our investigation, because we didn’t know exactly where the 500 error was coming from

  2. Our logs had no mention of this particular username, and it is unclear why

Action Items

Timeline

| Time | Event |
| --- | --- |
| 8:00AM | 2i2c-org/infrastructure#6873 was merged, removing the initContainers that did the chown from our infrastructure, following the rollout of jupyterhub-home-nfs everywhere |
| 7:00AM | https://2i2c.freshdesk.com/a/tickets/4038 comes in, reporting that some users have had trouble starting servers, with ‘500 Internal Server’ errors, since the previous day |
| 9:11AM | Acknowledged as an outage. PagerDuty P1 incident #1538 is created, triggered by Yuvi Panda through Slack, with the description “UToronto: Users who have never logged in before can’t start servers” |
| 9:15AM | Searching the hub logs, both the existing logs and jupyterhub.log on the persistent dir, for the affected user’s username turns up nothing. An issue with the login service is considered: the error is an ‘internal server error’, but there is no clear indication of which service it is coming from |
| 9:20AM | An engineer is able to recreate the issue by deleting their own home directory and trying to start a server (details in 2i2c-org/infrastructure#6888). This was attempted on intuition, prompted by remembering that there had been recent changes to initContainers |
| 9:30AM | 2i2c-org/infrastructure#6887 was deployed locally, restoring service. This was communicated to the community |
| 9:34AM | Incident #1538 resolved by Yuvi Panda through the website |