UToronto Hub is throwing 500 errors when users try to log in

| Field | Value |
| --- | --- |
| Impact Time | Sep 6 at 05:25 to Sep 6 at 10:48 |
| Duration | 5h 23m |

Overview

Users received 500 errors when they tried to log in to the University of Toronto JupyterHub. This was a repeat of an outage from a week earlier: the fix for that issue (new credentials for AzureAD) had not been made permanent with a PR to our infrastructure repo. The outage was resolved when the credentials were added to our repository.

What Happened

After https://2i2c.freshdesk.com/a/tickets/183 was resolved one week ago, the final part of the fix, committing the new credentials to the 2i2c-org/infrastructure repository, was not completed. The resolution had been a deployment from a local checkout. When a redeployment of the UToronto hub was triggered in CI/CD by an unrelated change, it reverted the hub to the expired credentials still stored in the repository.
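
The failure mode described above (a local checkout holding state that the remote repository does not) can be detected mechanically. The following is a minimal sketch, not part of the 2i2c deployer: the function names `local_only_changes`, `git`, and `check_checkout` are hypothetical, and it assumes the current branch has an upstream configured. It only uses the standard `git status --porcelain` and `git rev-list` commands.

```python
import subprocess


def local_only_changes(porcelain_status: str, unpushed_commits: str) -> list[str]:
    """Summarise state that exists locally but not on the remote.

    porcelain_status: output of `git status --porcelain`
    unpushed_commits: output of `git rev-list @{upstream}..HEAD`
    """
    problems = []
    dirty = [line for line in porcelain_status.splitlines() if line.strip()]
    if dirty:
        problems.append(f"{len(dirty)} uncommitted file(s)")
    commits = [line for line in unpushed_commits.splitlines() if line.strip()]
    if commits:
        problems.append(f"{len(commits)} unpushed commit(s)")
    return problems


def git(*args: str) -> str:
    """Run a git command in the current directory and return its stdout."""
    result = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def check_checkout() -> list[str]:
    """Inspect the current checkout (assumes an upstream branch is set)."""
    status = git("status", "--porcelain")
    unpushed = git("rev-list", "@{upstream}..HEAD")
    return local_only_changes(status, unpushed)


if __name__ == "__main__":
    found = check_checkout()
    if found:
        print("Local checkout differs from the remote: " + ", ".join(found))
        print("A later CI/CD redeploy from the repository would not include these.")
    else:
        print("Local checkout matches the remote.")
```

Run from inside a checkout before a local deploy, a non-empty result would indicate that the deployed state cannot be reproduced by CI/CD until the changes are pushed and merged.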

Resolution

Upon discovery, the engineer who had handled the earlier incident redeployed from a local checkout with the correct credentials, which resolved the outage. The credentials were then immediately committed via 2i2c-org/infrastructure#1688, which was merged so that CI/CD deployments use them and the outage does not recur.

Where We Got Lucky

What Went Well

What Didn’t Go So Well

Action Items

Timeline

Aug 31, 2022

| Time | Event |
| --- | --- |
| 2:35 PM | (1 week prior) The UToronto Hub is reported as down, with 500 errors being thrown when users try to log in. Reported via https://2i2c.freshdesk.com/a/tickets/183 |
| 4:20 PM | The issue is resolved: the AzureAD credentials used by the hub had expired and needed to be renewed. Toronto IT reached out and provided new credentials. These were committed locally and deployed, but the commit was not pushed to the repo. |

Sep 7, 2022

| Time | Event |
| --- | --- |
| 5:25 AM | University of Toronto Hub is redeployed using the older expired credentials in the repo, marking the beginning of the outage. https://github.com/2i2c-org/infrastructure/actions/runs/2999517788 was an unrelated PR merge that (rightfully) retriggered a redeploy of all hubs from their state in the 2i2c-org/infrastructure repo. Because the new credentials had not yet been committed to the repository, the redeploy reverted the hub to the expired credentials. |
| 7:07 AM | UToronto community rep reports hub is down: https://2i2c.freshdesk.com/a/tickets/188 |
| 7:21 AM | Issue is acknowledged and posted on Slack. Message says: “UToronto is reporting 500 errors for all users on the hub: https://2i2c.freshdesk.com/a/tickets/188” https://2i2c.slack.com/archives/C028WU9PFBN/p1662560469509149 |
| 7:35 AM | Cause of the outage is determined to be the missing commit. Message says: “I wonder if deployed locally with the new secret and then a CI redeploy overwrote that?” https://2i2c.slack.com/archives/C028WU9PFBN/p1662561319162519?thread_ts=1662560469.509149&cid=C028WU9PFBN |
| 10:47 AM | Engineer with access to new credentials comes online |
| 10:50 AM | Yet another local redeploy fixes the issue immediately. Commands used were: (1) `git checkout <branch-name>` to switch to the local branch with the new credentials; (2) `python3 deployer deploy utoronto prod` to do a deploy. https://2i2c.slack.com/archives/C028WU9PFBN/p1662573041370449?thread_ts=1662562930.376269&cid=C028WU9PFBN |
| 10:51 AM | New PR is put up and merged to make sure the credentials are persisted for future deployments: 2i2c-org/infrastructure#1688 |