UToronto Hub is throwing 500 errors when users try to log in
| Field | Value |
|---|---|
| Impact Time | Sep 6 at 05:25 to Sep 6 at 10:48 |
| Duration | 5h 23m |
Overview
Users were getting 500 errors when they tried to log in to the University of Toronto JupyterHub. This was a repeat of an earlier outage from a week ago: the fix for that issue (new credentials for AzureAD) was not made permanent with a PR to our infrastructure repo. The outage was fixed when the credentials were added to our repository.
What Happened
After https://
Resolution
Upon discovery, the engineer who dealt with the previous outage ran a
local deployment with the correct credentials, which resolved the
issue. The credentials were then immediately put into CI/CD with
2i2c
Where We Got Lucky
The laptop with the local commit containing the new credentials was still around. The engineer who made the fix last time had a laptop failure just after the fix, and it was pure luck that the laptop that failed was not the one holding the new credentials!
This happened 2 days before classes started, not 2 days after.
What Went Well
Our deployment scripts are great: once started, a deployment succeeded and immediately restored service.
What Didn’t Go So Well
The application secret expires every year, which is far from ideal.
The existing incident report got lost in GitHub, and it was not clear to the team that an important step had been missed from the resolution to the previous outage.
We did not have a secure way for University IT to share credentials with our team that avoided a single point of failure. Instead, the credentials were shared with a single engineer, which created a single point of failure in the process.
Our escalation policies weren’t clearly defined, so it was unclear if the one engineer known to be in possession of the new credentials could be paged or not.
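A yearly expiry like this is easy to lose track of. As a minimal sketch of one way to surface it ahead of time, the Python below classifies a secret by the number of days left before a known expiry date. The function names, the 30-day warning window, and the example date are all assumptions made for illustration; the real expiry date would have to come from whoever issues the credentials.

```python
from datetime import date


def days_until_expiry(expires_on: date, today: date) -> int:
    """Number of whole days from `today` until `expires_on` (negative if past)."""
    return (expires_on - today).days


def expiry_status(expires_on: date, today: date, warn_within_days: int = 30) -> str:
    """Classify a secret as "expired", "expiring-soon", or "ok".

    `warn_within_days` is a hypothetical threshold, not a value from this report.
    """
    remaining = days_until_expiry(expires_on, today)
    if remaining < 0:
        return "expired"
    if remaining <= warn_within_days:
        return "expiring-soon"
    return "ok"


# Placeholder expiry date for illustration only; it is not the real
# expiry date of the UToronto AzureAD credentials.
HYPOTHETICAL_EXPIRY = date(2030, 1, 1)
```

Run on a schedule, a check such as `expiry_status(HYPOTHETICAL_EXPIRY, date.today())` could notify the team as soon as the result is anything other than `"ok"`.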
Action Items
Consider increasing validity of the UToronto AzureAD credentials - https://github.com/2i2c-org/infrastructure/issues/1693
Make sure we catch the next expiry before it happens - 2i2c-org/infrastructure#1694
Institute a way for community representatives to send us secrets in a way that does not create a single point of failure - 2i2c-org/infrastructure#639
Clarify our escalation policies during hub outages - 2i2c-org/infrastructure#1118
Re-work our incident response process to prevent steps being accidentally missed - https://github.com/2i2c-org/team-compass/pull/508
Additional Notes
We are trialing PagerDuty to improve our escalation procedures. This post-mortem feature is an important reason we are considering an incident response service such as PagerDuty. In order to create this report, there had to be a corresponding incident in the system. The PagerDuty Notifications/Acknowledgements logged on Sept 8 did not occur during the actual outage but were created after the fact to allow us to produce this post-mortem report.
Timeline
Aug 31, 2022
| Time | Event |
|---|---|
| 2:35 PM | (1 week prior) The UToronto Hub is reported as down, with 500 errors being thrown when users try to log in. Reported via https:// |
| 4:20 PM | The issue is resolved: the AzureAD credentials used by the hub had expired and needed to be renewed. Toronto IT reached out and provided new credentials. These were committed locally and deployed, but the commit was not pushed to the repo |
Sep 7, 2022
| Time | Event |
|---|---|
| 5:25 AM | University of Toronto Hub is redeployed using the older expired credentials in the repo, marking the beginning of the outage https:// |
| 7:07 AM | UToronto community rep reports hub is down: https:// |
| 7:21 AM | Issue is acknowledged, and posted on Slack: Message says: “UToronto is reporting 500 errors for all users on the hub: https:// |
| 7:35 AM | Cause of the outage is determined to be the missing commit: Message says: “I wonder if deployed locally with the new secret and then a CI redeploy overwrote that?” https:// |
| 10:47 AM | Engineer with access to new credentials comes online |
| 10:50 AM | Yet another local redeploy fixes the issue immediately. Commands used were: 1. `git checkout <branch-name>` to switch to the local branch with the new credentials 2. `python3 deployer deploy utoronto prod` to do a deploy https:// |
| 10:51 AM | New PR is put up and merged to make sure the credentials are persisted for future deployments 2i2c |
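
The timeline turns on a local deploy made from a commit that never reached the repo, so the next CI redeploy reverted it. As a rough sketch of how that state could be detected, the Python below asks git how many commits the current branch has that its upstream lacks. The helper names are hypothetical and are not part of the deployer; the only external call is `git rev-list --count @{upstream}..HEAD`.

```python
import subprocess
from typing import Optional


def parse_ahead_count(rev_list_output: str) -> int:
    """Parse the output of `git rev-list --count`, which is a single integer."""
    return int(rev_list_output.strip())


def unpushed_commit_count(repo_dir: str = ".") -> Optional[int]:
    """Return how many commits HEAD has that its upstream branch lacks.

    Returns None when git fails, for example because the current branch
    has no upstream configured (a purely local branch) or git is missing.
    """
    try:
        result = subprocess.run(
            ["git", "rev-list", "--count", "@{upstream}..HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None
    return parse_ahead_count(result.stdout)


def deploy_is_reproducible(ahead: Optional[int]) -> bool:
    """CI can only reproduce a deploy if every local commit has been pushed.

    An unknown count (None) is treated as not reproducible.
    """
    return ahead == 0
```

A wrapper around a local deploy could call `deploy_is_reproducible(unpushed_commit_count())` and print a reminder to open a PR when it returns `False`. Warning rather than blocking seems the safer choice here, since an emergency fix may legitimately need to go out before the PR is merged.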