# UToronto JupyterHub not accessible
| Field | Value |
|---|---|
| Impact Time | Aug 30 at 11:15 to Aug 30 at 12:13 |
| Duration | 58m |
## Overview
The UToronto JupyterHub was unreachable because all nodes in the
Kubernetes cluster were unavailable. This was caused by Microsoft
Azure rolling out a faulty update to all Ubuntu 18.04 nodes:
https://
## What Happened
Microsoft Azure rolled out an Ubuntu update to all Ubuntu 18.04
nodes. This is usually fairly harmless, and hundreds of these updates
happen in an automated fashion without us noticing. In this case,
a bug was triggered: https://
## Resolution

According to https://, rebooting the affected nodes clears the bug.
We kicked off an upgrade of the cluster’s Kubernetes and node
versions, which took down every existing node and replaced it with a
new, fully functional one, restoring service.
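For reference, here is a minimal sketch of what kicking off such an upgrade looks like with the Azure CLI; the resource group, cluster name, and target version are placeholders, not the values used during this incident:

```bash
# Upgrade the control plane and node pools; AKS drains each node
# and replaces it with a freshly imaged VM, which also reboots it.
az aks upgrade \
  --resource-group <resource-group> \
  --name <cluster-name> \
  --kubernetes-version <target-version>
```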
## Where We Got Lucky
- We hadn’t yet performed the Kubernetes cluster upgrade
  (2i2c-org/infrastructure#1067), so performing an upgrade was still
  an available option.
- The customer representative saw the Azure advisory and pointed it
  out to us on the Freshdesk ticket. This saved us a lot of time!
## What Went Well
- Once we determined the cause of the incident, we were able to take
  action that brought the hub back up very quickly.
## What Didn’t Go So Well
- We don’t have a documented way to SSH into the nodes of our
  Kubernetes clusters.
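As an aside, one possible stopgap (a sketch, not a documented procedure; it assumes a kubectl recent enough to support node debugging) is `kubectl debug`, which starts a pod on the node with the node’s root filesystem mounted at `/host`:

```bash
# Drop an interactive pod onto the node; the node name here is
# taken from the kubectl output in the timeline below.
kubectl debug node/aks-core-34239724-vmss000000 -it --image=ubuntu

# Inside the pod, chroot into the node's root filesystem to act
# as if logged in to the node itself.
chroot /host
```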
## Action Items
- Document how to SSH into our Kubernetes nodes:
  2i2c-org/infrastructure#1709
- Figure out how we can be notified of Azure advisories:
  2i2c-org/infrastructure#1716
## Timeline

### Aug 30, 2022
**11:15 AM** Outage reported, as jupyter.utoronto.ca is not
responding: https://

**11:20 AM** Issue is acknowledged.

**11:25 AM** A Microsoft advisory about a DHCP bug in Ubuntu is
forwarded to us via Freshdesk. The contents of the advisory are:

> Azure customers running Canonical Ubuntu 18.04 experiencing DNS
> errors - Investigating. Starting at approximately 06:00 UTC on 30
> Aug 2022, a number of customers running Ubuntu 18.04 (bionic) VMs
> recently upgraded to systemd version 237-3ubuntu10.54 reported
> experiencing DNS errors when trying to access their resources.
> Reports of this issue are confined to this single Ubuntu version.
> This bug and a potential fix have been highlighted on the Canonical
> / Ubuntu site, which we encourage impacted customers to read:
> https://
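The systemd version string above is the thing to look for. As a hedged sketch, with a shell on a node, confirming whether it runs the affected build could look like:

```bash
# Print the installed systemd package version; the advisory flags
# 237-3ubuntu10.54 as the broken build on Ubuntu 18.04.
dpkg-query -W -f='${Version}\n' systemd
```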
**11:41 AM** All nodes of the Kubernetes cluster are seen to be in a
`NotReady` state. This is the output of `kubectl get node`:

```
aks-core-34239724-vmss000000        NotReady   agent   273d   v1.20.7
aks-core-34239724-vmss000001        NotReady   agent   196d   v1.20.7
aks-core-34239724-vmss000002        NotReady   agent   196d   v1.20.7
aks-nbdefault-34239724-vmss000010   NotReady   agent   193d   v1.20.7
aks-nbdefault-34239724-vmss00002b   NotReady   agent   148d   v1.20.7
```
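(A small workflow aside, not a command from the incident: a one-liner like the following narrows such a listing to only the unhealthy nodes.)

```bash
# Show only nodes whose STATUS column reads NotReady
kubectl get nodes --no-headers | awk '$2 == "NotReady"'
```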
**11:43 AM** We determine that there is a high likelihood that the
`NotReady` nodes are related to the advisory we were given. Upon
running `kubectl get node -o yaml <node-name>` for any of the
`NotReady` nodes, we discover that the label
`kubernetes.azure.com/node-image-version` reports a node image based
on an Ubuntu version affected by the Microsoft advisory.
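A more targeted way to read that one label, as a sketch (the node name is taken from the listing above; note that dots inside the label key must be escaped in jsonpath):

```bash
# Print just the node-image-version label for a single node
kubectl get node aks-core-34239724-vmss000000 \
  -o jsonpath='{.metadata.labels.kubernetes\.azure\.com/node-image-version}'
```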
**11:53 AM** An upgrade of the cluster’s Kubernetes version and node
versions is kicked off. The resolution for the listed Ubuntu bug is
to reboot the nodes, and since the nodes aren’t found in the Azure
Portal’s VMs page, an upgrade is considered the fastest way to
“reboot” them. The upgrade takes down the existing nodes and replaces
them with new ones, so as per the advisory we should have working
nodes afterwards. Note that this isn’t a common occurrence, but in
this case the DHCP bug really just needed a reboot to fix. Reading
the advisory provided this information: https://
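For context, AKS nodes live in VM scale sets inside the cluster’s managed node resource group (named like `MC_<rg>_<cluster>_<region>`), which is why they don’t appear on the portal’s VMs page. A hedged sketch of the direct reboot route we didn’t take, with the scale-set name inferred from the node names above:

```bash
# Restart all instances in one node pool's scale set
az vmss restart \
  --resource-group MC_<rg>_<cluster>_<region> \
  --name aks-core-34239724-vmss \
  --instance-ids '*'
```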
**12:14 PM** The upgrade is complete, and the hub is back online. All
the old nodes are retired, and new, fully functional nodes take their
place. Service is restored.