
UToronto JupyterHub not accessible

Impact Time: Aug 30 at 11:15 to Aug 30 at 12:13
Duration: 58m

Overview

The UToronto JupyterHub was unreachable because all nodes in the kubernetes cluster became unavailable. This was due to Microsoft Azure rolling out a faulty update to all Ubuntu 18.04 nodes (https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119). We cycled all existing nodes by upgrading the cluster, which fixed the outage.

What Happened

Microsoft Azure rolled out an Ubuntu update to all Ubuntu 18.04 nodes. This is usually fairly harmless, and hundreds of these updates happen in an automated fashion without us noticing. In this case, the update triggered a bug (https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119) that caused networking to fail on all existing kubernetes nodes, bringing down the entire hub.
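As a rough sketch (not the exact commands run during this incident), the presence of the affected systemd build on the nodes could be checked from the cluster side like this; the node name is taken from the timeline below, and the affected version (237-3ubuntu10.54) comes from the advisory quoted there:

    # List nodes along with their OS image, to spot Ubuntu 18.04 hosts
    kubectl get nodes -o wide

    # Open a debug pod on one of the nodes; the node's root filesystem is
    # mounted at /host inside the debug container
    kubectl debug node/aks-core-34239724-vmss000000 -it --image=ubuntu

    # From the debug shell, compare the installed systemd version against
    # the affected build (237-3ubuntu10.54)
    chroot /host dpkg-query -W systemd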

Resolution

According to https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119, a simple reboot would have fixed the issue. We had to upgrade the kubernetes cluster anyway (2i2c-org/infrastructure#1067), and since an upgrade would also give us new nodes, we initiated the upgrade immediately. It gave us new nodes as expected, and the outage was resolved.
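For context, an AKS upgrade of this sort is typically driven with the Azure CLI along the following lines; the resource group, cluster name, and target version here are placeholders rather than the actual values for this cluster (the real upgrade is tracked in 2i2c-org/infrastructure#1067):

    # Check which kubernetes versions the cluster can be upgraded to
    az aks get-upgrades --resource-group RESOURCE_GROUP --name CLUSTER_NAME --output table

    # Upgrade the control plane and node pools; AKS replaces the existing
    # (broken) nodes with freshly imaged ones as part of the rollout
    az aks upgrade --resource-group RESOURCE_GROUP --name CLUSTER_NAME \
        --kubernetes-version 1.21.9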

Where We Got Lucky

What Went Well

What Didn’t Go So Well

Action Items

Timeline

Aug 30, 2022

11:15 AM: Outage reported: jupyter.utoronto.ca is not responding. https://2i2c.freshdesk.com/a/tickets/181 is where it was reported.
11:20 AM: Issue is acknowledged.
11:25 AM: A Microsoft advisory about a DHCP bug in Ubuntu is forwarded to us via freshdesk. The contents of the advisory are:

“Azure customers running Canonical Ubuntu 18.04 experiencing DNS errors - Investigating

Starting at approximately 06:00 UTC on 30 Aug 2022, a number of customers running Ubuntu 18.04 (bionic) VMs recently upgraded to systemd version 237-3ubuntu10.54 reported experiencing DNS errors when trying to access their resources. Reports of this issue are confined to this single Ubuntu version.

This bug and a potential fix have been highlighted on the Canonical / Ubuntu site, which we encourage impacted customers to read: https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119

An additional potential workaround customers can consider is to reboot impacted VM instances so that they receive a fresh DHCP lease and new DNS resolver(s).

If you are running a VM with Ubuntu 18.04 image, and you are experiencing connectivity issues, we recommend you evaluate the above mitigation options. If you are not experiencing impact on your Ubuntu 18.04 images, but you have unattended security updates enabled, we recommend you review this setting until the Ubuntu issue is mitigated.

The next update will be provided in 6 hours, or as events warrant.”
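The advisory’s suggestion to review unattended security updates can be checked on an Ubuntu 18.04 host roughly like this (a sketch, assuming the stock unattended-upgrades configuration):

    # A line like APT::Periodic::Unattended-Upgrade "1"; here means
    # automatic package upgrades are enabled
    cat /etc/apt/apt.conf.d/20auto-upgrades

    # Check whether the unattended-upgrades service itself is running
    systemctl status unattended-upgrades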
11:41 AM: All nodes of the kubernetes cluster are seen to be in a NotReady state. This is the output of kubectl get node:

    aks-core-34239724-vmss000000        NotReady   agent   273d   v1.20.7
    aks-core-34239724-vmss000001        NotReady   agent   196d   v1.20.7
    aks-core-34239724-vmss000002        NotReady   agent   196d   v1.20.7
    aks-nbdefault-34239724-vmss000010   NotReady   agent   193d   v1.20.7
    aks-nbdefault-34239724-vmss00002b   NotReady   agent   148d   v1.20.7
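To dig into why a particular node is NotReady, its conditions and recent events can be inspected with a command along these lines (a generic check, not output captured during the incident):

    # Show the node's conditions (Ready, NetworkUnavailable, etc.) and
    # recent events, which point at kubelet or networking trouble on the host
    kubectl describe node aks-core-34239724-vmss000000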
11:43 AM: We determine that there is a high likelihood that the NotReady nodes are related to the advisory. Upon running kubectl get node -o yaml <node-name> for any of the NotReady nodes, we discover that the kubernetes.azure.com/node-image-version label reports an Ubuntu node image version affected by the Microsoft advisory.
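A quicker way to review that label across every node at once would be something like the following sketch, using the label name mentioned above:

    # Print each node with its AKS node image version as an extra column
    kubectl get nodes -L kubernetes.azure.com/node-image-version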
11:53 AM: An upgrade of the cluster’s kubernetes version and node versions is kicked off. The resolution for the listed Ubuntu bug is to reboot the nodes, and since the nodes aren’t found in the Azure Portal VMs page, an upgrade is considered the fastest way to ‘reboot’ them: it takes down the existing nodes and replaces them with new ones, and as per the advisory we should have working nodes after that. This isn’t a super common occurrence, but in this case it was a DHCP bug that really just needed a reboot to fix, as described in https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119/comments/3.
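Had a plain reboot been the preferred route, the nodes could have been restarted through their underlying virtual machine scale sets instead of the portal’s VMs page; a rough sketch, where the node resource group is a placeholder and the scale set name is inferred from the node names above:

    # AKS worker nodes live in scale sets inside the cluster's node
    # resource group, not as standalone VMs
    az vmss list --resource-group NODE_RESOURCE_GROUP --output table

    # Restart every instance in a scale set so it picks up a fresh DHCP
    # lease and new DNS resolvers
    az vmss restart --resource-group NODE_RESOURCE_GROUP --name aks-core-34239724-vmss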
12:14 PM: Upgrade is complete, and the hub is back online. All the old nodes are retired, and new, fully functional nodes take their place. Service is restored.