
UToronto JupyterHub not accessible

Impact Time: Aug 30 at 11:15 to Aug 30 at 12:13
Duration: 58m

Overview

The UToronto JupyterHub was unreachable because all nodes in the kubernetes cluster became unavailable. This was due to Microsoft Azure rolling out a faulty update to all Ubuntu 18.04 nodes (https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119). We cycled all existing nodes by upgrading the cluster, which fixed the outage.

What Happened

Microsoft Azure rolled out an Ubuntu update to all Ubuntu 18.04 nodes. This is usually fairly harmless, and hundreds of these updates happen in an automated fashion without us noticing. In this case, the update triggered a bug (https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119) that caused networking to fail on all existing kubernetes nodes, bringing down the entire hub.
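As a rough sketch (not the exact commands run during this incident), the presence of the affected systemd build on the nodes could be checked from the cluster side like this; the node name is taken from the timeline below, and the affected version (237-3ubuntu10.54) comes from the advisory quoted there:

    # List nodes along with their OS image, to spot Ubuntu 18.04 hosts
    kubectl get nodes -o wide

    # Open a debug pod on one of the nodes; the node's root filesystem is
    # mounted at /host inside the debug container
    kubectl debug node/aks-core-34239724-vmss000000 -it --image=ubuntu

    # From the debug shell, compare the installed systemd version against
    # the affected build (237-3ubuntu10.54)
    chroot /host dpkg-query -W systemd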

Resolution

According to https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119, a simple reboot would have fixed the issue. We had to upgrade the kubernetes cluster anyway (2i2c-org/infrastructure#1067), and since an upgrade would also give us new nodes, we initiated the upgrade immediately. It gave us new nodes as expected, and the outage was resolved.
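For context, an AKS upgrade of this sort is typically driven with the Azure CLI along the following lines; the resource group, cluster name, and target version here are placeholders rather than the actual values for this cluster (the real upgrade is tracked in 2i2c-org/infrastructure#1067):

    # Check which kubernetes versions the cluster can be upgraded to
    az aks get-upgrades --resource-group RESOURCE_GROUP --name CLUSTER_NAME --output table

    # Upgrade the control plane and node pools; AKS replaces the existing
    # (broken) nodes with freshly imaged ones as part of the rollout
    az aks upgrade --resource-group RESOURCE_GROUP --name CLUSTER_NAME \
        --kubernetes-version 1.21.9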

Where We Got Lucky

What Went Well

What Didn’t Go So Well

Action Items

Timeline

Aug 30, 2022

11:15 AM: Outage reported: jupyter.utoronto.ca is not responding. https://2i2c.freshdesk.com/a/tickets/181 is where it was reported.
11:20 AM: Issue is acknowledged.
11:25 AM: A Microsoft advisory about a DHCP bug in Ubuntu is forwarded to us via freshdesk. The contents of the advisory are:

“Azure customers running Canonical Ubuntu 18.04 experiencing DNS errors - Investigating

Starting at approximately 06:00 UTC on 30 Aug 2022, a number of customers running Ubuntu 18.04 (bionic) VMs recently upgraded to systemd version 237-3ubuntu10.54 reported experiencing DNS errors when trying to access their resources. Reports of this issue are confined to this single Ubuntu version.

This bug and a potential fix have been highlighted on the Canonical / Ubuntu site, which we encourage impacted customers to read: https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119

An additional potential workaround customers can consider is to reboot impacted VM instances so that they receive a fresh DHCP lease and new DNS resolver(s).

If you are running a VM with Ubuntu 18.04 image, and you are experiencing connectivity issues, we recommend you evaluate the above mitigation options. If you are not experiencing impact on your Ubuntu 18.04 images, but you have unattended security updates enabled, we recommend you review this setting until the Ubuntu issue is mitigated.

The next update will be provided in 6 hours, or as events warrant.”
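The advisory’s suggestion to review unattended security updates can be checked on an Ubuntu 18.04 host roughly like this (a sketch, assuming the stock unattended-upgrades configuration):

    # A line like APT::Periodic::Unattended-Upgrade "1"; here means
    # automatic package upgrades are enabled
    cat /etc/apt/apt.conf.d/20auto-upgrades

    # Check whether the unattended-upgrades service itself is running
    systemctl status unattended-upgrades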
11:41 AM: All nodes of the kubernetes cluster are seen to be in a NotReady state. This is the output of kubectl get node:

    aks-core-34239724-vmss000000        NotReady   agent   273d   v1.20.7
    aks-core-34239724-vmss000001        NotReady   agent   196d   v1.20.7
    aks-core-34239724-vmss000002        NotReady   agent   196d   v1.20.7
    aks-nbdefault-34239724-vmss000010   NotReady   agent   193d   v1.20.7
    aks-nbdefault-34239724-vmss00002b   NotReady   agent   148d   v1.20.7
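To dig into why a particular node is NotReady, its conditions and recent events can be inspected with a command along these lines (a generic check, not output captured during the incident):

    # Show the node's conditions (Ready, NetworkUnavailable, etc.) and
    # recent events, which point at kubelet or networking trouble on the host
    kubectl describe node aks-core-34239724-vmss000000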
11:43 AM: We determine that there is a high likelihood that the NotReady nodes are related to the advisory. Upon running kubectl get node -o yaml <node-name> for any of the NotReady nodes, we discover that the kubernetes.azure.com/node-image-version label reports an Ubuntu node image version affected by the Microsoft advisory.
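A quicker way to review that label across every node at once would be something like the following sketch, using the label name mentioned above:

    # Print each node with its AKS node image version as an extra column
    kubectl get nodes -L kubernetes.azure.com/node-image-version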
11:53 AM: An upgrade of the cluster’s kubernetes version and node versions is kicked off. The resolution for the listed Ubuntu bug is to reboot the nodes, and since the nodes aren’t found in the Azure Portal VMs page, an upgrade is considered the fastest way to ‘reboot’ them: it takes down the existing nodes and replaces them with new ones, and as per the advisory we should have working nodes after that. This isn’t a super common occurrence, but in this case it was a DHCP bug that really just needed a reboot to fix, as described in https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119/comments/3.
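Had a plain reboot been the preferred route, the nodes could have been restarted through their underlying virtual machine scale sets instead of the portal’s VMs page; a rough sketch, where the node resource group is a placeholder and the scale set name is inferred from the node names above:

    # AKS worker nodes live in scale sets inside the cluster's node
    # resource group, not as standalone VMs
    az vmss list --resource-group NODE_RESOURCE_GROUP --output table

    # Restart every instance in a scale set so it picks up a fresh DHCP
    # lease and new DNS resolvers
    az vmss restart --resource-group NODE_RESOURCE_GROUP --name aks-core-34239724-vmss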
12:14 PM: Upgrade is complete, and the hub is back online. All the old nodes are retired, and new, fully functional nodes take their place. Service is restored.