Core node restarts on LEAP

| Field | Value |
| --- | --- |
| Impact Time | Jan 13 at 20:10 to Jan 14 at 14:02 |
| Duration | 17h 52m |

Overview

A workshop using the LEAP production hub experienced login loops. This was caused by repeated restarts of the proxy.

What Happened

The JupyterHub proxy was killed repeatedly for exceeding its memory limit. Meanwhile, the core node pool was repeatedly cycling through new nodes alongside the “primary” node running the hub and the proxy.

Resolution

The resolution was to increase the proxy memory limit to parity with the default 2i2c hub deployment configuration.

*All times listed in this report are in Edinburgh, London time.

Where We Got Lucky

The cluster was on GCP, so our logging infrastructure was very good.

What Didn’t Go So Well

We had no alerts for this failure, and in particular none for the restarts of the proxy.
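As an illustration of the kind of check that was missing, the following minimal sketch flags a burst of container restarts inside a sliding time window. It is not 2i2c's alerting configuration: the function name `should_alert`, the 30-minute window, and the threshold of three restarts are all assumptions chosen for the example, and the restart timestamps would have to come from whatever monitoring source is available.

```python
from datetime import datetime, timedelta


def should_alert(restart_times, window=timedelta(minutes=30), threshold=3):
    """Return True if at least `threshold` restarts fall within any `window`.

    `restart_times` is an iterable of datetime objects (hypothetical input;
    the window and threshold defaults are illustrative, not tuned values).
    """
    times = sorted(restart_times)
    start = 0  # index of the oldest restart still inside the window
    for end, current in enumerate(times):
        # Slide the left edge forward until the window fits again.
        while current - times[start] > window:
            start += 1
        if end - start + 1 >= threshold:
            return True
    return False


if __name__ == "__main__":
    # Made-up timestamps, used only to exercise the function.
    base = datetime(2026, 1, 13, 20, 0)
    burst = [base + timedelta(minutes=m) for m in (0, 5, 10)]
    print(should_alert(burst))
```

A check of this shape would fire on repeated proxy restarts in quick succession while ignoring an isolated restart.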

Action Items

Timeline

Jan 13, 2026

| Time | Event |
| --- | --- |
| 8:02 PM | Users experience service disruption. Existing users of the hub encounter login loops, and LEAP report this to 2i2c. |
| 8:35 PM | 2i2c support reaches out to out-of-hours engineer. An out-of-hours engineer is assigned the incident and prepares to investigate. |
| 9:11 PM | The engineer formally declares an incident. Description: Core node restarts on LEAP |
| 9:12 PM | Engineer reports core-node cycling, proxy restarts, and OOM kills. From an early investigation of various metrics and cluster properties, the engineer observes that the core-node pool is cycling through new nodes and that the proxy pod has restarted multiple times. |
| 9:31 PM | Engineer investigates core node behaviour. The engineer suspects that proxy interruption is the cause of the UX degradation, and that it is likely related to the core nodes cycling. |
| 11:30 PM | Engineer switches hypothesis to two separate incidents. After spending time searching for the root cause of the core node cycling, the engineer switches to a hypothesis of separate incidents involving the proxy and the core nodes. |

Jan 14, 2026

| Time | Event |
| --- | --- |
| 12:06 AM | Engineer identifies low memory limit on proxy. The LEAP proxy appears to have a smaller memory limit than the rest of 2i2c’s clusters. |
| 12:15 AM | Engineer leaves cluster untouched for further analysis in the morning. |
| 12:20 PM | Engineer deploys modified proxy configuration. This increases the memory limit of the proxy in line with other 2i2c clusters, which may prevent the incident from recurring. |
| 1:52 PM | Engineer tears down long-running core node. A long-running core node may be implicated in the core node cycling, and was removed from the node pool. |
| 2:02 PM | Core node restarts on LEAP |
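The 12:06 AM finding amounts to comparing one cluster's proxy memory limit against the limit used elsewhere. A minimal sketch of such a comparison follows, assuming limits are written as Kubernetes-style memory quantities such as `256Mi` or `1Gi`. The helper names `parse_memory` and `below_baseline`, the hub names, and every value shown are hypothetical; the report does not state the actual limits.

```python
# A subset of the suffixes used by Kubernetes memory quantities.
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '256Mi' to a number of bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: already a plain byte count


def below_baseline(limits: dict[str, str], baseline: str) -> list[str]:
    """Return the names whose limit is strictly smaller than `baseline`."""
    floor = parse_memory(baseline)
    return sorted(name for name, q in limits.items() if parse_memory(q) < floor)


if __name__ == "__main__":
    # Hypothetical hub names and limits, for illustration only.
    limits = {"hub-a": "256Mi", "hub-b": "1Gi", "hub-c": "2Gi"}
    print(below_baseline(limits, baseline="1Gi"))
```

Running a comparison like this across all hub configurations would surface any deployment whose proxy limit has drifted below the shared default.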