Core node restarts on LEAP
| Field | Value |
|---|---|
| Impact Time | Jan 13 at 20:10 to Jan 14 at 14:02 |
| Duration | 17h 52m |
Overview
A workshop using the leap production hub experienced login loops. This was caused by repeated restarts of the proxy.
What Happened
The JupyterHub proxy was killed repeatedly for exceeding its memory limit. Meanwhile, core nodes were repeatedly cycling in addition to the “primary” node running the hub and the proxy.
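As a rough illustration of how a memory-limit kill like this surfaces in Kubernetes, the sketch below scans a pod status document for containers whose last termination reason was `OOMKilled`. The field names (`containerStatuses`, `restartCount`, `lastState.terminated.reason`) follow the Kubernetes pod status schema. The pod name, container name, and restart count are hypothetical and are not taken from the LEAP cluster.

```python
from typing import Any


def oom_killed_containers(pod: dict[str, Any]) -> list[tuple[str, int]]:
    """Return (container name, restart count) for each container whose
    most recent termination was an out-of-memory kill."""
    hits = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState.terminated is absent if the container has never been restarted.
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            hits.append((status["name"], status.get("restartCount", 0)))
    return hits


# Hypothetical pod status, shaped like `kubectl get pod <name> -o json` output.
example_pod = {
    "metadata": {"name": "proxy-example"},
    "status": {
        "containerStatuses": [
            {
                "name": "chp",
                "restartCount": 4,
                "lastState": {
                    "terminated": {"reason": "OOMKilled", "exitCode": 137}
                },
            }
        ]
    },
}

print(oom_killed_containers(example_pod))  # [('chp', 4)]
```

A check of this kind only reports the most recent termination per container, so it complements rather than replaces the cluster logs.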
Resolution
The resolution was to increase the proxy memory limit to parity with the default 2i2c hub deployment configuration.
*All times listed in this report are in Edinburgh, London.
Where We Got Lucky
The cluster was on GCP, so our logging infrastructure was very good.
What Didn’t Go So Well
We had no alerts for this failure, in particular for the repeated restarts of the proxy.
Action Items
Investigate / evaluate persistent storage backend for the proxy.
Improve alerting to catch proxy/hub restarts.
Investigate core node cycling on GCP / leap.
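One possible shape for the restart alerting item is sketched below, assuming restart counts are sampled periodically (for example from pod status). It alerts when the restart count rises by at least a threshold within a trailing time window. The function names, threshold, window, and sample timestamps are all hypothetical, and this is not 2i2c's alerting implementation.

```python
from datetime import datetime, timedelta


def restarts_in_window(
    samples: list[tuple[datetime, int]], window: timedelta
) -> int:
    """Increase in restart count over the trailing window.

    `samples` holds (timestamp, restartCount) pairs sorted by timestamp.
    """
    if not samples:
        return 0
    end_time, end_count = samples[-1]
    cutoff = end_time - window
    # Baseline is the earliest sample that falls inside the window.
    baseline = next(count for ts, count in samples if ts >= cutoff)
    # A recreated pod resets its counter; clamp so that never goes negative.
    return max(end_count - baseline, 0)


def should_alert(
    samples: list[tuple[datetime, int]], window: timedelta, threshold: int
) -> bool:
    """True when restarts within the window reach the threshold."""
    return restarts_in_window(samples, window) >= threshold


# Hypothetical samples, not taken from the incident logs.
start = datetime(2026, 1, 13, 20, 0)
samples = [
    (start, 0),
    (start + timedelta(minutes=10), 1),
    (start + timedelta(minutes=20), 2),
    (start + timedelta(minutes=30), 4),
]

print(should_alert(samples, timedelta(minutes=30), threshold=3))  # True
print(should_alert(samples, timedelta(minutes=10), threshold=3))  # False
```

In practice the same rule would more likely be expressed in the monitoring system's own query language; the Python version only makes the windowing logic explicit.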
Timeline
Jan 13, 2026
| Time | Event |
|---|---|
| 8:02 PM | Users experience service disruption. Existing users of the hub encounter login loops, and LEAP report this to 2i2c. |
| 8:35 PM | 2i2c support reaches out to out-of-hours engineer. An out-of-hours engineer is assigned the incident and prepares to investigate. |
| 9:11 PM | The engineer formally declares an incident. Description: Core node restarts on LEAP. |
| 9:12 PM | Engineer reports core-node cycling, proxy restarts, and OOM kills. From an early investigation of various metrics and cluster properties, the engineer observes that the core-node pool is cycling through new nodes, and that the proxy pod has restarted multiple times. |
| 9:31 PM | Engineer investigates core node behaviour. The engineer suspects that proxy interruption is the cause of the UX degradation, and that it is likely related to the core nodes cycling. |
| 11:30 PM | Engineer switches hypothesis to two separate incidents. After spending time searching for the root cause of the core node cycling, the engineer switches to the hypothesis that the proxy and the core nodes are separate incidents. |
Jan 14, 2026
| Time | Event |
|---|---|
| 12:06 AM | Engineer identifies low memory limit on proxy. The LEAP proxy appears to have a smaller memory limit than the rest of 2i2c’s clusters. |
| 12:15 AM | Engineer leaves cluster untouched for further analysis in the morning. |
| 12:20 PM | Engineer deploys modified proxy configuration. This increases the memory limit of the proxy in line with other 2i2c clusters, which may prevent the incident from recurring. |
| 1:52 PM | Engineer tears down long-running core node. A long-running core node may be implicated in the core node cycling, and was removed from the node pool. |
| 2:02 PM | Core node restarts on LEAP |