Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Core node restarts on LEAP

FieldValue
Impact TimeJan 13 at 20:10 to Jan 14 at 14:02
Duration17h 52m 20s

Overview

A workshop using the leap production hub experienced login loops. This was caused by repeated restarts of the proxy.

What Happened

The JupyterHub proxy was killed repeatedly for exceeding its memory limit. Meanwhile, core nodes were repeatedly cycling in addition to the “primary” node running the hub and the proxy.

Resolution

To increase the proxy memory limit to parity with the default 2i2c hub deployment configuration.

Where We Got Lucky

The cluster was on GCP, so our logging infrastructure was very good.

What Didn’t Go So Well

We had no alerts to this failure, particularly the restarts of the proxy.

Action Items

Timeline

TimeEvent
8:02 PMUsers experience service disruption
8:35PM2i2c support reaches out to out-of-hours engineer
9:11PMThe engineer formally declares an incident
9:12PMEngineer reports core-node cycling, proxy restarts, and OOMkills
9:31PMEngineer investigates core node behaviour
11:30PMEngineer switches hypothesis to two separate incidents
12:06AMEngineer identifies low memory limit on proxy
12:15AMEngineer leaves cluster untouched for further analysis in the morning
12:20PMEngineer deploys modified proxy configuration
1:52PMEngineer tears down long-running core node
2:02PM