# Core node restarts on LEAP
| Field | Value |
|---|---|
| Impact Time | Jan 13 at 20:10 to Jan 14 at 14:02 |
| Duration | 17h 52m 20s |
## Overview
A workshop using the LEAP production hub experienced login loops, caused by repeated restarts of the JupyterHub proxy.
## What Happened
The JupyterHub proxy was repeatedly killed for exceeding its memory limit. In parallel, core nodes were cycling repeatedly, in addition to the “primary” core node running the hub and the proxy.
## Resolution
The proxy memory limit was increased to parity with the default 2i2c hub deployment configuration.
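In a Zero to JupyterHub style deployment, a change like this is typically made via the Helm chart's `proxy.chp.resources` settings. A minimal sketch of the kind of override involved (the specific values here are illustrative assumptions, not the actual limits deployed):

```yaml
# Illustrative z2jh Helm values fragment: raise the configurable-http-proxy
# (chp) memory limit so it is no longer OOMKilled under workshop load.
proxy:
  chp:
    resources:
      requests:
        memory: 512Mi   # assumed value for illustration
      limits:
        memory: 1Gi     # assumed value for illustration
```

The request/limit pair matters here: the limit is what triggers the OOMKill, while the request affects scheduling on the core node.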
## Where We Got Lucky
The cluster was on GCP, where our logging infrastructure is strongest, which made investigation much easier.
## What Didn’t Go So Well
We had no alerting for this failure, in particular for the repeated restarts of the proxy.
## Action Items
- Investigate / evaluate a persistent storage backend for the proxy.
- Improve alerting to catch proxy/hub restarts.
- Investigate core node cycling on GCP / LEAP.
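For the alerting item, one common approach in a Kubernetes cluster with Prometheus and kube-state-metrics is a rule on container restart counts. A sketch, assuming those components are deployed and that the hub and proxy containers use the z2jh default names (`hub` and `chp`); the namespace, thresholds, and labels are placeholders:

```yaml
# Hypothetical Prometheus alert rule: fire when the hub or proxy
# container restarts more than twice within 15 minutes.
groups:
  - name: jupyterhub-restarts
    rules:
      - alert: HubOrProxyRestartLooping
        expr: |
          increase(
            kube_pod_container_status_restarts_total{
              namespace="prod",          # placeholder namespace
              container=~"hub|chp"
            }[15m]
          ) > 2
        labels:
          severity: page
        annotations:
          summary: "Container {{ $labels.container }} is restart-looping"
```

A rule like this would have surfaced the proxy OOMKill loop within minutes instead of relying on user reports.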
## Timeline
| Time | Event |
|---|---|
| 8:02 PM | Users experience service disruption |
| 8:35 PM | 2i2c support reaches out to the out-of-hours engineer |
| 9:11 PM | The engineer formally declares an incident |
| 9:12 PM | Engineer reports core-node cycling, proxy restarts, and OOMKills |
| 9:31 PM | Engineer investigates core node behaviour |
| 11:30 PM | Engineer switches hypothesis to two separate incidents |
| 12:06 AM | Engineer identifies the low memory limit on the proxy |
| 12:15 AM | Engineer leaves the cluster untouched for further analysis in the morning |
| 12:20 PM | Engineer deploys the modified proxy configuration |
| 1:52 PM | Engineer tears down the long-running core node |
| 2:02 PM | Incident resolved |