# Core node restarts on LEAP
| Field | Value |
|---|---|
| Impact Time | Jan 13 at 20:10 to Jan 14 at 14:02 |
| Duration | 17h 52m 20s |
## Overview
A workshop using the LEAP production hub experienced login loops, caused by repeated restarts of the JupyterHub proxy.
## What Happened
The JupyterHub proxy was repeatedly killed for exceeding its memory limit. In parallel, core nodes were cycling repeatedly, in addition to the “primary” core node running the hub and the proxy.
## Resolution
The proxy memory limit was increased to parity with the default 2i2c hub deployment configuration.
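In a Zero to JupyterHub style deployment, a change like this is typically made via the Helm chart's `proxy.chp.resources` settings. A minimal sketch of the kind of override involved (the specific values here are illustrative assumptions, not the actual limits deployed):

```yaml
# Illustrative z2jh Helm values fragment: raise the configurable-http-proxy
# (chp) memory limit so it is no longer OOMKilled under workshop load.
proxy:
  chp:
    resources:
      requests:
        memory: 512Mi   # assumed value for illustration
      limits:
        memory: 1Gi     # assumed value for illustration
```

The request/limit pair matters here: the limit is what triggers the OOMKill, while the request affects scheduling on the core node.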
## Where We Got Lucky
The cluster was on GCP, where our logging infrastructure is strongest, which made investigation much easier.
## What Didn’t Go So Well
We had no alerting for this failure, in particular for the repeated restarts of the proxy.
## Action Items
- Investigate / evaluate a persistent storage backend for the proxy.
- Improve alerting to catch proxy/hub restarts.
- Investigate core node cycling on GCP / LEAP.
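For the alerting item, one common approach in a Kubernetes cluster with Prometheus and kube-state-metrics is a rule on container restart counts. A sketch, assuming those components are deployed and that the hub and proxy containers use the z2jh default names (`hub` and `chp`); the namespace, thresholds, and labels are placeholders:

```yaml
# Hypothetical Prometheus alert rule: fire when the hub or proxy
# container restarts more than twice within 15 minutes.
groups:
  - name: jupyterhub-restarts
    rules:
      - alert: HubOrProxyRestartLooping
        expr: |
          increase(
            kube_pod_container_status_restarts_total{
              namespace="prod",          # placeholder namespace
              container=~"hub|chp"
            }[15m]
          ) > 2
        labels:
          severity: page
        annotations:
          summary: "Container {{ $labels.container }} is restart-looping"
```

A rule like this would have surfaced the proxy OOMKill loop within minutes instead of relying on user reports.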
## Timeline
| Time | Event |
|---|---|
| 8:02 PM | Users experience service disruption |
| 8:35 PM | 2i2c support reaches out to the out-of-hours engineer |
| 9:11 PM | The engineer formally declares an incident |
| 9:12 PM | Engineer reports core-node cycling, proxy restarts, and OOMKills |
| 9:31 PM | Engineer investigates core node behaviour |
| 11:30 PM | Engineer switches hypothesis to two separate incidents |
| 12:06 AM | Engineer identifies the low memory limit on the proxy |
| 12:15 AM | Engineer leaves the cluster untouched for further analysis in the morning |
| 12:20 PM | Engineer deploys the modified proxy configuration |
| 1:52 PM | Engineer tears down the long-running core node |
| 2:02 PM | Incident resolved |