Incident report: July 21, 2025 - Openscapes hub pods dying post-stress-testing
Summary
On July 21, 2025, we received a support request from Openscapes asking for confirmation that their hub would be resilient throughout a large (700 attendee) workshop session the next day. That evening, Yuvi completed a stress test, adding thousands of named servers to the database. Early the following morning, alerts fired when the Openscapes workshop hub failed a health check, with the hub pod showing OOMKilled status.
Resolution
We all theorised that the hub pod dying was related to the stress testing Yuvi had done the day before. Angus noted that the hub database was only 1.6 MB, so we did not think the issue came from the database being too big.
We still thought that the hub database may have been corrupted somehow, so we decided to create a backup of the database and try restarting the hub pod without a database. Restarting the hub without a database worked. We left that configuration in place for the workshop hub, since there is no need for persistent user state in this use case.
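For reference, a minimal sketch of the backup step, assuming the hub's database is a local SQLite file named jupyterhub.sqlite (the filename and location are assumptions, not details recorded in the incident):

```python
import sqlite3

# Assumed filename; the stdlib online-backup API is used so the copy is
# consistent even if something were still writing to the database.
src = sqlite3.connect("jupyterhub.sqlite")
dst = sqlite3.connect("jupyterhub.sqlite.bak")
with dst:
    src.backup(dst)
src.close()
dst.close()
```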
An upstream contribution to JupyterHub was made to address inefficient O(N^2) database table join operations.
Timeline
All times in EDT
July 22
1:03 AM First alert fires
5:00 AM Engineers acknowledge the alert
5:05 AM Incident declared. Georgiana notes that after deleting the hub deployment and re-deploying, the hub pod shows OOMKilled status
5:10 AM Team checks Grafana and the hub DB
5:20 AM Georgiana increases the hub memory limit so the team can see what is eating up the memory
5:41 AM Team starts a live call, successfully restarts the hub
7:42 AM Angus assesses the DB, with the following results:
Logs indicated we didn’t get past init_spawners
init_spawners performs a fairly complex operation: jupyterhub/app.py
I ran this locally on the downloaded DB, and the process was CPU-bound for 96 s, consuming >6 GiB of RAM at peak.
The DB had 736 servers and 1746 spawners.
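A quick sketch of how those counts can be reproduced from the downloaded DB, assuming the file is named jupyterhub.sqlite and that the relevant tables are called servers and spawners (both names are assumptions based on JupyterHub's ORM, not taken from the notes above):

```python
import sqlite3

# Assumed filename and table names; verify against the actual schema.
db = sqlite3.connect("jupyterhub.sqlite")
for table in ("servers", "spawners"):
    (count,) = db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    print(f"{table}: {count}")
db.close()
```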
7:51 AM Georgiana and Jenny discuss maximizing the hub pod memory limits to avoid issues during the workshop
8:37 AM Georgiana opens 2i2c
10:23 AM Jenny monitors CloudWatch metrics and notices a spike in EBS write ops
10:27 AM FreshDesk indicates Openscapes has deleted a large amount of data from the shared folder
11:24 AM Angus investigates what JupyterHub is doing at startup that takes so long. It looks like the problem in the query above is the LEFT OUTER JOIN that comes from the joinedload(orm.User._orm_spawners) call. This is performed to ensure that the field is loaded, which avoids later lazy DB lookups. In our case, we already have a JOIN on this field, and there is a one-to-many relationship between users and spawners, so this LEFT OUTER JOIN produces O(N) results for each of the O(N) JOIN rows. This means the query result scales as O(N^2)! As such, Angus believes that if we encounter this situation again, we can't effectively chase it down with more RAM, because the growth is nonlinear (see the sketch below).
PR here, if it ends up being the right fix: jupyterhub
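To illustrate the scaling argument, here is a minimal, self-contained sketch using throwaway SQLAlchemy models (not JupyterHub's actual ORM, and with illustrative user/server counts) that counts the rows the database returns when an eager-load LEFT OUTER JOIN is stacked on top of an existing JOIN against the same one-to-many table:

```python
from sqlalchemy import ForeignKey, create_engine, func, select
from sqlalchemy.orm import (DeclarativeBase, Mapped, Session, aliased,
                            mapped_column, relationship)


class Base(DeclarativeBase):
    pass


class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    spawners: Mapped[list["Spawner"]] = relationship(back_populates="user")


class Spawner(Base):
    __tablename__ = "spawners"
    id: Mapped[int] = mapped_column(primary_key=True)
    user_id: Mapped[int] = mapped_column(ForeignKey("users.id"))
    user: Mapped["User"] = relationship(back_populates="spawners")


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # 100 users with 10 named servers each, loosely mimicking the stress test.
    session.add_all(
        [User(spawners=[Spawner() for _ in range(10)]) for _ in range(100)]
    )
    session.commit()

    explicit = aliased(Spawner)  # the JOIN the startup query already performs
    eager = aliased(Spawner)     # the extra LEFT OUTER JOIN from the eager load

    rows = session.scalar(
        select(func.count())
        .select_from(User)
        .join(explicit, explicit.user_id == User.id)
        .outerjoin(eager, eager.user_id == User.id)
    )
    # 100 users x 10 x 10 = 10,000 rows for only 1,000 spawners: doubling the
    # servers per user quadruples the rows returned, i.e. quadratic growth.
    print(rows)
```

Each user contributes 10 × 10 = 100 rows instead of 10, so memory grows with the square of the servers per user; the upstream change referenced above aims to remove the redundant eager-load join so the result stays linear in the number of spawners.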
July 23
4:10 AM Jenny confirms a peak of 235 participants at 18:00 UTC the previous day, with a few OOM kills here and there, but the hub pod did not restart. Incident officially resolved!
Action Items
Write an incident report