EarthScope Investigation
| Field | Value |
|---|---|
| Impact Time | Unknown |
| Duration | Unknown |
Overview
This is an addendum to the initial incident report (reports
Summary
On May 12, 2025 we received a Freshdesk support ticket reporting that users’ servers were taking >15 minutes to start and that their kernels were being killed. This was happening during a workshop and was also impacting other instructors trying to use the hub for different purposes.
Upon investigation, we found there was interplay between two factors:
Many users coming on at the same time, causing some slow-down in new node spin-ups and user startup (expected)
Resource contention for existing users on existing nodes (due to a lack of appropriate resource limits) was causing a slew of unexpected errors after users had started (blank screens, kernels that never started, etc.)
Further, we discovered some gaps in our instrumentation that, had they been filled, would have helped us diagnose the problem and make better choices earlier on; we have since been addressing these. We also developed additional tooling during the course of this investigation that gives us more confidence that our infrastructure can handle specific workloads.
What Happened
At the time of the incident, we did not have explicit CPU or memory limits set for user servers - we set only guarantees. This meant that when a number of users tried to use all the CPU (or memory) available to them, some users (non-deterministically) might not get much CPU (or memory). We believe this is what happened at the start of the incident, and why our team’s actions after we were notified resolved the issue for the following days.
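To illustrate the distinction, here is a hedged sketch of a `jupyterhub_config.py` fragment using KubeSpawner. The values are hypothetical, not the settings actually used on the hub:

```python
# jupyterhub_config.py -- illustrative values only, not the hub's real settings.

# A guarantee maps to a Kubernetes resource *request*: the scheduler reserves
# this much for the pod, but the pod may burst far beyond it, which is how one
# heavy user can starve neighbours on the same node.
c.KubeSpawner.cpu_guarantee = 0.5
c.KubeSpawner.mem_guarantee = "1G"

# A limit maps to a Kubernetes resource *limit*: a hard cap. CPU above the cap
# is throttled, and memory above the cap gets the pod OOM-killed, so no single
# user can monopolise a node.
c.KubeSpawner.cpu_limit = 2
c.KubeSpawner.mem_limit = "4G"
```

With only the first pair set (as we had at the time), contention between co-scheduled users is resolved non-deterministically by the kernel rather than bounded up front.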
This graph (times in PST) illustrates the total number of nodes with high CPU usage. After our team’s intervention, users picked a profile option (2i2c
Resolution
We were already aware of this (notice the PR description (2i2c
Other hypotheses explored
We explored a number of different hypotheses, and will describe them briefly here. We ruled out all of them, but investigating them helped increase our confidence in what happened and improved our infrastructure’s reliability in other ways.
jupyterhub-home-nfs migration causing issues
We had moved from using AWS EFS to AWS EBS (with jupyterhub-home-nfs (https://
We attempted to recreate the issue and stress test the home directory setup by:
Creating the jupyterhub-simulator (http://github.com/2i2c-org/jupyterhub-simulator) project to simulate a large number of users starting up and doing a lot of disk IOPS
Manually trying to start a server, to see if we could reproduce the reported symptoms while the disk was fully saturated.
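The kind of concurrent disk-IOPS workload described above can be sketched as follows. This is hypothetical illustration code, not the actual jupyterhub-simulator implementation:

```python
# Sketch: many concurrent "users" hammering a shared filesystem with
# fsync-heavy writes, to approximate workshop-style home-directory load.
import concurrent.futures
import os
import tempfile
import time


def user_workload(user_dir: str, n_files: int = 20, size: int = 64 * 1024) -> int:
    """Write n_files of `size` bytes each, fsyncing every file; return bytes written."""
    os.makedirs(user_dir, exist_ok=True)
    written = 0
    for i in range(n_files):
        path = os.path.join(user_dir, f"file-{i}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(size))
            f.flush()
            os.fsync(f.fileno())  # force the write through to disk, like real IOPS pressure
        written += size
    return written


def stress(root: str, n_users: int = 8) -> float:
    """Run n_users workloads concurrently against `root`; return aggregate bytes/sec."""
    dirs = [os.path.join(root, f"user-{u}") for u in range(n_users)]
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_users) as ex:
        totals = list(ex.map(user_workload, dirs))
    elapsed = time.perf_counter() - start
    return sum(totals) / elapsed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        print(f"throughput: {stress(root) / 1e6:.1f} MB/s")
```

In practice `root` would point at the NFS-mounted home directory volume rather than a local tempdir, so the measured throughput reflects the shared storage under test.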
Despite multiple attempts, we were not able to replicate the described symptoms. We ruled out this hypothesis, but added a saturation alert (2i2c
Realtime collaboration in JupyterLab causing slow startup
JupyterLab’s real-time collaboration feature relies on an SQLite database to store state, and this causes issues on NFS (https://
We ruled this out because the earthscope image in use did not have the new version of JupyterLab with this functionality! We still added additional mitigation (2i2c
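For hubs that do have this JupyterLab version, a mitigation can be sketched as a server config fragment like the one below. This is a hedged example: the trait names come from the jupyter-collaboration extension and may differ across versions, and the path shown is hypothetical:

```python
# jupyter_server_config.py -- hypothetical mitigation sketch; trait names are
# from jupyter-collaboration and may vary by version.

# Option 1: turn real-time collaboration off entirely.
c.YDocExtension.disable_rtc = True

# Option 2: keep RTC, but move its SQLite state database off NFS onto
# node-local disk, avoiding SQLite's known locking problems over NFS.
c.SQLiteYStore.db_path = "/tmp/jupyter_ystore.db"
```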
cluster-autoscaler brings up nodes sequentially, not in parallel
We use the cluster-autoscaler (https://
Systemic Improvements
Ultimately, outages rarely have a single cause; they are systemic failures. As such, we tried to make systemic improvements based on our experience mitigating this particular outage, applied both to earthscope and to other communities. Here’s a short list of them:
Move as many communities as possible out of the ‘legacy’ setup with no resource limits to the newer, more user-friendly resource selection options. This is the most critical improvement.
Improve our logging and metrics collection so we can do better post-hoc analysis to understand what went wrong where. This is an ongoing process.
Dedicate two quarters of work to increasing our overall systemic resiliency, with a particular focus on alerting and our incident response process.
Increase our infrastructure engineering capacity (Angus Hollands is now an infrastructure engineer at 2i2c!)
Work on systematizing our community engagement process, so we can understand community needs and address them in more consistent, systematic ways.
Create and invest in the jupyterhub-simulator (http://github.com/2i2c-org/jupyterhub-simulator) project so we can slowly but surely simulate exact workshop workloads before a workshop, and have more faith in the infrastructure.