[dubois:ephemeral] Unable to start servers
| Field | Value |
|---|---|
| Impact Time | Feb 11 at 08:46 to Feb 11 at 22:44 |
| Duration | 13h 58m |
Overview
A community user image was updated from a spec without lockfiles, accidentally introducing a conflict between incompatible sqlite and libsqlite conda-forge packages. This prevented user servers from starting. The user image was using the latest tag, so the broken image was pulled in without audit.
What Happened
A series of startup failure alerts fired, followed by a FreshDesk ticket. The engineering team investigated and found that the community image was broken.
Resolution
Pinning the sqlite version in the user image resolves the issue. The long-term, systematic fix is to use lockfiles in user images, so that upstream package changes are not pulled in with every image update.
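As a minimal sketch of a guard that could run in an image build, the check below flags an environment where sqlite and libsqlite have drifted apart. It assumes the package list has the shape of `conda list --json` output (a list of records with `name` and `version` keys) and that the two packages are expected to share a version. The function name and the version numbers in the usage example are hypothetical, not taken from the incident.

```python
from typing import Dict, List, Optional, Tuple


def find_sqlite_mismatch(packages: List[Dict[str, str]]) -> Optional[Tuple[str, str]]:
    """Return (sqlite_version, libsqlite_version) if both are installed
    and their versions differ; otherwise return None."""
    # Collect the versions of the two packages we care about.
    versions = {
        pkg["name"]: pkg["version"]
        for pkg in packages
        if pkg["name"] in ("sqlite", "libsqlite")
    }
    sqlite_version = versions.get("sqlite")
    libsqlite_version = versions.get("libsqlite")

    # If either package is absent there is nothing to compare.
    if sqlite_version is None or libsqlite_version is None:
        return None
    if sqlite_version != libsqlite_version:
        return (sqlite_version, libsqlite_version)
    return None


if __name__ == "__main__":
    # Hypothetical environment listing, for illustration only.
    example = [
        {"name": "python", "version": "3.11.7"},
        {"name": "sqlite", "version": "3.44.2"},
        {"name": "libsqlite", "version": "3.45.1"},
    ]
    mismatch = find_sqlite_mismatch(example)
    if mismatch is not None:
        print(f"sqlite {mismatch[0]} does not match libsqlite {mismatch[1]}")
```

A check like this only catches this one conflict. A lockfile addresses the broader cause by keeping the solved environment fixed between image builds.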
Where We Got Lucky
The alerting system caught the problem, and the incident happened at a time when most engineers were available.
What Went Well
Once we acknowledged the incident, we found the issue pretty quickly
The community had provided additional details via the ticket (like the user image repository) that we could use directly
What Didn’t Go So Well
We found that persistent pod logs had been missing since a recent k8s upgrade, and pursuing this issue delayed incident resolution
The community merged the PR pretty late in the day, and additional guidance was needed for the fix to propagate
Action Items
Timeline
| Time | Event |
|---|---|
| 8:46AM | Alert is triggered outside engineers' working hours and is assigned to Herbie. |
| 9:05AM | FreshDesk ticket is opened by the community representative, letting us know about not being able to start servers |
| 11:30AM | Alert gets reassigned from the bot account to an engineer |
| 11:43AM | Engineer searches through hub logs. Notices that persistent logs have been missing since the k8s upgrade two weeks prior. Concludes that the prometheus chart version was not compatible with the new k8s version. |
| 1:00PM | Engineer tries starting a server and the attempt fails. User server logs show a missing pysqlite2 module. Notices that the user image was updated recently and suspects the outage is related to the changes. |
| 1:03PM | Engineer notices support ticket from community rep |
| 1:12PM | Alert is acknowledged as an outage and set to P1 priority by Georgiana. Alert text: INCIDENT #1869 [FIRING:1] Two servers failed to start in the last 30m, dubois ephemeral (immediate action needed). |
| 1:20PM | Second engineer comes along and notices the user image is not using lockfiles, which explains why recent changes might have produced a different environment solve. |
| 1:57PM | Engineer confirms that in the latest solve, the sqlite package version was downgraded, while the libsqlite package was not |
| 2:22PM | Engineer opens a fix PR in the community-maintained repository that holds the user image. They then let the community know about the fix. |
| 3:05PM | Outage is declared as resolved |
| 7:41PM | Community merges the PR and lets us know that the issue still persists |
| 8:55PM | Third engineer tries to reproduce the error, but they can’t. Suggests the problem might be because of the use of the latest tag for the user image. |
| 9:05PM | Community rep tries spawning a server from a different server and observes that the issue is gone |
| 9:08PM | Engineer updates the image_pull_policy to Always to make sure the latest image is always pulled onto the node. They then let the community know that it takes time after the PR merge for the process to finish and they’ve updated the infrastructure to improve it. |
| 10:44PM | Community confirms everything is ok |
| 3:15PM | Outage is closed |
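
The timeline shows that the `latest` tag both let the broken image in unaudited and slowed the rollout of the fix. As a hedged sketch, a lint along the following lines could flag image references that rely on a mutable tag. The function name and the example image names are hypothetical, and the parsing is a simplification of the full image reference grammar.

```python
def uses_mutable_tag(image: str) -> bool:
    """Return True if the image reference has no tag or the 'latest' tag,
    and is not pinned by digest."""
    # A digest (name@sha256:...) pins the exact image content.
    if "@" in image:
        return False

    # Only look for a tag in the last path segment, so that a registry
    # port such as localhost:5000 is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return True  # no tag given, so 'latest' is implied

    tag = last_segment.rsplit(":", 1)[1]
    return tag == "latest"


if __name__ == "__main__":
    # Hypothetical image references, for illustration only.
    for ref in (
        "quay.io/example/user-image:latest",
        "quay.io/example/user-image",
        "quay.io/example/user-image:2024.02.11",
    ):
        print(ref, "->", "mutable" if uses_mutable_tag(ref) else "pinned")
```

Setting image_pull_policy to Always, as was done during the incident, makes nodes re-pull a mutable tag. Pinning a specific tag or digest instead makes each image change an explicit, reviewable step.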