[dubois:ephemeral] Unable to start servers

Field	Value
Impact Time	Feb 11 at 08:46 to Feb 11 at 22:44
Duration	13h 58m

Overview

A community user image was updated from a spec without lockfiles, which accidentally introduced incompatible versions of the sqlite and libsqlite conda-forge packages. This prevented user servers from starting. The user image was using the `latest` tag, so the broken image was pulled in without audit.

What Happened

A series of server startup failure alerts fired, followed by a FreshDesk ticket. The engineering team investigated and found the community image to be broken.

Resolution

Pinning the sqlite version in the user image resolves the issue. The long-term, systematic fix would be to use lockfiles in user images, so that upstream package changes are not pulled in with every image update.
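
To make the value of a lockfile concrete, the sketch below compares two environment solves and reports packages whose versions differ. It assumes the package lists come from `conda list --json` output (a list of objects with `name` and `version` fields). The function names `packages_by_name` and `solve_drift` and the version numbers are hypothetical and are not taken from this incident.

```python
import json


def packages_by_name(conda_list_json: str) -> dict:
    """Map package name -> version from `conda list --json` style output."""
    return {pkg["name"]: pkg["version"] for pkg in json.loads(conda_list_json)}


def solve_drift(locked: dict, current: dict) -> dict:
    """Return packages whose version differs between two solves.

    Each value is a (locked_version, current_version) pair; None means the
    package is absent from that solve.
    """
    drift = {}
    for name in sorted(set(locked) | set(current)):
        before, after = locked.get(name), current.get(name)
        if before != after:
            drift[name] = (before, after)
    return drift


# Hypothetical versions, for illustration only (not the ones from this incident).
previous = packages_by_name(
    '[{"name": "sqlite", "version": "3.45.0"},'
    ' {"name": "libsqlite", "version": "3.45.0"}]'
)
latest = packages_by_name(
    '[{"name": "sqlite", "version": "3.40.0"},'
    ' {"name": "libsqlite", "version": "3.45.0"}]'
)
print(solve_drift(previous, latest))
```

Run against the previous and the newly built image, a check like this would surface a change such as sqlite moving while libsqlite stays put, before the image reaches users.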

Where We Got Lucky

The alerting system caught the problem, and the incident happened in a timezone where most engineers were available.

What Went Well

Once we acknowledged the incident, we found the issue pretty quickly
The community had provided additional details via the ticket (like the user image repository) that we could use directly

What Didn’t Go So Well

We found that persistent pod logs had been missing since the recent k8s upgrade, and pursuing this issue delayed incident resolution
The community merged the PR late in the day, and additional guidance was needed for the fix to propagate

Action Items

Timeline

Time	Event
8:46AM	Alert is triggered, but outside the engineers' working hours. Assigned to Herbie.
9:05AM	FreshDesk ticket is opened by the community representative, reporting that servers cannot be started
11:30AM	Alert gets reassigned from the bot account to an engineer
11:43AM	Engineer searches through hub logs. Notices that persistent logs have been missing since the k8s upgrade two weeks prior. Concludes that the prometheus chart version was not compatible with the new k8s version.
1:00PM	Engineer tries starting a server and fails. User server logs show missing pysqlite2 module. Notices that the user image was updated recently and suspects the outage is related to the changes.
1:03PM	Engineer notices support ticket from community rep
1:12PM	Alert is acknowledged as an outage and given P1 priority (priority set to ‘P1’ by Georgiana). INCIDENT #1869 [FIRING:1] Two servers failed to start in the last 30m dubois ephemeral (immediate action needed)
1:20PM	Second engineer comes along and notices the user image is not using lockfiles, which explains why recent changes might have produced a different environment solve.
1:57PM	Engineer confirms that in the latest solve, the sqlite package version was downgraded, while the libsqlite package was not
2:22PM	Engineer opens a fix PR in the community-maintained repository that holds the user image. They then let the community know about the fix (3:05PM).
7:41PM	Community merges the PR and lets us know that the issue still persists
8:55PM	Third engineer tries to reproduce the error, but they can’t. Suggests the problem might be because of the use of the `latest` tag for the user image.
9:05PM	Community rep tries spawning a server from a different server and observes that the issue is gone
9:08PM	Engineer updates the image_pull_policy to `Always` to make sure the latest image is always pulled onto the node. They then let the community know that it takes time after the PR merge for the process to finish and they’ve updated the infrastructure to improve it.
10:44PM	Community confirms everything is ok
3:15PM	Outage is closed
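
The `latest` tag is what allowed the changed image to reach users unreviewed, and it appears to be why the fix took extra steps to propagate to nodes. As a minimal sketch of a guard, the function below flags image references that have no tag or use `latest`. The function name `uses_mutable_tag` and the image names are hypothetical, and the parsing is a simplification, not a full implementation of the image reference grammar.

```python
def uses_mutable_tag(image: str) -> bool:
    """Return True if an image reference has no tag or uses `latest`.

    References pinned by digest (`name@sha256:...`) are treated as immutable.
    """
    if "@" in image:
        return False
    # Only the last path component can carry a tag; a colon earlier in the
    # reference belongs to a registry port such as `localhost:5000`.
    last_component = image.rsplit("/", 1)[-1]
    _name, _sep, tag = last_component.partition(":")
    return tag in ("", "latest")


# Hypothetical image references, for illustration only.
for ref in (
    "quay.io/example/user-image:latest",
    "quay.io/example/user-image",
    "quay.io/example/user-image:2024.02.11",
):
    print(ref, uses_mutable_tag(ref))
```

A check like this could run when the hub configuration changes, so that a move back to an unpinned image is noticed before it is deployed.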