[dubois:ephemeral] Unable to start servers

| Field | Value |
| --- | --- |
| Impact Time | Feb 11 at 08:46 to Feb 11 at 22:44 |
| Duration | 13h 58m |

Overview

A community user image was updated from a spec without lockfiles, accidentally introducing a conflict between incompatible sqlite and libsqlite conda-forge packages. This prevented user servers from starting. The user image was using the latest tag, so the broken image was pulled in without audit.
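One way to audit an image before users pull it is a small smoke test run inside the freshly built image. The sketch below is illustrative only: it assumes the sqlite/libsqlite conflict surfaces when Python's standard-library sqlite3 module is imported and used, which the report does not state explicitly. The function name sqlite_smoke_test is hypothetical.

```python
import sqlite3


def sqlite_smoke_test() -> str:
    """Exercise a tiny in-memory database and return the linked SQLite version.

    Assumption: a broken sqlite/libsqlite pairing fails either at import
    time or on the first real query, so both are exercised here.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")
        (count,) = conn.execute("SELECT COUNT(*) FROM t").fetchone()
        if count != 1:
            raise RuntimeError(f"unexpected row count: {count}")
    finally:
        conn.close()
    # Version of the SQLite library that Python is actually linked against.
    return sqlite3.sqlite_version


if __name__ == "__main__":
    print("SQLite library version:", sqlite_smoke_test())
```

Running such a script as a build step would make the image build fail instead of the user servers, at least for failures of this shape.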

What Happened

A series of startup failure alerts fired, followed by a FreshDesk ticket from the community. The engineering team investigated and found that the community image was broken.

Resolution

Pinning the sqlite version in the user image resolves the issue. The long-term, systematic fix is to use lockfiles in user images, so that upstream package changes are not pulled in with every image update.
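To illustrate why a lockfile helps, the following sketch compares two environment solves and reports which package pins changed. It assumes the pins are available as text in the `name=version=build` style produced by `conda list --export`. The helper names parse_pins and changed_pins are hypothetical, and the version numbers are placeholders, not the versions involved in this incident.

```python
from typing import Dict, Optional, Tuple


def parse_pins(export_text: str) -> Dict[str, str]:
    """Parse 'name=version=build' lines into a {name: version} mapping."""
    pins: Dict[str, str] = {}
    for line in export_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("=")
        if len(parts) < 2:
            continue  # no version on this line, nothing to compare
        pins[parts[0]] = parts[1]
    return pins


def changed_pins(
    old: Dict[str, str], new: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {name: (old_version, new_version)} for every pin that differs."""
    changes = {}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


# Placeholder data: made-up versions chosen only to show the shape of the diff.
OLD_SOLVE = "# previous image\nsqlite=3.2.0=h0\nlibsqlite=3.2.0=h0\npython=3.12.0=h1\n"
NEW_SOLVE = "# new image\nsqlite=3.1.0=h0\nlibsqlite=3.2.0=h0\npython=3.12.0=h1\n"

if __name__ == "__main__":
    for name, (before, after) in changed_pins(
        parse_pins(OLD_SOLVE), parse_pins(NEW_SOLVE)
    ).items():
        print(f"{name}: {before} -> {after}")
```

With a committed lockfile, a change like this appears in a reviewable diff at the moment the lockfile is regenerated, rather than silently at the next image build.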

Where We Got Lucky

The alerting system caught the problem, and the incident happened at a time of day when most engineers were available.

What Went Well

  1. Once we acknowledged the incident, we found the issue pretty quickly

  2. The community provided additional details in the ticket (such as the user image repository) that we could use directly

What Didn’t Go So Well

  1. We found that persistent pod logs had been missing since the recent k8s upgrade, and pursuing this issue delayed incident resolution

  2. The community merged the PR fairly late in the day, and additional guidance was needed for the fix to propagate

Action Items

  1. HigherEdData/Du-Bois-STEM#4

  2. 2i2c-org/infrastructure#7637

  3. 2i2c-org/infrastructure#7646

  4. 2i2c-org/infrastructure#7648

  5. 2i2c-org/docs#297

Timeline

| Time | Event |
| --- | --- |
| 8:46 AM | Alert is triggered outside engineers' working hours and is assigned to Herbie. |
| 9:05 AM | The community representative opens a FreshDesk ticket reporting that servers cannot be started. |
| 11:30 AM | Alert is reassigned from the bot account to an engineer. |
| 11:43 AM | Engineer searches through hub logs and notices that persistent logs have been missing since the k8s upgrade two weeks prior. Concludes that the prometheus chart version was not compatible with the new k8s version. |
| 1:00 PM | Engineer tries to start a server and fails. User server logs show a missing pysqlite2 module. The engineer notices that the user image was updated recently and suspects the outage is related to those changes. |
| 1:03 PM | Engineer notices the support ticket from the community rep. |
| 1:12 PM | Alert is acknowledged as an outage and given P1 priority (set by Georgiana). Incident #1869: "[FIRING:1] Two servers failed to start in the last 30m", dubois ephemeral (immediate action needed). |
| 1:20 PM | A second engineer notices that the user image is not using lockfiles, which explains why recent changes might have produced a different environment solve. |
| 1:57 PM | Engineer confirms that in the latest solve the sqlite package was downgraded while the libsqlite package was not. |
| 2:22 PM | Engineer opens a fix PR in the community-maintained repository that holds the user image, then lets the community know about the fix. |
| 3:05 PM | Outage is declared resolved. |
| 7:41 PM | Community merges the PR and reports that the issue still persists. |
| 8:55 PM | A third engineer tries to reproduce the error but cannot. Suggests the problem might be caused by the use of the latest tag for the user image. |
| 9:05 PM | Community rep tries spawning a server from a different server and observes that the issue is gone. |
| 9:08 PM | Engineer sets image_pull_policy to Always so that the latest image is always pulled onto the node. They then let the community know that the process takes time to finish after a PR merge, and that the infrastructure has been updated to improve this. |
| 10:44 PM | Community confirms everything is OK. |
| 3:15 PM | Outage is closed. |
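The late-evening entries are easier to follow with the Kubernetes pull-policy semantics in mind: under IfNotPresent a node that already holds an image for a tag keeps using it, while under Always the tag is re-resolved against the registry on every pod start. The sketch below is a simplified model of that decision, not Kubernetes code. The function name image_used_by_node and the digests are hypothetical.

```python
from typing import Optional


def image_used_by_node(
    policy: str, cached_digest: Optional[str], registry_digest: str
) -> str:
    """Return the image digest a node would run for a mutable tag such as 'latest'.

    Simplified model of two pull policies:
    - "IfNotPresent": use the node's cached image if it has one, else pull.
    - "Always": resolve the tag against the registry on every start, so the
      node ends up running whatever the tag currently points to.
    """
    if policy == "Always":
        return registry_digest
    if policy == "IfNotPresent":
        return cached_digest if cached_digest is not None else registry_digest
    raise ValueError(f"policy not modelled in this sketch: {policy!r}")


if __name__ == "__main__":
    broken, fixed = "sha256:broken", "sha256:fixed"  # hypothetical digests
    # A node that cached the broken image keeps serving it after the fix is pushed.
    print(image_used_by_node("IfNotPresent", broken, fixed))
    # A node with no cached copy pulls the fixed image.
    print(image_used_by_node("IfNotPresent", None, fixed))
    # With Always, every node picks up the fixed image on the next spawn.
    print(image_used_by_node("Always", broken, fixed))
```

This matches the pattern in the timeline, where the issue persisted in one place, could not be reproduced in another, and disappeared once image_pull_policy was set to Always.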