
LIS hub cannot scale

Field        Value
Impact Time  Nov 14 at 02:00 to Nov 14 at 08:30
Duration     6h 30m

Overview

At 10AM UK time, the B zone in Google Cloud’s europe-west2 region (London) ran out of resources to allocate. Although we deploy regional clusters, we restrict the nodes to a single zone, colocated with our NFS Filestore, to maximise performance. As a result, the cluster could not scale up, and a “backoff after failed scale-up” message was reported to the user.

What Happened

Resolution

The incident was resolved by allowing the node pool to create nodes in the A and C zones, which weren’t exhausted. This is a temporary fix: the NFS Filestore now serves nodes across zones, which will impact performance.
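
As an illustration of this kind of change, here is a minimal Python sketch that only builds (and does not run) a `gcloud container node-pools update` command with a wider `--node-locations` list. This report does not say which tool was actually used to make the change, so treat this as one possible approach. The cluster name `example-cluster`, the node pool name `example-pool`, and the helper `node_locations_command` are hypothetical.

```python
import shlex

REGION = "europe-west2"  # the London region named in this report


def node_locations_command(cluster, node_pool, zone_suffixes, region=REGION):
    """Build, without running, a gcloud command that sets a node pool's zones.

    `cluster` and `node_pool` are placeholders; substitute real names.
    """
    # Turn zone suffixes such as "a" into full zone names such as "europe-west2-a".
    zones = [f"{region}-{suffix}" for suffix in zone_suffixes]
    return [
        "gcloud", "container", "node-pools", "update", node_pool,
        f"--cluster={cluster}",
        f"--region={region}",
        f"--node-locations={','.join(zones)}",
    ]


# Hypothetical names: widen the pool from zone B only to zones A, B and C.
cmd = node_locations_command("example-cluster", "example-pool", ["a", "b", "c"])
print(shlex.join(cmd))
```

Calling the same helper with `["b"]` would produce the command for restricting the pool back to the B zone only. Printing the command rather than executing it leaves room to review it before anything touches the cluster.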

What Didn’t Go So Well

A single engineer worked on this issue, wearing many hats (incident commander, comms, debugging), for a large part of the incident. A team response would’ve provided more support. Pages also didn’t get through to response engineers, which compounded the problem.

Action Items

Do EITHER (1) OR (2):

  1. Restrict the node pool back to the B zone only once the Black Friday event is over.

  2. Move the cluster and the NFS to the A zone. (GitHub issue: 2i2c-org/infrastructure#1944)

  3. Investigate an enterprise plan which allows for regional NFS (would cost more). (GitHub issue: 2i2c-org/infrastructure#1945)

  4. Ensure PagerDuty notifications can always get through to engineers. (GitHub issue: 2i2c-org/team-compass#574)