LIS hub cannot scale
| Field | Value |
|---|---|
| Impact Time | Nov 14 at 02:00 to Nov 14 at 08:30 |
| Duration | 6h 30m |
Overview
At 10AM UK time the B zone in Google Cloud’s europe-west2 region (London) ran out of resources to allocate. While we deploy regional clusters, we restrict the nodes to a specific zone colocated with our NFS Filestore to maximise performance. As a result, the cluster could not scale up and a “backoff after failed scale-up” message was reported to the user.
What Happened
User reported “backoff after failed scale-up” error
This is usually related to quotas, so the engineer checked those and increased the quotas for Persistent SSD disk (which was red) and the CPUs. Neither of these worked.
Eventually a ZONE_RESOURCE_POOL_EXHAUSTED error for the requested zone was found in the logs
The affected node pool was edited in the GCP console to allow node creation in the A and C zones within the region
This resolved the problem
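The diagnostic step above, telling a quota problem apart from zone exhaustion, can be sketched as a small log scan. This is a minimal illustration, not the tooling used during the incident: the sample log lines are hypothetical, and `QUOTA_EXCEEDED` is assumed to be the error code for the quota case, while `ZONE_RESOURCE_POOL_EXHAUSTED` is the code reported above.

```python
# Map error codes to a human-readable cause.
# ZONE_RESOURCE_POOL_EXHAUSTED is the code seen in this incident;
# QUOTA_EXCEEDED is an assumed counterpart for quota-related failures.
CAUSES = {
    "ZONE_RESOURCE_POOL_EXHAUSTED": "zone out of capacity",
    "QUOTA_EXCEEDED": "project quota reached",
}


def classify_scale_up_failures(log_lines):
    """Count how many log lines mention each known failure cause."""
    counts = {}
    for line in log_lines:
        for code, cause in CAUSES.items():
            if code in line:
                counts[cause] = counts.get(cause, 0) + 1
    return counts


# Hypothetical log lines, for illustration only.
sample_logs = [
    "scale-up failed: ZONE_RESOURCE_POOL_EXHAUSTED in europe-west2-b",
    "scale-up failed: ZONE_RESOURCE_POOL_EXHAUSTED in europe-west2-b",
    "node pool healthy",
]

print(classify_scale_up_failures(sample_logs))
```

If only the zone-capacity cause shows up, raising quotas will not help, which matches what was observed here.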
Resolution
Incident resolved by allowing the node pool to create nodes in the A and C zones which weren’t exhausted. This is a temporary fix as it now means that the NFS Filestore is working across zones, which will impact performance.
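The change was made in the GCP console, but a command-line equivalent may be useful for future incidents. The sketch below only builds and prints a `gcloud container node-pools update` command with the `--node-locations` flag; it does not run it. The cluster and node pool names are hypothetical placeholders, and keeping the B zone in the list alongside A and C is an assumption.

```python
import shlex


def node_pool_locations_command(cluster, node_pool, region, zones):
    """Build (but do not run) a gcloud command that sets a node pool's zones."""
    return [
        "gcloud", "container", "node-pools", "update", node_pool,
        "--cluster", cluster,
        "--region", region,
        "--node-locations", ",".join(zones),
    ]


# "example-cluster" and "example-pool" are hypothetical names.
cmd = node_pool_locations_command(
    cluster="example-cluster",
    node_pool="example-pool",
    region="europe-west2",
    zones=["europe-west2-a", "europe-west2-b", "europe-west2-c"],
)

# Print the command for review before anyone runs it by hand.
print(shlex.join(cmd))
```

Passing only `["europe-west2-b"]` as `zones` would produce the command for restricting the pool back to the B zone, should that action item be chosen.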
What Didn’t Go So Well
A single engineer worked on this issue wearing many hats (incident commander, comms, debugging) for a large part of the incident. A team response would’ve provided more support. Pages also didn’t get through to response engineers, which compounded the above issue.
Action Items
Do EITHER (1) OR (2)
(1) Restrict the node pool back to only the B zone after the Black Friday event is over
(2) Move the cluster and the NFS to the A zone. GitHub issue: 2i2c-org/infrastructure#1944
Investigate an enterprise plan which allows for regional NFS (would cost more). GitHub issue: 2i2c-org/infrastructure#1945
Ensure PagerDuty notifications can always get through to engineers. GitHub issue: 2i2c-org/team-compass#574