LIS hub cannot scale
| Field | Value |
|---|---|
| Impact Time | Nov 14 at 02:00 to Nov 14 at 08:30 |
| Duration | 6h 30m |
Overview
At 10AM UK time, the B zone in Google Cloud’s europe-west2 region (London) ran out of resources to allocate. Although we deploy regional clusters, we restrict the nodes to a single zone, colocated with our NFS Filestore, to maximise performance. As a result, the cluster could not scale up, and a “backoff after failed scale-up” message was reported to the user.
What Happened
A user reported a “backoff after failed scale-up” error.
This error is usually related to quotas, so the engineer checked those and increased the quotas for Persistent SSD disk (which was red) and CPUs. Neither change worked.
Eventually the ZONE_RESOURCE_POOL_EXHAUSTED error was found in the logs.
Engineers confirmed this meant Google had no more resources in the requested zone.
The affected node pool was edited in the GCP console to allow node creation in the A and C zones within the region.
This resolved the problem.
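The log search that eventually surfaced the error could be sketched roughly as below. This is a minimal illustration, not the procedure the engineers followed: the function name `find_exhausted_zones`, the zone-name regular expression, and the sample log lines are all assumptions made up for this example, and real log entries will be structured differently.

```python
import re

ERROR_CODE = "ZONE_RESOURCE_POOL_EXHAUSTED"
# Assumption: zone names follow the "europe-west2-b" shape
# (region words, a number, a dash, then a single zone letter).
ZONE_PATTERN = re.compile(r"\b[a-z]+-[a-z]+\d+-[a-z]\b")


def find_exhausted_zones(log_lines):
    """Return sorted zone names found on lines that contain the error code."""
    zones = set()
    for line in log_lines:
        if ERROR_CODE in line:
            zones.update(ZONE_PATTERN.findall(line))
    return sorted(zones)


# Hypothetical log lines, invented for illustration only.
sample_lines = [
    "scale-up failed: quota check passed",
    "ZONE_RESOURCE_POOL_EXHAUSTED: zone europe-west2-b does not have enough resources",
    "node pool resize requested in europe-west2-b",
]

exhausted = find_exhausted_zones(sample_lines)
print(exhausted)
```

Searching for the error code first, before adjusting quotas, would distinguish a zone-level capacity problem from a project-level quota problem earlier in the investigation.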
Resolution
The incident was resolved by allowing the node pool to create nodes in the A and C zones, which weren’t exhausted. This is a temporary fix: the NFS Filestore now serves nodes across zones, which will impact performance.
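The change above was made in the GCP console. A command-line equivalent might look like the sketch below, which only builds the command and does not run it. It assumes the `--node-locations` flag of `gcloud container node-pools update`, which should be checked against the installed gcloud version before use. The cluster and node pool names are hypothetical placeholders, and keeping the B zone in the list alongside A and C is also an assumption.

```python
import shlex


def node_locations_command(cluster, node_pool, region, zones):
    """Build (but do not run) a gcloud command that sets a node pool's zones."""
    if not zones:
        raise ValueError("at least one zone is required")
    return [
        "gcloud", "container", "node-pools", "update", node_pool,
        "--cluster", cluster,
        "--region", region,
        "--node-locations", ",".join(zones),
    ]


# "example-cluster" and "example-pool" are hypothetical placeholders.
command = node_locations_command(
    cluster="example-cluster",
    node_pool="example-pool",
    region="europe-west2",
    zones=["europe-west2-a", "europe-west2-b", "europe-west2-c"],
)
print(shlex.join(command))
```

Passing only the B zone to the same function would produce the command for the later rollback, once the node pool is restricted to a single zone again.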
What Didn’t Go So Well
A single engineer worked on this issue while wearing many hats (incident commander, comms, debugging) for a large part of the incident. A team response would’ve provided more support. Pages also didn’t get through to response engineers, which compounded the problem.
Action Items
Do EITHER (1) OR (2):
(1) Restrict the node pool back to only the B zone after the Black Friday event is over
(2) Move the cluster and the NFS to the A zone
GitHub issue: 2i2c
Investigate an enterprise plan which allows for regional NFS (would cost more)
GitHub issue: 2i2c
Ensure PagerDuty notifications can always get through to engineers
GitHub issue: 2i2c
Timeline
Nov 14, 2022
| Time | Event |
|---|---|
| 6:46AM | Sarah Gibson #managed_jupyte_inc_36 I don’t think that worked, because even though the resizing completed successfully in the console, k get nodes didn’t show a new node |
| 6:46AM | Sarah Gibson #managed_jupyte_inc_36 And number of nodes has remained at 0 in the console |
| 6:47AM | Sarah Gibson #managed_jupyte_inc_36 Will try deleting the hub pod |
| 6:48AM | Sarah Gibson #managed_jupyte_inc_36 Didn’t work either |
| 8:01AM | Georgiana Dolocan #managed_jupyte_inc_36 The first ZONE_RESOURCE_POOL_EXHAUSTED was at 10 AM UK time |
| 8:29AM | Sarah Gibson #managed_jupyte_inc_36 Ooooh, I think that worked!!!! |
| 8:47AM | yuvipanda #managed_jupyte_inc_36 Looks like they ran out of cloud |
| 8:51AM | yuvipanda #managed_jupyte_inc_36 I see the new nodes are in europe-west2-c |
| 2:20AM | Resolved by Sarah Gibson through the website. INCIDENT #36 LIS hub cannot scale |