LIS hub cannot scale - 2i2c Incident Reports

Field	Value
Impact Time	Nov 14 at 02:00 to Nov 14 at 08:30
Duration	6h 30m

Overview

At 10AM UK time, the B zone in Google Cloud’s europe-west2 region (London) ran out of resources to allocate. Although we deploy regional clusters, we restrict the nodes to a specific zone, colocated with our NFS Filestore, to maximise performance. As a result, the cluster could not scale up, and a “backoff after failed scale-up” message was reported to the user.

What Happened

User reported “backoff after failed scale-up” error
This is usually related to quotas, so the engineer checked those and increased the quotas for Persistent SSD disk (which was shown in red) and for CPUs. Neither change worked.
Eventually the ZONE_RESOURCE_POOL_EXHAUSTED error was found in logs
Engineers confirmed this meant Google had no more resources in the requested zone
The affected node pool was edited in the GCP console to allow node creation in the A and C zones within the region
This resolved the problem
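
The triage above spent time on quotas before the zone exhaustion error was spotted in the logs. A minimal sketch of a helper that scans exported log lines for both kinds of error code could shorten that step. The function name, the cause descriptions, and the sample log lines are hypothetical and only illustrate the idea. `ZONE_RESOURCE_POOL_EXHAUSTED` is the code named in this report, and `QUOTA_EXCEEDED` is assumed here as the matching quota-related code.

```python
import re

# Error codes to look for. The "likely cause" text is an illustrative
# assumption, not official guidance.
KNOWN_CAUSES = {
    "QUOTA_EXCEEDED": "project quota too low; raise the quota",
    "ZONE_RESOURCE_POOL_EXHAUSTED": "zone has no capacity; allow other zones",
}

CODE_PATTERN = re.compile("|".join(re.escape(code) for code in KNOWN_CAUSES))


def classify_scale_up_failures(log_lines):
    """Count occurrences of each known error code in the given log lines."""
    counts = {}
    for line in log_lines:
        for code in CODE_PATTERN.findall(line):
            counts[code] = counts.get(code, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical log lines, made up for this example.
    sample = [
        "scale-up failed: ZONE_RESOURCE_POOL_EXHAUSTED in europe-west2-b",
        "scale-up failed: ZONE_RESOURCE_POOL_EXHAUSTED in europe-west2-b",
        "hub pod restarted",
    ]
    for code, count in classify_scale_up_failures(sample).items():
        print(f"{code} x{count}: {KNOWN_CAUSES[code]}")
```

If only the zone exhaustion code shows up, raising quotas is unlikely to help, which matches what happened in this incident.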

Resolution

The incident was resolved by allowing the node pool to create nodes in the A and C zones, which weren’t exhausted. This is a temporary fix, because nodes in those zones now reach the NFS Filestore across zones, which will impact performance.
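
During the incident the node pool was edited in the GCP console. As a rough sketch of a command-line equivalent, the snippet below builds (but does not run) a `gcloud container node-pools update` invocation with the `--node-locations` flag. The cluster and node pool names are hypothetical placeholders, and the report does not say whether the B zone stayed in the list, so the zone list here is an assumption.

```python
def zones_in_region(region, letters):
    """Build full zone names such as 'europe-west2-a' from zone letters."""
    return [f"{region}-{letter}" for letter in letters]


def node_pool_update_command(cluster, node_pool, region, zones):
    """Return the gcloud argument list that sets a node pool's zones.

    The command is only constructed here; running it is left to the
    operator (for example via subprocess.run) after review.
    """
    return [
        "gcloud", "container", "node-pools", "update", node_pool,
        f"--cluster={cluster}",
        f"--region={region}",
        f"--node-locations={','.join(zones)}",
    ]


if __name__ == "__main__":
    # "example-cluster" and "example-pool" are placeholder names.
    zones = zones_in_region("europe-west2", "ac")
    command = node_pool_update_command(
        "example-cluster", "example-pool", "europe-west2", zones
    )
    print(" ".join(command))
```

Reverting to a single zone later, as the action items propose, would reuse the same helper with only the B zone in the list.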

What Didn’t Go So Well

A single engineer worked on this issue wearing many hats (incident commander, comms, debugging) for a large part of the incident. A team response would’ve provided more support. Pages also didn’t get through to response engineers, which compounded the problem.

Action Items

Do EITHER (1) OR (2)

1. Restrict node pool back to only B zone after Black Friday event is over
2. Move the cluster and the NFS to the A zone

GitHub issue: 2i2c-org/infrastructure#1944

3. Investigate an enterprise plan which allows for regional NFS (would cost more)

GitHub issue: 2i2c-org/infrastructure#1945

4. Ensure PagerDuty notifications can always get through to engineers

GitHub issue: 2i2c-org/team-compass#574

Timeline

Nov 14, 2022

Time	Event
6:46AM	Sarah Gibson #managed_jupyte_inc_36 I don’t think that worked, because even though the resizing completed successfully in the console, `k get nodes` didn’t show a new node
6:46AM	Sarah Gibson #managed_jupyte_inc_36 And number of nodes has remained at 0 in the console
6:47AM	Sarah Gibson #managed_jupyte_inc_36 Will try deleting the hub pod
6:48AM	Sarah Gibson #managed_jupyte_inc_36 Didn’t work either
8:01AM	Georgiana Dolocan #managed_jupyte_inc_36 The first `ZONE_RESOURCE_POOL_EXHAUSTED` was at 10AM UK time
8:29AM	Sarah Gibson #managed_jupyte_inc_36 Ooooh, I think that worked!!!!
8:47AM	yuvipanda #managed_jupyte_inc_36 Looks like they ran out of cloud
8:51AM	yuvipanda #managed_jupyte_inc_36 I see the new nodes are in `europe-west2-c`
2:20AM	Resolved by Sarah Gibson through the website.