
LIS hub cannot scale

| Field | Value |
| --- | --- |
| Impact Time | Nov 14 at 02:00 to Nov 14 at 08:30 |
| Duration | 6h 30m |

Overview

At 10AM UK time, the B zone in Google Cloud’s europe-west2 region (London) ran out of resources to allocate. While we deploy regional clusters, we restrict the nodes to a specific zone colocated with our NFS Filestore to maximise performance. As a result, the cluster could not scale up, and a “backoff after failed scale-up” message was reported to the user.
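The two strings that surfaced during this incident, “backoff after failed scale-up” and ZONE_RESOURCE_POOL_EXHAUSTED, can be used to tell a capacity problem apart from other spawn failures. The following is a minimal sketch of such a check, not tooling that existed at the time; the function name, the labels, and the sample event text are hypothetical, and only the two search strings come from this report.

```python
# Search strings are taken from this report; the labels are illustrative.
SIGNALS = {
    "ZONE_RESOURCE_POOL_EXHAUSTED": "zone out of capacity",
    "backoff after failed scale-up": "autoscaler backing off",
}


def classify_event(message):
    """Return the labels of all known scale-up failure signals in a message."""
    lowered = message.lower()
    return [
        label
        for needle, label in SIGNALS.items()
        if needle.lower() in lowered
    ]


if __name__ == "__main__":
    # Hypothetical event text, shaped like what a user might see on spawn.
    sample = "pod didn't trigger scale-up: 1 in backoff after failed scale-up"
    print(classify_event(sample))
```

A message that matches neither string returns an empty list, so unrelated failures are not mislabelled as capacity issues.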

What Happened

Resolution

The incident was resolved by allowing the node pool to create nodes in the A and C zones, which weren’t exhausted. This is a temporary fix, as it means the NFS Filestore is now serving nodes across zones, which will impact performance.
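A zone change of this kind can be expressed with `gcloud container node-pools update` and its `--node-locations` flag. The following is a minimal sketch that only builds and prints such a command; it is not the command that was run during the incident. The cluster name `example-cluster` and node pool name `example-pool` are hypothetical placeholders, and only the region and zone letters come from this report.

```python
import shlex

REGION = "europe-west2"  # region named in this report


def node_locations_command(cluster, node_pool, zone_suffixes, region=REGION):
    """Build a gcloud command that sets the zones a node pool may use."""
    if not zone_suffixes:
        raise ValueError("at least one zone is required")
    zones = [f"{region}-{suffix}" for suffix in zone_suffixes]
    return [
        "gcloud", "container", "node-pools", "update", node_pool,
        f"--cluster={cluster}",
        f"--region={region}",
        f"--node-locations={','.join(zones)}",
    ]


if __name__ == "__main__":
    # Hypothetical cluster and pool names; zones A and C as in the resolution.
    cmd = node_locations_command("example-cluster", "example-pool", ["a", "c"])
    print(shlex.join(cmd))
```

Printing the command rather than executing it keeps the sketch safe to run, and lets an engineer review the zone list before applying it.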

What Didn’t Go So Well

A single engineer worked on this issue while wearing many hats (incident commander, comms, debugging) for a large part of the incident. A team response would’ve provided more support. Pages also didn’t get through to response engineers, which compounded the above issue.

Action Items

Do EITHER (1) OR (2)

GitHub issue: 2i2c-org/infrastructure#1944

GitHub issue: 2i2c-org/infrastructure#1945

GitHub issue: 2i2c-org/team-compass#574

Timeline

Nov 14, 2022

| Time | Event |
| --- | --- |
| 6:46AM | Sarah Gibson #managed_jupyte_inc_36 I don’t think that worked, because even though the resizing completed successfully in the console, k get nodes didn’t show a new node |
| 6:46AM | Sarah Gibson #managed_jupyte_inc_36 And number of nodes has remained at 0 in the console |
| 6:47AM | Sarah Gibson #managed_jupyte_inc_36 Will try deleting the hub pod |
| 6:48AM | Sarah Gibson #managed_jupyte_inc_36 Didn’t work either |
| 8:01AM | Georgiana Dolocan #managed_jupyte_inc_36 The first ZONE_RESOURCE_POOL_EXHAUSTED was at 10 am UK time |
| 8:29AM | Sarah Gibson #managed_jupyte_inc_36 Ooooh, I think that worked!!!! |
| 8:47AM | yuvipanda #managed_jupyte_inc_36 Looks like they ran out of cloud |
| 8:51AM | yuvipanda #managed_jupyte_inc_36 I see the new nodes are in europe-west2-c |
| 2:20AM | Resolved by Sarah Gibson through the website. INCIDENT #36 LIS hub cannot scale |