Resource exhaustion on us-west2 for HHMI

Field | Value
Impact Time | Feb 12 at 01:48 to Feb 13 at 16:37
Duration | 1d 14h 48m 54s

Overview

The GCP zone us-west2-b repeatedly ran out of resources.

What Happened

The autoscaler failed to bring up new nodes to meet demand because resources in the GCP zone were exhausted. We saw ZONE_RESOURCE_POOL_EXHAUSTED errors in our cloud logs for the user-pod nodepool.
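As a rough illustration of how such errors could be picked out of exported logs, here is a minimal sketch. It assumes the cloud logs have been exported as one JSON object per line; the `timestamp` and `message` field names and the sample entries are hypothetical, and the function only does a plain substring match on the error code rather than relying on any particular log schema.

```python
import json
from typing import Iterable, Iterator

# Error code quoted in this report.
EXHAUSTION_CODE = "ZONE_RESOURCE_POOL_EXHAUSTED"


def find_exhaustion_entries(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed log entries whose raw text mentions the exhaustion code."""
    for line in lines:
        line = line.strip()
        # Cheap substring check first, so unrelated lines are never parsed.
        if not line or EXHAUSTION_CODE not in line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than aborting the scan
        if isinstance(entry, dict):
            yield entry


# Hypothetical sample input; real exports will have a different shape.
sample = [
    '{"timestamp": "t1", "message": "scale up failed: ZONE_RESOURCE_POOL_EXHAUSTED"}',
    '{"timestamp": "t2", "message": "node ready"}',
    "not json",
]
matches = list(find_exhaustion_entries(sample))
```

Matching on the raw line keeps the sketch independent of where the error code sits inside the entry, at the cost of being less precise than a structured query.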

Resolution

We subsequently reduced the resource requirements of the user nodes (fitting fewer pods per node) so that we’re more likely to acquire one.
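The trade-off behind this change can be sketched with some simple arithmetic. All numbers and names below are hypothetical and are not the actual HHMI node or pod sizes: fewer pods per node means each node asks the zone for less memory, but more nodes are needed for the same number of users.

```python
import math


def node_memory_gib(pods_per_node: int, pod_memory_gib: float,
                    system_reserved_gib: float = 2.0) -> float:
    """Memory a node needs to fit the given number of user pods.

    system_reserved_gib is an assumed allowance for system components.
    """
    return pods_per_node * pod_memory_gib + system_reserved_gib


def nodes_needed(user_pods: int, pods_per_node: int) -> int:
    """Number of nodes required to schedule all user pods."""
    return math.ceil(user_pods / pods_per_node)


# Hypothetical comparison: 16 user pods of 8 GiB each.
for pods_per_node in (8, 4):
    size = node_memory_gib(pods_per_node, pod_memory_gib=8.0)
    count = nodes_needed(16, pods_per_node)
    print(f"{pods_per_node} pods/node: {count} nodes of {size} GiB each")
```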

What Didn’t Go So Well

We didn’t get alerts for the exhaustion events themselves, only for the symptom (repeated failed server starts).
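One way to alert on the cause rather than the symptom would be to count exhaustion events over a sliding time window. The sketch below is an assumption about how such a rule might look, not a description of our alerting setup: the `ExhaustionAlert` class, the threshold, and the window length are all hypothetical, and events are assumed to arrive in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta


class ExhaustionAlert:
    """Fire when enough exhaustion events fall inside a sliding window."""

    def __init__(self, threshold: int = 3,
                 window: timedelta = timedelta(minutes=30)) -> None:
        self.threshold = threshold
        self.window = window
        self._events: deque = deque()

    def record(self, when: datetime) -> bool:
        """Record one event; return True if the alert should fire."""
        self._events.append(when)
        cutoff = when - self.window
        # Drop events that are older than the window.
        while self._events and self._events[0] < cutoff:
            self._events.popleft()
        return len(self._events) >= self.threshold
```

A rule of this shape would page on repeated exhaustion errors even before any user server start fails.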

Action Items

Triggered through the API.

Description: [FIRING:1] Two servers failed to start in the last 30m hhmi staging (immediate action needed)

INCIDENT #1871


Timeline

Time | Event
7:21 PM | Engineer observes that GCP is short on resources in the zone. Note added by Angus. Resolution Note: GCP ran out of resources in us-west2-b. [FIRING:1] Two servers failed to start in the last 30m hhmi staging (immediate action needed)