Resource exhaustion on us-west2 for HHMI
| Field | Value |
|---|---|
| Impact Time | Feb 12 at 01:48 to Feb 13 at 16:37 |
| Duration | 1d 14h 48m 54s |
Overview¶
GCP zone repeatedly ran out of resources
What Happened¶
The autoscaler failed to bring up new nodes to meet demand, due to exhaustion of resources in GCP zone. We saw ZONE_RESOURCE_POOL_EXHAUSTED errors in our cloud logs for the user-pod nodepool
Resolution¶
Subsequently we reduced the resource requirements of the usernodes (fitting fewer pods per node) so that we’re more likely to acquire one.
What Didn’t Go So Well¶
We didn’t get alerts for exhaustion events, only the symptom (repeated failed server starts)
Action Items¶
Triggered through the API.
Description: [FIRING:1] Two servers failed to start in the last 30m hhmi staging (immediate action needed) (View Message)
INCIDENT #1871
[FIRING:1] Two servers failed to start in the last 30m hhmi staging (immediate action needed)
Timeline¶
| Time | Event |
|---|---|
| 7:21 PM | Engineer observes that GCP is short on resources in zone Note added by Angus. Resolution Note: GCP ran out of resources in us-west2-b [FIRING:1] Two servers failed to start in the last 30m hhmi staging (immediate action needed) |