OOM kill of enforce-xfs-quota on neurohackademy

| Field | Value |
| --- | --- |
| Impact Time | Oct 16 at 15:00 to Oct 16 at 16:14 |
| Duration | 1h 14m |

Overview

The enforce-xfs-quota container was OOM-killed following quota changes.

What Happened

The hard quota for the neurohackademy prod hub was modified, which triggered an update of all projects. The generator script ran out of memory while processing a large directory (_shared).

Resolution

Running the script in a debug container succeeded and brought the filesystem attributes into agreement with their expected values. With nothing left to update, the generator loop no longer needs to reconcile any projects.
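The idea that a one-off successful run leaves nothing for the loop to reconcile can be illustrated with a minimal sketch. It is not the actual generator code: the function name `projects_to_reconcile`, the dictionary layout, and the quota values are all hypothetical, and the real script reads these attributes from the filesystem rather than from in-memory dictionaries.

```python
from typing import Dict, List


def projects_to_reconcile(expected: Dict[str, int], actual: Dict[str, int]) -> List[str]:
    """Return the directories whose observed hard quota differs from the expected one.

    Hypothetical helper: a directory missing from `actual` also counts as
    needing reconciliation.
    """
    return sorted(
        path for path, quota in expected.items()
        if actual.get(path) != quota
    )


# Hypothetical quota values in bytes, for illustration only.
expected = {"_shared": 100 * 2**30, "user-a": 10 * 2**30}
stale = {"_shared": 50 * 2**30, "user-a": 10 * 2**30}

# Before the manual run: one directory is out of date.
print(projects_to_reconcile(expected, stale))

# After the manual run: observed values match, so the loop has no work to do.
print(projects_to_reconcile(expected, expected))
```

Under these assumptions, once the observed attributes equal the expected ones the list is empty, so the loop never has to touch the large directory again until the quota next changes.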

Where We Got Lucky

The engineer noticed this when comparing configurations between clusters.

Action Items

Alert [FIRING:1] storage-quota-home-nfs pod has restarted neurohackademy prod (storage-quota-home-nfs-8544b55ddc-sk4bq same day action needed) was automatically added to this incident.

INCIDENT #1561

[FIRING:1] storage-quota-home-nfs pod has restarted neurohackademy prod (storage-quota-home-nfs-857bff8d78-7j6gn same day action needed)

Timeline

| Time | Event |
| --- | --- |
| 3:42 PM | An engineer notices the enforce-xfs-quota container restarting |
| 3:47 PM | The engineer creates a debug container and runs the generator |
| 4:00 PM | The engineer successfully runs the script and investigates why it was crashing in the main container |