OOM kill of enforce-xfs-quota on neurohackademy
| Field | Value |
|---|---|
| Impact Time | Oct 16 at 15:00 to Oct 16 at 16:14 |
| Duration | 1h 14m |
Overview
The enforce-xfs-quota container was OOM-killed following quota changes.
What Happened
The hard quota for the neurohackademy prod hub was modified, leading to an update of all projects. The generator script encountered an OOM error when processing a large directory (_shared).
Resolution
Running the script in a debug container succeeded and brought the filesystem attributes into agreement with their expected values, so the generator loop no longer needs to reconcile any projects.
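The effect of the manual run can be illustrated with a minimal sketch, assuming the generator loop compares each project's expected attribute value (for example a quota limit) with what is currently set on the filesystem and only acts on mismatches. The names `projects_to_reconcile`, `expected`, and `actual` are hypothetical; this report does not describe the real script's data structures.

```python
from typing import Dict, List


def projects_to_reconcile(expected: Dict[str, int], actual: Dict[str, int]) -> List[str]:
    """Return the names of projects whose current value differs from the expected one."""
    return sorted(
        name
        for name, value in expected.items()
        # A project missing from `actual` also counts as a mismatch.
        if actual.get(name) != value
    )
```

Under this assumption, once the filesystem state matches the expected state, the function returns an empty list and the loop has no work left to do.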
Where We Got Lucky
The engineer noticed the restarting container while comparing configurations between clusters.
Action Items
Modify the enforce-xfs-quota script to reduce the likelihood of OOM.
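One possible direction for this action item, sketched under assumptions: if the script collects every path under a large directory such as _shared into memory before processing, its memory use grows with the number of entries. The sketch below instead walks the tree lazily with `os.scandir` and an explicit stack, so only the directories still waiting to be visited are held in memory. The function names `iter_paths` and `apply_to_tree` are hypothetical and are not taken from the actual enforce-xfs-quota script, whose internals this report does not describe.

```python
import os
from typing import Callable, Iterator


def iter_paths(root: str) -> Iterator[str]:
    """Yield root and every path beneath it, one at a time."""
    yield root
    stack = [root]  # only directories still to be visited are kept here
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                yield entry.path
                # Do not follow symlinks, to avoid loops and leaving the tree.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)


def apply_to_tree(root: str, action: Callable[[str], None]) -> int:
    """Call action on each path as it is discovered and return the path count."""
    count = 0
    for path in iter_paths(root):
        action(path)
        count += 1
    return count
```

Here `action` stands in for whatever per-path work the real script performs; because paths are handed over as they are found, no full listing of the directory is ever built.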
Alert [FIRING:1] storage-quota-home-nfs pod has restarted neurohackademy prod (storage-quota-home-nfs-8544b55ddc-sk4bq same day action needed) was automatically added to this incident.
INCIDENT #1561
[FIRING:1] storage-quota-home-nfs pod has restarted neurohackademy prod (storage-quota-home-nfs-857bff8d78-7j6gn same day action needed)
Timeline
| Time | Event |
|---|---|
| 3:42 PM | An engineer notices the enforce-xfs-quota container restarting |
| 3:47 PM | The engineer creates a debug container and runs the generator |
| 4:00 PM | The engineer successfully runs the script and investigates why it was crashing in the main container |