LEAP out of GPU quota
| Field | Value |
|---|---|
| Impact Time | Aug 26 at 11:31 to Aug 26 at 13:45 |
| Duration | 2h 13m |
Overview
The LEAP project saw heavy GPU usage and ran out of GPU quota, which caused some user server starts to fail. We asked for more quota (Yuvi Panda), and this was granted.
Where We Got Lucky
What Went Well
Our new alerts for user server startup failures fired, and investigating them to prevent false positives surfaced a new issue for us.
What Didn’t Go So Well
We don’t have a specific alert for quotas that are close to being exhausted. Such an alert would have let us prevent this issue rather than resolve it after the fact.
Action Items
Add alerts for any quota that is close to 90% full (2i2c-org/infrastructure #2265)
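As a rough illustration of the check this action item describes, the sketch below flags any quota whose usage is at or above 90% of its limit. It is a minimal sketch under stated assumptions, not the alert that will actually be implemented: the `QuotaUsage` type, the `quotas_near_limit` function, and the sample readings are all hypothetical, and how the readings would be fetched from the cloud provider is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaUsage:
    """One quota reading. All field names here are hypothetical."""
    name: str     # illustrative label, e.g. "gpus"
    used: float   # amount currently consumed
    limit: float  # maximum currently granted


def quotas_near_limit(readings, threshold=0.9):
    """Return the readings whose used/limit ratio is at or above threshold."""
    flagged = []
    for r in readings:
        if r.limit <= 0:
            # A zero or negative limit cannot be expressed as a ratio; skip it.
            continue
        if r.used / r.limit >= threshold:
            flagged.append(r)
    return flagged


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    sample = [
        QuotaUsage("gpus", used=15, limit=16),
        QuotaUsage("cpus", used=40, limit=200),
    ]
    for q in quotas_near_limit(sample):
        print(f"ALERT: {q.name} at {q.used / q.limit:.0%} of quota")
```

Checking a ratio rather than an absolute count lets the same threshold apply to every quota regardless of its size, which matches the "any quota" wording of the action item.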
Timeline
Aug 26, 2025
| Time | Event |
|---|---|
| 11:21 AM | User tries to spawn a GPU server, but new nodes immediately fail due to lack of quota. Visible in logs at https:// |
| 11:31 AM | GPU server spawn fails after 10 minutes |
| 1:37 PM | Additional GPU quota requested via the Cloud Console and immediately granted |
| 1:44 PM | Community informed of this action via Freshdesk (https:// |