LEAP out of GPU quota
| Field | Value |
|---|---|
| Owner of Review Process | Yuvi Panda |
| Impact Time | Aug 26 at 11:31 to Aug 26 at 13:45 |
| Duration | 2h 13m 22s |
Overview
The LEAP project was seeing heavy GPU usage and was running out of GPU quota, triggering some user server start failures. We asked for more quota and this was granted.
Where we got lucky
What Went Well?
1. Our new alerts for user server startup failure fired, and investigating the alert (to guard against false positives) surfaced a new issue for us.
What Didn’t Go So Well?
1. We don’t have a specific alert for quotas that are close to being exhausted. Such an alert would have prevented this issue rather than resolving it after the fact.
Action Items
1. Add alerts for any quota being close to 90% full (2i2c-org/infrastructure#2265)
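The quota alert proposed in the action items (fire when any quota is close to 90% full) can be illustrated with a minimal sketch. This is not the implementation tracked in 2i2c-org/infrastructure#2265. The `Quota` record, the `quotas_near_limit` function, the quota names, and the numbers are all hypothetical, and the sketch assumes usage and limit readings have already been collected from the cloud provider by some other means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quota:
    """One quota reading. Field names are hypothetical, not a provider API."""
    name: str
    usage: float
    limit: float


def quotas_near_limit(quotas, threshold=0.9):
    """Return (name, fraction_used) for every quota at or above the threshold."""
    flagged = []
    for q in quotas:
        if q.limit <= 0:
            # With a zero limit nothing can be provisioned, so always flag it.
            flagged.append((q.name, 1.0))
            continue
        fraction = q.usage / q.limit
        if fraction >= threshold:
            flagged.append((q.name, fraction))
    return flagged


# Made-up readings for illustration only, not LEAP's real quota values.
readings = [
    Quota("gpus", usage=15, limit=16),
    Quota("cpus", usage=100, limit=500),
]
for name, fraction in quotas_near_limit(readings):
    print(f"{name} quota is {fraction:.0%} full")
```

Checking the fraction used, rather than the absolute amount remaining, lets one threshold apply to quotas of very different sizes. An alert of this kind would fire before spawns start failing instead of after.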
Timeline
Aug 26, 2025
| Time | Event |
|---|---|
| 11:21 AM | User tries to spawn a GPU server, but new nodes immediately fail due to lack of quota. Visible in logs at https:// |
| 11:31 AM | Incident #1273 (LEAP out of GPU quota) triggered through the API. Description: [FIRING:1] Server Startup Failed leap prod (take immediate action) |
| 11:31 AM | GPU server spawn fails after 10 minutes |
| 1:37 PM | Additional GPU quota requested via the Cloud Console and immediately granted |
| 1:44 PM | Community informed of this action via Freshdesk (https://2i2c.freshdesk.com/a/tickets/3795) |
| 1:45 PM | Incident #1273 resolved by Yuvi Panda through the website |