
Usage Quotas on JupyterHub

Problem statement

Since cloud compute pricing is elastic, cloud compute usage directly correlates with cost - the higher any particular user's usage, the higher the cloud cost. This is great because if you use very little, you don't pay much. This is terrible because if you use a lot, you pay a lot! JupyterHub admins need control over how much any individual user can cost them.

This can be done by providing JupyterHub admins control over how much compute any individual user can use over time.

Definition of Done (Overall, not just for initial SOW / phase 1)

When fully implemented, the solution would have the following components:

  1. Groups to which users can belong

  2. A way for admins to easily determine which users belong to which groups

  3. Compute Quotas, with a unit of Resource * Time (so CPUHours, MemoryHours, and GPUHours)

  4. A way for admins to associate particular groups with a particular compute quota

  5. A way for admins to see an overall report of users and their quota usage

  6. A way to enforce these quotas, so users cannot use more than that

  7. A way for users to be aware of their existing quota usage, and modify their behavior accordingly

Let’s break down what the workflow would look like.

Example administrator workflow

An administrator has decided through internal control mechanisms that users would need to be split into four groups:

  1. regular

  2. power

  3. dev

  4. gpu-dev

They can easily segment users into these groups in whatever authentication system they use (for example, via GitHub teams or auth0 groups/roles). Further, they have a way to specify the following quotas for each group:

| Group | MemoryHours (GiB * Hr) | GPUHours (GPU count * Hr) |
| --- | --- | --- |
| regular | 60 | 0 |
| power | 240 | 0 |
| dev | 640 | 0 |
| gpu-dev | 640 | 160 |

They can adjust these quotas as they wish over time. There is also an interface that allows them to see how users are currently using up their various quotas.

Example user workflow

The user logs into the JupyterHub and launches a server with 4 GiB of RAM to use for some work. They keep it running for about 4 hours, finish their work, and leave. They don't shut it down manually, but the idle culler reclaims it after 60 minutes of idle time, so the server was up for about 5 hours in total. This counts as 20 GiBHours (4 GiB * 5 Hr) against their quota.

They are able to take a look at their quota usage through a web interface, and notice they have already used 50 of their 60 GiBHours for this month. Realizing they need more, they reach out to an admin to ask for more quota. The admin moves them from the regular group to the power group, and they now have 240 GiBHours for this month.

When the user goes over their quota, their existing servers are shut down and new servers cannot be started. They can either contact their admin for more quota, or wait until their quota refills (a rolling 30-day period by default).

Out of scope

Questions to be answered

Phases

We split the overall work into multiple phases that build on each other, and provide value to communities at the end of each phase. Doing so allows us to make progress without striving for perfection, and makes each piece of work more manageable. Each phase should:

  1. Take into account future desired phases so we don’t end up making architectural, technical or social choices that lead us down a dead end.

  2. Be detailed enough to be roughly estimable

  3. Provide value to end users upon completion

This document primarily works on Phase 1.

Phase 1

Phase 1 is foundational, and sets up the baseline work needed for unified quota management throughout.

Definition of Done

Intentionally out of scope for this phase

These are pieces that are intentionally out of scope for this phase, but expected to be supported in a future phase. Since phase 1 is foundational, care will be taken to make sure we don't accidentally design ourselves into a corner that prevents us from achieving the following in the future.

Components

  1. Source of truth about usage. A data store that can be reliably counted upon to have accurate information on how much any user has used any supported resource over a particular period of time.

  2. A source of truth for quota configuration. Something that describes the rules for deciding how much of which resources users have access to over time.

  3. Logic to check ‘is the user allowed to do this?’, where ‘this’ is ‘start a new server’ or ‘keep this server running’. This is the mechanism that applies the policy described in (2) to the data available from (1).

  4. Hooks in JupyterHub that reach out to (3) during server start, and if the user isn’t allowed, provide a useful error message

  5. Same as (4) but for dask

  6. Same as (4), but on an ongoing basis, so that running servers are shut down when users go over quota

  7. A simple way for users to know what quota they have, and how much is left.

Note: Components 5 and 6 are not under consideration for Phase 1.

Phase 1: Design considerations

What should source of usage truth be? (component 1)

The obvious answer is prometheus. But right now, prometheus is not part of the critical path of anything - if prometheus is down, users don’t actually see any issues. We can also wipe prometheus data and not see problems anywhere other than in reporting. Since prometheus as we have it deployed is pretty big, we should not put that in the critical path.

Our two options here are:

  1. Run a different prometheus instance that only collects an allow-listed set of metrics that we desire

  2. Write our own collector that keeps state in a database.

Does this data source need to be HA? It needs to be as available as the hub.

How should quota rules be represented? (component 2)

To start with, we only want a mapping of group name -> quota for each resource. This can live in one of two places:

  1. YAML config in a repository

  2. a GUI where admins can set things up

We can start with (1) and progress to (2) as desired.
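
As a sketch of what option (1) might look like, here is a hypothetical traitlets-style rendering of the group -> quota mapping (an equivalent YAML file would mirror it one-to-one; the class and trait names `QuotaConfig`, `quotas`, and `window_days` are illustrative, not a settled schema):

```python
# Hypothetical traitlets config (jupyterhub_config.py style, where `c` is
# the injected config object). The schema here is illustrative only; an
# equivalent YAML file in a repository would carry the same mapping.
c.QuotaConfig.quotas = {
    "regular": {"memory_gib_hours": 60,  "gpu_hours": 0},
    "power":   {"memory_gib_hours": 240, "gpu_hours": 0},
    "dev":     {"memory_gib_hours": 640, "gpu_hours": 0},
    "gpu-dev": {"memory_gib_hours": 640, "gpu_hours": 160},
}
# Rolling window over which usage is summed before comparing to the limit
c.QuotaConfig.window_days = 30
```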

Where should logic for checking quotas live? (component 3)

Since this has to be used from multiple different components, there are two ways to do this:

  1. A service architecture, where we implement a single service that other components (like hub, dask gateway, etc) can talk to over HTTP to get their answers

  2. A library-component architecture, where we implement a python package that can talk to (2) and (1) to figure out the answer. This will be in turn used by (4), (5), (6) and (7) as a library.

Both have pros and cons, and the answer to this probably also depends on (2).

How should JupyterHub reach out to (3) during server start? (component 4)

This needs to happen before the kubespawner kicks in and asks for resources. Pre-spawn hooks are available in JupyterHub for exactly this purpose.
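
A minimal sketch of what wiring this up could look like, assuming JupyterHub's `Spawner.pre_spawn_hook` configurable (which accepts coroutines) and a hypothetical `check_quota` call standing in for the decision library from Deliverable 2:

```python
# Sketch of a quota check as a pre-spawn hook in jupyterhub_config.py.
# Spawner.pre_spawn_hook is real JupyterHub config; quota_lib / check_quota
# are hypothetical stand-ins for the library built in Deliverable 2.
from quota_lib import check_quota  # hypothetical package

async def quota_pre_spawn_hook(spawner):
    allowed = await check_quota(
        username=spawner.user.name,
        groups=[g.name for g in spawner.user.groups],  # orm-style groups assumed
        mem_request=spawner.mem_limit,  # resources this spawn would consume
        cpu_request=spawner.cpu_limit,
    )
    if not allowed:
        # Raising here aborts the spawn; the message is surfaced to the user
        raise RuntimeError(
            "You have used up your compute quota for this period. "
            "Contact your hub admin to request more."
        )

c.Spawner.pre_spawn_hook = quota_pre_spawn_hook
```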

How can users know what quota they have and how much is left? (component 7)

This should be made available as a JupyterHub service providing a web page that users can check. It should only show them their own quota information. This may also be available via an API that gets integrated into JupyterLab in the future.

Summary of discussion between Yuvi & Jenny

  1. Use a separate tuned prometheus as source of usage data.

  2. Keep mapping of quota to groups as YAML

  3. Build an async python library that can make quota decisions, with an eye to eventually turning this into a service if necessary.

  4. Find or build hooks in JupyterHub to be able to check quota before a spawn and provide a message to the user if needed.

  5. Build a JupyterHub service that lets users check their allowed quota and existing usage.

Phase 1: Deliverables

Deliverable 1. Set up a prometheus specifically for quota system use

Overview

Prometheus is a time series database that we will be using as our 'source of truth' for answering 'how much has user X used of resource Y in the last Z time period?'. Prometheus 'pulls' this information from exporters. The first deliverable is making sure we are reliably collecting all the data we need to enforce quotas, leveraging existing software wherever possible.
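
To make the 'source of truth' role concrete, here is a sketch of the kind of query this prometheus would need to answer. The metric and the `username` label (derived from pod annotations) are assumptions; pinning down the real names is the PromQL exploration work in Deliverable 2.

```python
# Sketch: per-user GiB-hours over a rolling 30 days via the Prometheus
# HTTP API. Metric and label names are illustrative assumptions.
import requests

PROMETHEUS_URL = "http://quota-prometheus:9090"  # hypothetical address

# Hourly memory samples summed over 30 days approximate byte-hours;
# dividing by 2^30 converts to GiB-hours.
QUERY = (
    "sum by (username) ("
    "  sum_over_time(container_memory_working_set_bytes[30d:1h])"
    ") / 2^30"
)

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("username"), series["value"][1], "GiB-hours")
```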

Definition of done

Estimates (56-68h)

Risk factors

  1. Metric for mapping users to resource usage does not exist and needs JupyterHub or custom exporter work

    • Mitigation: A quick check shows JupyterHub already sets the needed annotations. Picking them up requires Infra Eng work in prometheus only; no App Eng work is needed.

People needed

  1. App Eng to decide what metrics are needed, and build additional exporters if necessary

  2. Infra Eng to set up the exporters and prometheus in a production-ready way

Notes

This prometheus server is now on the critical path to server startup, unlike the prometheus server we already run (which is only used for reporting). We need to choose a fallback in case this prometheus server is down: either fall back to allowing everyone, or to blocking everyone (Yuvi's preferred approach). We can make this choice on a per-hub basis.
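
A sketch of how that per-hub choice could be expressed in code (names illustrative, fail-closed as the default per the above):

```python
# Hypothetical per-hub fallback policy for when the quota prometheus is
# unreachable. fail_closed=True blocks all spawns (the preferred default);
# fail_closed=False lets everyone through until prometheus recovers.
async def allowed_with_fallback(check_fn, *, fail_closed=True, **request):
    try:
        return await check_fn(**request)
    except ConnectionError:
        # Prometheus is down: apply the per-hub policy instead of erroring
        return not fail_closed
```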

Demo reels

  1. Prometheus with all metrics we care about running in production (GIF)

Deliverable 2. Build a python library to make quota decisions

Overview

The core of the quota system consists of:

  1. A way to declaratively specify quotas, consisting of:

    • A resource (like RAM or CPU)

    • A rolling time duration (e.g., the last 30 days)

    • A limit expressed in terms of Resource * Time (GiBHours or CPUHours)

  2. Based on this configuration, a way to ask ‘this user wants to use X units of resource Y. Are they allowed to do it?’

We will implement an async-friendly python library that can answer this question; a sketch of a possible API follows the lists below. It'll take the following inputs:

  1. Quota configuration (as YAML / Traitlets)

  2. Access to a prometheus server (Deliverable 1)

  3. The name of the user

  4. The list of groups the user is part of

  5. What resources (RAM / CPU) they are requesting

And provide as output:

  1. A yes/no on whether they are allowed to make this request

To do this, it would need to:

  1. Figure out exactly what quotas apply to this particular user, based on the groups they belong to and the quota configuration

  2. Reach out to the prometheus server to figure out their usage

  3. Perform logical checks to figure out if they have quota left or not
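
To make the inputs, outputs, and steps above concrete, here is a hypothetical sketch of what the library's public API could look like. Every name here (including the config schema and the 'most generous group wins' policy) is illustrative; defining the real schema and semantics is part of this deliverable.

```python
# Hypothetical API sketch for the quota decision library. All names are
# illustrative, not a settled design.
from dataclasses import dataclass

@dataclass
class QuotaDecision:
    allowed: bool
    reason: str    # human-readable, reusable in error messages (Deliverable 5)
    used: float    # e.g. GiB-hours consumed in the current rolling window
    limit: float   # the limit that applies to this user

class QuotaChecker:
    def __init__(self, config: dict, prometheus_url: str):
        self.config = config                # parsed YAML / traitlets quota config
        self.prometheus_url = prometheus_url

    async def is_allowed(
        self, username: str, groups: list[str], mem_gib: float, cpu: float
    ) -> QuotaDecision:
        # 1. Figure out which quota applies: one plausible policy is to take
        #    the most generous limit among the user's groups.
        limits = [
            self.config["quotas"][g]["memory_gib_hours"]
            for g in groups
            if g in self.config["quotas"]
        ]
        limit = max(limits, default=0)
        # 2. Reach out to prometheus for usage in the rolling window
        used = await self._usage_gib_hours(username)
        # 3. Perform the logical check: is there quota left?
        allowed = used < limit
        reason = f"{used:.0f} of {limit} GiB-hours used this period"
        return QuotaDecision(allowed, reason, used, limit)

    async def _usage_gib_hours(self, username: str) -> float:
        # Placeholder: in the real library this runs a PromQL query
        # against self.prometheus_url (see Deliverable 1).
        raise NotImplementedError
```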

Definition of Done

Estimates

| Task | Lower Estimate | Upper Estimate |
| --- | --- | --- |
| PromQL exploration | 4h | 4h |
| Quota schema definition | 10h | 10h |
| JupyterHub API integration | 4h | 4h |
| Core quota logic | 24h | 32h |
| Integration testing infrastructure + setup | 24h | 32h |
| Documentation | 12h | 24h |
| Package publishing | 4h | 4h |
| **Total** | **82h** | **110h** |

Notes

  1. This library should not be tied to any specific kubernetes concepts. That allows it to be used in the future outside either JupyterHub or kubernetes as needed, drastically improving chances of it being accepted upstream.

  2. By writing it in async python from the start, we can use it in-line in all the places we need (JupyterHub hooks, dask-gateway, etc). It can also be turned into a network based service if needed.

  3. This is security sensitive code and should be treated as such.

  4. The quota configuration schema should also be usable to provide more direct information about quota usage for Deliverable 3

Risk factors

This library provides critical functionality to enable usage quotas. If this piece does not work, we will have to rethink our entire technical approach. Ways this could fail:

  1. Prometheus data is not reliable enough to make quota logic decisions -> rethink deliverable 1

  2. Quota decisions cannot be made in real-time, so there will be potential overages we need to explain

  3. There is a certain amount of exploratory work here that could snowball effort estimates

People needed

  1. App engineers to build the library

Demo Reel

  1. Command-line example showing whether a user's quota request for a particular size server would be allowed or not (GIF?)

Deliverable 3: JupyterHub service for users to check their own quota

Overview

End users need a way to:

  1. Know what quota limitations they are subject to

  2. See how much of their quota they have used so far

We will build a web application that is a JupyterHub service for users to check this for themselves.
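
JupyterHub's standard services mechanism is a natural fit for hosting this. A sketch of registering such a service in `jupyterhub_config.py` (the services list is standard JupyterHub config; the service name, port, and command are placeholders):

```python
# Registering the quota page as a JupyterHub-managed service. The name,
# URL, and command are hypothetical; the services config itself is a
# standard JupyterHub mechanism.
c.JupyterHub.services = [
    {
        "name": "quota-report",
        "url": "http://127.0.0.1:10101",
        "command": ["python", "-m", "quota_report"],  # hypothetical module
    }
]
```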

Intentionally out of Scope

For this deliverable, we are leaving the following as intentionally out of scope:

  1. Visualizations of usage over time. Users will only get numbers, no charts or graphs.

  2. No integration with JupyterLab; this would be a separate web page users would need to go to.

  3. No integration with storage quotas for now, only CPU / Memory.

All these are possible features to be added in future phases, so our design needs to accommodate them.

Definition of Done

Estimates

| Task | Lower Estimate | Upper Estimate |
| --- | --- | --- |
| Setting up the base JupyterHub service with auth | 12h | 12h |
| Setting up the base frontend with dependencies & packaging | 12h | 12h |
| Design and mockup of UI | 10h | 10h |
| Build backend application | 24h | 32h |
| Build frontend application | 24h | 32h |
| Documentation | 8h | 8h |
| Package publishing | 4h | 4h |
| **Total** | **94h** | **110h** |

People needed

Notes

  1. This should be exactly as generic as the library in deliverable 2. It could possibly surface warnings when a user is close to their quota.

  2. Should have decent explanations for users to understand how the quota is calculated

  3. Should re-use as much code as possible from deliverable 2.

  4. Users should only be able to see their own quota and usage - this is a security boundary.

  5. This is going to be a python backend (tornado) providing an API to be consumed by a JS frontend (a sketch follows this list)
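
A sketch of what that backend handler could look like, using JupyterHub's real service-auth machinery (`HubOAuthenticated`); the `checker` object and its `usage()` method are hypothetical stand-ins for the Deliverable 2 library:

```python
# Sketch of the tornado API handler. HubOAuthenticated is JupyterHub's
# service-auth mixin; "checker" and usage() are hypothetical.
from jupyterhub.services.auth import HubOAuthenticated
from tornado import web

class QuotaHandler(HubOAuthenticated, web.RequestHandler):
    @web.authenticated
    async def get(self):
        # current_user is the hub's model of the *authenticated* user.
        # No username parameter is accepted, so users can only ever query
        # themselves - the security boundary from note 4 above.
        user = self.current_user
        used, limit = await self.settings["checker"].usage(user["name"])
        self.write({
            "user": user["name"],
            "used_gib_hours": used,
            "limit_gib_hours": limit,
        })
```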

Demo reel

  1. Earthscope staging hub link where users can see how much quota they have used (real live demo)

Deliverable 4: Improve the ‘Spawn Progress’ page on JupyterHub

Overview

When a user attempts to start a JupyterHub server after making a profile selection, they are shown a 'progress' page with status messages about how the spawn is going. In quota-enabled systems, this is a great place in the UX for two things:

  1. If they have enough quota, to show how much quota they have used.

  2. If they don’t have enough quota, clearly show them a user friendly error message that tells them where to go next.

The current UX of this page is such that most users are observed to ignore it, primarily due to the following problems:

  1. The progress messages shown are directly from the underlying system (Kubernetes), and make no sense to most users. Do you know what `2025-03-19T00:51:47.961011Z [Warning] 0/3 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 1 No preemption victims found for incoming pod, 2 Preemption is not helpful for scheduling.` means? It actually means `Determined that a new node is required`, yet it looks far scarier!

  2. Some of the messages actually mean the spawn has failed, but there’s no clear indication which messages indicate that vs which messages simply are markers of progress.

  3. The UX of the progress bar itself is pretty janky, with raw ISO-formatted timestamps shown.

We want to improve the UX of this page so it's useful and users pay attention to it - this prevents 'surprises' as users run out of quota. It should also be made customizable enough that it can be used for quota purposes. This requires upstream contributions to JupyterHub.

This isn’t a complete overhaul of the page - only an incremental improvement + some customization hooks.
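
For reference, the mechanism these customization hooks would build on is JupyterHub's `Spawner.progress` async generator, which feeds events to this page. A purely illustrative sketch of injecting a quota message through it (the actual deliverable is proper upstream hooks, not this subclass; the numbers shown are examples):

```python
# Illustrative only: Spawner.progress (a real JupyterHub async generator
# API) is the mechanism the customization hooks would build on. Each
# yielded event updates the spawn progress page.
from kubespawner import KubeSpawner  # assumes kubespawner, as on our hubs

class QuotaAwareSpawner(KubeSpawner):
    async def progress(self):
        # Inject a human-friendly quota note ahead of the raw k8s events
        yield {
            "progress": 1,
            "message": "You have used 50 of your 60 GiB-hours this period.",
        }
        async for event in super().progress():
            yield event
```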

Definition of Done

Risk Factors

Estimates

| Task | Lower Estimate | Upper Estimate |
| --- | --- | --- |
| UX mockup | 8h | 8h |
| Human readable progress messages | 24h | 32h |
| Allowing hooks to inject progress messages | 24h | 32h |
| Allowing some progress messages to terminate spawn | 24h | 32h |
| Upstream coordination overhead | 24h | 32h |
| **Total** | **104h** | **136h** |

People needed

Notes

Demo Reel

  1. More UX friendly spawn page for everyone on the earthscope production hub

Deliverable 5: Integrate Library from (2) into JupyterHub spawning process

Overview

During the JupyterHub spawning process, we know what resources (memory and CPU) the user is requesting. Based on the quota configuration, if the spawn should be allowed:

  1. We note how much quota the user has consumed, and how much they have left, as a 'progress message'

If it should be denied:

  1. The server is not started

  2. A configurable message is shown to them about this denial, with information on how they can request more quota (by being added to different groups). A sketch of this flow follows below.
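
A sketch of this allow/deny flow, reusing the hypothetical `QuotaDecision` object sketched under Deliverable 2 (the message template, trait-like constant, and function name are all illustrative):

```python
# Sketch of the integration: the decision object from the Deliverable 2
# library drives either a progress note or a configurable denial message.
from tornado import web

DENIAL_TEMPLATE = (
    "Server start denied: {reason}. "
    "Ask your hub admin about being added to a group with a larger quota."
)

async def enforce_quota(spawner, checker):
    decision = await checker.is_allowed(
        username=spawner.user.name,
        groups=[g.name for g in spawner.user.groups],  # orm-style groups assumed
        mem_gib=4.0,  # in practice, derived from the profile the user selected
        cpu=2.0,
    )
    if decision.allowed:
        # Surfaced on the progress page (see Deliverable 4)
        spawner.log.info(decision.reason)
    else:
        # Aborts the spawn with a user-visible message
        raise web.HTTPError(403, DENIAL_TEMPLATE.format(reason=decision.reason))
```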

Definition of done

Estimates

| Task | Lower Estimate | Upper Estimate |
| --- | --- | --- |
| Integration work | 24h | 32h |
| Documentation | 8h | 8h |
| **Total** | **32h** | **40h** |

People needed

  1. App eng to build out all the hooks and functionality

  2. Infrastructure eng to roll this out to our infrastructure

Demo Reel

  1. Full quota system working and testable in the staging earthscope hub

Deliverable 6: Production roll-out

Overview

So far, we would have deployed to staging clusters and tested. We will need to roll the quota system out to the production hubs (starting with Earthscope) and support it there.

Definition of Done

Estimates

| Task | Lower Estimate | Upper Estimate |
| --- | --- | --- |
| Coordination with Earthscope | 8h | 8h |
| Support and monitoring | 8h | 40h |
| **Total** | **16h** | **48h** |

Risk factors

People needed

  1. App eng to fix any issues that may arise

  2. Infrastructure eng to support app engineers in this process

Demo Reel

  1. Full quota system working and testable in the earthscope production hub

Cloud vendor considerations

There should be no cloud vendor specific parts here - everything should work across cloud vendors.

People working on this

This project would require capacity from:

  1. Tech Lead

  2. Infrastructure Engineer

  3. App Engineer

In addition, it would also consume cycles from our project management folks.

Timeline

Some of this work can be done in parallel, depending on availability of capacity.

Based on when we start this, I roughly expect 2-4 months to drive this to completion.