Improvements to Conflict Resolution in nbgitpuller

Background

nbgitpuller is a Jupyter Server extension that exposes a mechanism for synchronising remote content with the server’s local file-system. In the wild, its primary application lies in connecting JupyterHub users with hub-adjacent content through a simple, distributable, user-friendly interface (URLs). Because it pulls remote content into an individual user’s server, it is often used to separate content from compute-environment definitions in contexts like JupyterHub and BinderHub, where rebuilds of the single-user environment are costly and slow.

There are two main personas that use nbgitpuller:

Link-author: People creating content that can be shared via an nbgitpuller link.
Link-consumer: People that use an nbgitpuller link to access shared content.

Between fetching remote content and merging conflicts with local edits, there are many ways in which nbgitpuller users can encounter errors during normal operation. Fixing these errors is the responsibility of neither link-authors nor link-consumers. Instead, there is a third persona:

nbgitpuller expert: People with the technical expertise to debug problems encountered during nbgitpuller usage.

Every problem that requires the intervention of an nbgitpuller expert introduces a dependency upon the availability of the expert, limiting the scalability of nbgitpuller. Reducing the necessity of this role, e.g. by improving conflict resolution, represents a desirable goal for the project.

Technical details

nbgitpuller operates as a Jupyter Server extension that exposes a number of request handlers:

GET /git-pull/api — an API service endpoint
GET /git-pull/ — a user-facing UI for triggering and following a git pull operation.

The UI served at /git-pull/ communicates with the API backend using server-sent events.
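To make the server-sent events channel more concrete, the following is a minimal sketch of how a client might decode an event stream into messages. The framing (`data:` lines terminated by a blank line) follows the server-sent events format. The JSON payload fields (`phase`, `output`) and the sample stream are assumptions made for illustration, not a specification of what nbgitpuller emits.

```python
import json


def parse_sse(stream_text):
    """Yield the decoded JSON payload of each server-sent event in stream_text."""
    data_lines = []
    for line in stream_text.splitlines():
        if line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format drops a single leading space after the colon.
            if value.startswith(" "):
                value = value[1:]
            data_lines.append(value)
        elif line == "" and data_lines:
            # A blank line ends the event; multiple data lines join with "\n".
            yield json.loads("\n".join(data_lines))
            data_lines = []
        # Other fields (event:, id:, retry:) and comments are ignored here.


# Hypothetical stream: the field names are illustrative assumptions.
SAMPLE = (
    'data: {"phase": "syncing", "output": "Fetching origin"}\n'
    "\n"
    'data: {"phase": "finished"}\n'
    "\n"
)

events = list(parse_sse(SAMPLE))
```

A front-end would typically consume such a stream incrementally rather than from a complete string; the parsing rule is the same.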

When used alongside a JupyterHub, there is a strong separation of concerns between provisioning of the compute environment (JupyterHub and e.g. KubeSpawner) and provisioning of the file-system (nbgitpuller). Using the /hub/user-redirect/ endpoint, content authors can craft user-agnostic URLs that invoke the nbgitpuller service.
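As an illustration of such a user-agnostic URL, the sketch below assembles a link from its parts. The hub address and repository are hypothetical. Only the `repo` parameter appears in this document; the `branch` and `urlpath` parameter names are assumptions that should be checked against the nbgitpuller documentation.

```python
from urllib.parse import urlencode


def make_nbgitpuller_link(hub_url, repo, branch=None, urlpath=None):
    """Build a link that routes through /hub/user-redirect/ to /git-pull."""
    params = {"repo": repo}
    # Assumed optional parameters; omitted from the query when not given.
    if branch:
        params["branch"] = branch
    if urlpath:
        params["urlpath"] = urlpath
    return f"{hub_url.rstrip('/')}/hub/user-redirect/git-pull?{urlencode(params)}"


link = make_nbgitpuller_link(
    "https://hub.example.org",            # hypothetical hub
    "https://github.com/example/course",  # hypothetical repository
    branch="main",
    urlpath="lab/tree/course/index.ipynb",
)
print(link)
```

Because the link contains no username, the same URL can be distributed to every link-consumer; the hub resolves it to each user's own server.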

The nbgitpuller URL handler (e.g. GET /git-pull?repo=...) implements several operations to fulfil a request:

1. Remote content is fetched from a Git repository scoped to a specific branch (fetch).
2. Fetched content is merged with the local file-system, resolving any conflicts in an opinionated manner to minimise user-input (merge).
3. The user is redirected to the given URL path once (1) and (2) have been completed (open).

Deliverables

Identify common `nbgitpuller` merge errors

Overview

After fetching content from a content-source, nbgitpuller is responsible for unifying the remote content with the local user’s filesystem (see (2) above). Where the link-consumer and link-author have each edited the same file, it may be possible to account for both sets of changes in a lossless merge operation. However, there are some situations in which it is not possible to merge both the remote and local changes in a conflict-free manner. On these occasions, nbgitpuller should resolve conflicts by preferring the remote content, whilst also preserving the link-consumer’s edits.
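The policy described above can be sketched for a single file as follows. This is an illustration under stated assumptions, not nbgitpuller's implementation: it compares whole-file contents rather than attempting a line-level merge, and the `resolve_file` helper and the timestamped backup naming scheme are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def resolve_file(name, base, local, remote, now=None):
    """Return {path: content} for one file under a remote-preferring policy.

    base, local and remote are the file contents (str) at the common
    ancestor, in the user's directory and in the fetched content; None
    means the file is absent on that side.
    """
    if local == remote or remote == base:
        # Both sides agree, or only the link-consumer edited: keep local.
        return {name: local} if local is not None else {}
    if local == base:
        # Only the link-author edited: take the remote content.
        return {name: remote} if remote is not None else {}
    # Both sides changed differently: prefer remote, keep local as a backup.
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%d%H%M%S")
    path = PurePosixPath(name)
    backup = str(path.with_name(f"{path.stem}_{stamp}{path.suffix}"))
    result = {backup: local} if local is not None else {}
    if remote is not None:
        result[name] = remote
    return result


fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
outcome = resolve_file("notes.ipynb", "v0", "consumer edit", "author edit", fixed)
```

In the conflicting case, `outcome` holds the remote content under the original name and the link-consumer's version under a timestamped name, so neither set of edits is lost.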

Such failures are often content-dependent, and there are anecdotal reports of nbgitpuller failing to properly resolve merge conflicts in the wild. They are difficult for link-authors and link-consumers to resolve, and often require intervention from an nbgitpuller expert. Through inspection of logs from existing (large) nbgitpuller deployments, we will learn more about these kinds of failures in real-world deployments.

Definition of done

An array of structured nbgitpuller events has been generated from the logs of existing large JupyterHub deployments.
A set of common error types has been established from analysis of nbgitpuller event information.
A set of reproducible merge failures has been identified.
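As a rough sketch of the first two items, the code below turns raw log lines into structured events and counts error types. The log line shape, the sample lines and the error-type names are all hypothetical; real deployment logs will need their own patterns. The message fragments being matched are taken from typical Git output.

```python
import re
from collections import Counter

# Hypothetical log line shape, assumed for illustration only.
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) nbgitpuller "
    r"user=(?P<user>\S+) repo=(?P<repo>\S+) (?P<message>.*)$"
)

# Hypothetical error types, keyed on fragments of Git's own messages.
ERROR_PATTERNS = {
    "merge_conflict": re.compile(r"CONFLICT|Automatic merge failed"),
    "auth_failure": re.compile(r"Authentication failed|Permission denied"),
    "missing_remote": re.compile(r"Repository not found|Could not resolve host"),
}


def classify(event):
    """Return an error type for ERROR-level events, or None otherwise."""
    if event["level"] != "ERROR":
        return None
    for name, pattern in ERROR_PATTERNS.items():
        if pattern.search(event["message"]):
            return name
    return "unclassified"


def parse_events(lines):
    """Convert matching log lines into dicts; skip lines that do not match."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        event = match.groupdict()
        event["error_type"] = classify(event)
        events.append(event)
    return events


# Made-up sample lines, not taken from any real deployment.
SAMPLE_LOG = [
    "2024-01-02T03:04:05Z INFO nbgitpuller user=alice repo=https://example.org/course.git Fetching origin",
    "2024-01-02T03:04:06Z ERROR nbgitpuller user=alice repo=https://example.org/course.git CONFLICT (content): Merge conflict in index.ipynb",
    "2024-01-02T03:05:00Z ERROR nbgitpuller user=bob repo=https://example.org/private.git fatal: Authentication failed",
    "unrelated proxy line",
]

events = parse_events(SAMPLE_LOG)
error_counts = Counter(e["error_type"] for e in events if e["error_type"])
```

Keeping an explicit "unclassified" bucket makes it visible how much of the error volume the current set of error types fails to explain.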

Estimates

Task	Lower Estimate	Upper Estimate
Generate structured events from raw logs	6h	10h
Analyse nbgitpuller events to identify common error types	4h	8h
Open pull-request and shepherd through to merge	4h	8h
Additional learning and refinement	2h	6h
Total	16h	32h

Implement fixes to Git-based merge routines

Overview

Following the work in the first deliverable, a set of reproducible merge failures will have been identified. Subsequently, work may be done to reduce the likelihood of these kinds of failures; by hardening nbgitpuller against failure during nominal usage, it may be possible to eliminate, or at least reduce the importance of, the nbgitpuller expert persona.

Alongside implementing fixes for these newly identified merge-failure scenarios, work should be done to embed reproducible test-cases in the nbgitpuller test suite.

Definition of done

A set of reproducible merge failures has been prepared as a test-suite.
The extended nbgitpuller test suite passes.
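A reproducible merge failure can be captured as a small, self-contained test. The sketch below is one possible shape, assuming `git` is available on the path. It builds a throwaway repository in which a local and a remote edit conflict, checks that a plain merge fails, and then checks that a remote-preferring merge succeeds. The helper names are hypothetical and this is not nbgitpuller's own test harness. `-X theirs` is used only because it matches the "prefer remote" policy stated earlier; which strategy options nbgitpuller itself passes is not stated in this document.

```python
import pathlib
import subprocess
import tempfile


def git(repo, *args):
    """Run git inside repo with a fixed identity; return the completed process."""
    return subprocess.run(
        ["git", "-c", "user.name=Test", "-c", "user.email=test@example.org", *args],
        cwd=repo,
        capture_output=True,
        text=True,
    )


def commit_file(repo, name, content, message):
    """Write content to name and commit it on the current branch."""
    (pathlib.Path(repo) / name).write_text(content)
    git(repo, "add", name)
    git(repo, "commit", "-q", "-m", message)


def make_conflicting_repo(repo):
    """Create `main` (local edit) and `upstream` (remote edit) that conflict."""
    git(repo, "init", "-q")
    # Name the initial branch explicitly, independent of the git version.
    git(repo, "symbolic-ref", "HEAD", "refs/heads/main")
    commit_file(repo, "notes.txt", "original\n", "base")
    git(repo, "checkout", "-q", "-b", "upstream")
    commit_file(repo, "notes.txt", "link-author edit\n", "remote change")
    git(repo, "checkout", "-q", "main")
    commit_file(repo, "notes.txt", "link-consumer edit\n", "local change")


def test_conflict_is_reproducible():
    with tempfile.TemporaryDirectory() as repo:
        make_conflicting_repo(repo)

        # A plain merge must fail and leave unmerged index entries behind.
        plain = git(repo, "merge", "--no-edit", "upstream")
        assert plain.returncode != 0
        assert git(repo, "ls-files", "--unmerged").stdout.strip() != ""
        git(repo, "merge", "--abort")

        # Preferring the remote side for conflicting hunks must succeed.
        preferred = git(repo, "merge", "--no-edit", "-X", "theirs", "upstream")
        assert preferred.returncode == 0
        content = (pathlib.Path(repo) / "notes.txt").read_text()
        assert content == "link-author edit\n"
```

Each failure identified from the log analysis could be expressed as another `make_..._repo` fixture, so that a fix is accepted only once the corresponding test passes.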

Estimates

Task	Lower Estimate	Upper Estimate
Create reproducible tests for existing merge-failures	8h	12h
Implement fixes for these test failures	12h	20h
Open pull-request and shepherd through to merge	4h	8h
Additional learning and refinement	2h	6h
Total	26h	46h

Additional overheads

In addition to per-deliverable work, there is an up-front cost that each developer may incur:

Task	Lower Estimate	Upper Estimate
Become familiar with nbgitpuller architecture	4h	8h
Set up development environment	2h	3h
Total	6h	11h

We will assume that two separate developers incur this cost.
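For convenience, the per-deliverable totals and the overhead can be combined into an overall range. The short calculation below uses only the figures from the tables in this document; that the overhead is simply added to the deliverable totals is an assumption.

```python
# (lower, upper) hour estimates copied from the tables in this document.
deliverables = {
    "Identify common nbgitpuller merge errors": (16, 32),
    "Implement fixes to Git-based merge routines": (26, 46),
}
overhead_per_developer = (6, 11)
developers = 2  # stated assumption: two developers incur the overhead

# Assumption: overhead is added on top of the deliverable totals.
lower = sum(lo for lo, _ in deliverables.values()) + developers * overhead_per_developer[0]
upper = sum(hi for _, hi in deliverables.values()) + developers * overhead_per_developer[1]
print(f"{lower}h to {upper}h")  # prints "54h to 100h"
```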

Intentionally out of scope

For this statement of work, we are leaving the following as intentionally out of scope:

Use of alternative conflict resolution mechanisms besides Git.

Relevant GitHub Issues and external links

Listed below are pertinent GitHub Issues open in the jupyterhub/nbgitpuller repository, and other external resources:

People working on this

This project would require capacity from:

App Engineer (1 implementation, 1 review)

Timeline

Identify common `nbgitpuller` merge errors