
Using CryoCloud S3 Scratch Bucket

CryoCloud JupyterHub has a preconfigured S3 “Scratch Bucket” that automatically deletes files after 7 days. This is a great resource for experimenting with large datasets and working collaboratively on a shared dataset with other CryoCloud users.

Access the scratch bucket

The CryoCloud scratch bucket is hosted at s3://nasa-cryo-scratch. CryoCloud JupyterHub automatically sets an environment variable, SCRATCH_BUCKET, that appends your GitHub username as a suffix to the S3 URL. This keeps track of file ownership, keeps the bucket organized, and prevents users from overwriting each other's data!

We’ll use the S3FS Python package, which provides a nice interface for interacting with S3 buckets.

import os
import s3fs
import fsspec
import boto3
import xarray as xr
import geopandas as gpd
# My GitHub username is `scottyhq`
scratch = os.environ['SCRATCH_BUCKET']
scratch 
's3://nasa-cryo-scratch/scottyhq'
# Here you see I previously uploaded files
s3 = s3fs.S3FileSystem()
s3.ls(scratch)
['nasa-cryo-scratch/scottyhq/ATL03_20230103090928_02111806_006_01.h5', 'nasa-cryo-scratch/scottyhq/IS2_Alaska.parquet', 'nasa-cryo-scratch/scottyhq/Notes.txt', 'nasa-cryo-scratch/scottyhq/example', 'nasa-cryo-scratch/scottyhq/example_ATL03', 'nasa-cryo-scratch/scottyhq/grandmesa-sliderule.parquet']
# But you can set a different S3 object prefix to use:
scratch = 's3://nasa-cryo-scratch/octocat-project'
s3.ls(scratch)
[]
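
Because the bucket is shared and objects are automatically deleted after 7 days, it can be handy to check what is already stored under a prefix and how much space it uses. Here is a minimal sketch using standard s3fs methods (the object name is just an example; results depend on what you have uploaded):

# Check whether a single object exists under your prefix (example name only)
s3.exists(f'{scratch}/Notes.txt')
# Total size, in bytes, of everything stored under your prefix
s3.du(scratch)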

Uploading data

Storing data in S3 buckets is great because this storage offers very high network throughput. If many users simultaneously access the same file on a spinning networked hard drive (/home/jovyan/shared), performance can be quite slow. S3 performs much better in such cases.

Single file

# I'm working with this file downloaded from NSIDC:
local_file = '/tmp/ATL03_20230103090928_02111806_006_01.h5'

remote_object = f"{scratch}/ATL03_20230103090928_02111806_006_01.h5"

s3.upload(local_file, remote_object)
[None]
s3.stat(remote_object)
{'ETag': '"489f0191a8e9c844576ff2d18adfea59-21"', 'LastModified': datetime.datetime(2023, 7, 21, 19, 4, 55, tzinfo=tzutc()), 'size': 1063571816, 'name': 'nasa-cryo-scratch/octocat-project/ATL03_20230103090928_02111806_006_01.h5', 'type': 'file', 'StorageClass': 'STANDARD', 'VersionId': None, 'ContentType': 'application/x-hdf5'}
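
To confirm the upload completed, one quick sanity check (not from the original notebook) is to compare the local file size with the size reported by S3:

# The two sizes should match after a successful upload
os.path.getsize(local_file) == s3.size(remote_object)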

Directory

local_dir = '/tmp/example'

!ls -lh {local_dir}
total 8.0K
-rw-r--r-- 1 jovyan jovyan 22 Jul 20 23:26 data.txt
-rw-r--r-- 1 jovyan jovyan 11 Jul 20 23:26 icesat.csv
s3.upload(local_dir, scratch, recursive=True)
[None, None]
s3.ls(f'{scratch}/example')
['nasa-cryo-scratch/octocat-project/example/data.txt', 'nasa-cryo-scratch/octocat-project/example/icesat.csv']
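
Downloading a directory works the same way in reverse, and you can clean up remote files you no longer need. A minimal sketch using standard s3fs methods (the local destination path is arbitrary):

# Copy the whole remote directory back to local disk
s3.download(f'{scratch}/example', '/tmp/example_copy', recursive=True)
# Delete the remote copy once you are done with it
s3.rm(f'{scratch}/example', recursive=True)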

Accessing Data

Some software packages let you stream data directly from S3 buckets, but you can always pull objects down from S3 and work with local file paths.

This download-first, then-analyze workflow typically works well for older file formats like HDF and netCDF, which were designed to perform well on local hard drives rather than cloud object storage systems like S3.

local_object = '/tmp/test.h5'
s3.download(remote_object, local_object)
[None]
ds = xr.open_dataset(local_object, group='/gt3r/heights')
ds
fs = fsspec.filesystem("simplecache", 
                       cache_storage='/tmp/files/',
                       same_names=True,  
                       target_protocol='s3',
                       )
# The `simplecache` setting above will download the full file to /tmp/files
print(remote_object)
with fs.open(remote_object) as f:
    ds = xr.open_dataset(f.name, group='/gt3r/heights') # NOTE: pass f.name for local cached path
s3://nasa-cryo-scratch/octocat-project/ATL03_20230103090928_02111806_006_01.h5
ds

Cloud-optimized formats

Other formats like COG, Zarr, and Parquet are 'cloud-optimized' and allow for very efficient streaming directly from S3. In other words, you do not need to download entire files; instead you can easily read subsets of the data.

The example below reads a Parquet file directly into memory (RAM) from S3 without using a local disk:

gf = gpd.read_parquet('s3://nasa-cryo-scratch/scottyhq/IS2_Alaska.parquet')
gf.head(2)
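
Zarr stores work similarly: you can write to and read from the scratch bucket without staging files on local disk. The snippet below is a minimal sketch (the toy dataset and the example_zarr prefix are made up for illustration, and it requires the zarr package):

# Hypothetical example: write a tiny xarray Dataset to a Zarr store in the scratch bucket
ds_small = xr.Dataset({'height': ('x', [1.0, 2.0, 3.0])})
store = fsspec.get_mapper(f'{scratch}/example_zarr')
ds_small.to_zarr(store, mode='w')
# Reading back loads only metadata up front; data chunks stream on demand
ds_back = xr.open_zarr(store)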

Advanced: Access Scratch bucket outside of JupyterHub

Let’s say you have a lot of files on your laptop that you want to work with on CryoCloud. The S3 bucket is a convenient way to upload large datasets for collaborative analysis. To do this, you need to copy AWS credentials from the JupyterHub for use on other machines. More extensive documentation of this workflow can be found in this repository: https://github.com/scottyhq/jupyter-cloud-scoped-creds.

The following code must be run on CryoCloud JupyterHub to get temporary credentials:

client = boto3.client('sts')

with open(os.environ['AWS_WEB_IDENTITY_TOKEN_FILE']) as f:
    TOKEN = f.read()

response = client.assume_role_with_web_identity(
    RoleArn=os.environ['AWS_ROLE_ARN'],
    RoleSessionName=os.environ['JUPYTERHUB_CLIENT_ID'],
    WebIdentityToken=TOKEN,
    DurationSeconds=3600
)

response will be a Python dictionary that looks like this:

{'Credentials': {'AccessKeyId': 'ASIAYLNAJMXY2KXXXXX',
  'SecretAccessKey': 'J06p5IOHcxq1Rgv8XE4BYCYl8TG1XXXXXXX',
  'SessionToken': 'IQoJb3JpZ2luX2VjEDsaCXVzLXdlc////0dsD4zHfjdGi/0+s3XKOUKkLrhdXgZ8nrch2KtzKyYyb...',
  'Expiration': datetime.datetime(2023, 7, 21, 19, 51, 56, tzinfo=tzlocal())},
  ...

You can copy and paste these values to another computer and use them to configure your access to S3:

s3 = s3fs.S3FileSystem(key=response['Credentials']['AccessKeyId'],
                       secret=response['Credentials']['SecretAccessKey'],
                       token=response['Credentials']['SessionToken'] )
# Confirm your credentials give you access
s3.ls('nasa-cryo-scratch', refresh=True)
['nasa-cryo-scratch/octocat-project', 'nasa-cryo-scratch/scottyhq', 'nasa-cryo-scratch/sliderule-example']
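
Alternatively, you can export the same values as the standard AWS environment variables, which boto3, the AWS CLI, and s3fs all recognize. A minimal sketch to run on the other machine after pasting in the credential values:

# Standard AWS environment variables; any AWS SDK or CLI on this machine will pick them up
os.environ['AWS_ACCESS_KEY_ID'] = response['Credentials']['AccessKeyId']
os.environ['AWS_SECRET_ACCESS_KEY'] = response['Credentials']['SecretAccessKey']
os.environ['AWS_SESSION_TOKEN'] = response['Credentials']['SessionToken']
s3 = s3fs.S3FileSystem()  # credentials are read from the environment

Keep in mind that these temporary credentials expire after the requested DurationSeconds (one hour in the example above), so you will need to repeat the process once they do.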