Skip to article frontmatterSkip to article content

Array Manipulation with Xarray

There are numerous ways to work with geospatial data and so choosing tooling can be difficult. The principal library we’ll be using is Xarray for its DataArray and Dataset data structures and associated utilities as well as NumPy and Pandas for manipulating homogeneous numerical arrays and tabular data respectively.

from warnings import filterwarnings
filterwarnings('ignore')
from pathlib import Path
import numpy as np, pandas as pd, xarray as xr
import rioxarray as rio

The principal data structure in Xarray is the DataArray that provides support for labelled multi-dimensional arrays. Project Pythia provides a broad introduction to this package. We’ll focus mainly on the specific parts of the Xarray API that we’ll use for our particular geospatial analyses.

Let’s load an example xarray.DataArray data structure from a file whose location is given by LOCAL_PATH.

LOCAL_PATH = Path.cwd().parent / 'assets' / 'OPERA_L3_DIST-ALERT-HLS_T10TEM_20220815T185931Z_20220817T153514Z_S2A_30_v0.1_VEG-ANOM-MAX.tif'
data = rio.open_rasterio(LOCAL_PATH)

Examining the rich DataArray repr

When using a Jupyter notebook, the Xarray DataArray data can be examined interactively.

  • The output cell contains a rich Jupyter notebook repr for the DataArray class.
  • The triangles next to the “Coordinates”, “Indexes”, and “Attributes” headings can be clicked with a mouse to reveal an expanded view.
print(f'{type(data)=}\n')
data

Examining DataArray attributes programmatically

Of course, while this graphical view is handy, it is also possible to access various DataArray attributes programmatically. This permits us to write progam logic to manipulate DataArrays conditionally as needed. For instance:

print(data.coords)

The dimensions data.dims are the strings/labels associated with the DataArray axes.

data.dims

We can extract the coordinates as one-dimensional (homogeneous) NumPy arrays using the coords and the .values attributes.

print(data.coords['x'].values)

data.attrs is a dictionary containing other meta-data parsed from the GeoTIFF tags (the “Attributes” in the graphical view). Again, this is why rioxarray is useful; it is possible to write code that loads data from various fileformats into Xarray DataArrays, but this package wraps a lot of the messy code that would, e.g., populate data.attrs.

data.attrs

Using the DataArray rio accessor

As mentioned, rioxarray extends the class xarray.DataArray with an accessor called rio. The rio accessor effectively adds a namespace with a variety of attributes. We can use a Python list comprehension to display those that do not start with an underscore (the so-called “private” and “dunder” methods/attributes).

[name for name in dir(data.rio) if not name.startswith('_')]

The attribute data.rio.crs is important for our purposes; it provides access to the coordinate reference system associated with this raster dataset.

print(type(data.rio.crs))
print(data.rio.crs)

The .rio.crs attribute itself is a data structure of class CRS from the pyproj project. The Python repr for this class returns a string like EPSG:32610; this number refers to the EPSG Geodetic Parameter Dataset.

From Wikipedia:

EPSG Geodetic Parameter Dataset (also EPSG registry) is a public registry of geodetic datums, spatial reference systems, Earth ellipsoids, coordinate transformations and related units of measurement, originated by a member of the European Petroleum Survey Group (EPSG) in 1985. Each entity is assigned an EPSG code between 1024 and 32767, along with a standard machine-readable well-known text (WKT) representation. The dataset is maintained by the IOGP Geomatics Committee.


Manipulating data in a DataArray

This data is stored using a particular Universal Transverse Mercator (UTM) CRS; the coordinate labels would conventionally be easting and northing. However, when plotting, it will convenient to use longitude and latitude instead. We’ll relabel the coordinates to reflect this; that is, the coordinate labelled x will be relabelled as longitude and the coordinate called y will be relabelled latitude.

data = data.rename({'x':'longitude', 'y':'latitude'})
print(data.coords)

Again, even though the numerical values stored in the coordinate arrays don’t strictly make sense as (longitude, latitude) values, we’ll apply these labels now to simplify plotting later.

Xarray DataArray objects enable slicing much like Python lists do. The following two cells both extract the same subarray by two different method calls.

data.isel(longitude=slice(0,2))
data.sel(longitude=[499_995, 500_025])

Rather than using brackets to slice sections of arrays (as in NumPy), for DataArrays, we can use the sel or isel methods to select slices by continuous coordinate values or by integer positions (i.e., “pixel” coordinates) respectively. This is similar to using .loc and .iloc in Pandas to extract entries from a Pandas Series or DataFrame.

If we take a 2D slice from the 3D DataArray data, we can plot it using the .plot accessor (more on this later).

data.isel(band=0).plot();

That plot took some time to render because the array plotted had 3,600×3,6003,600\times3,600 pixels. We can use the Python builtin slice function to extract, say, every 100th pixel in either direction to plot a lower resolution image much faster.

steps = 100
subset = slice(0,None,steps)
view = data.isel(longitude=subset, latitude=subset, band=0)
view.plot();

The plot produced is rather dark (reflecting that most of the entries are zero according to the legend). Notice that the axes are labelled automatically using the coords we renamed before.


Extracting DataArray data to NumPy, Pandas

Notice that a DataArray is a wrapper around a NumPy array. That NumPy array can be retrieved using the .values attribute.

array = data.values
print(f'{type(array)=}')
print(f'{array.shape=}')
print(f'{array.dtype=}')
print(f'{array.nbytes=}')

This raster data is stored as 8-bit unsigned integer data, so one byte for each pixel. A single unsigned 8-bit integer can represent integer values between 0 and 255. In an array with a bit more than thirteen million elements, that means there are many repeated values. We can see by putting the pixel values into a Pandas Series and using the .value_counts method.

s_flat = pd.Series(array.flatten()).value_counts()
s_flat.sort_index()

Most of the entries in this raster array are zero. The numerical values vary between 0 and 100 with the exception of some 1,700 pixels with the value 255. This will make more sense when we discuss the DIST data product specification.


Accumulating & concatenating a sequence of DataArrays

It is often convenient to stack multiple two-dimensional arrays of raster data into a single three-dimensional array. In NumPy, this is typically done with the numpy.concatenate function. There is a similar utility in Xarray—xarray.concat (that is similar in design to the Pandas function pandas.concat). The principal distinction between numpy.concatenate and xarray.concat is that the latter function has to account for labelled coordinates while the former does not; this is important when, e.g., the coordinate axes of two rasters overlap but are not perfectly aligned.

To see how stacking rasters works, we’ll start by making a list of three GeoTIFF files (stored locally), initializing an empty list stack, and then building a list of DataArrays in a loop.

RASTER_FILES = list((Path(FILE_STEM, 'assets').glob('OPERA*VEG*.tif')))

stack = []
for path in RASTER_FILES:
    print(f"Stacking {path.name}..")
    data = rio.open_rasterio(path).rename(dict(x='longitude', y='latitude'))
    band_name = path.stem.split('_')[-1]
    data.coords.update({'band': [band_name]})
    data.attrs = dict(description=f"OPERA DIST product", units=None)
    stack.append(data)

Here are some important observations about the preceding code loop:

  • Using rioxarray.open_rasterio to load an Xarray DataArray into memory does a lot of work for us. In particular, it makes sure the continuous coordinates are sensibly aligned with the underlying pixel coordinates.
  • By default, data.coords has keys x & y that we choose to relabel as longitude & latitude respectively. Technically, the continuous coordinate values loaded from this particular GeoTIFF file are expressed in UTM coordinates (i.e., easting & northing), but, later, when plotting, the labels longitude & latitude will be more convenient.
  • data.coords['band'] as loaded from the file has value 1. We choose to overwrite that value with the name of the band (which we extract from the filename as band_name).
  • By default, rioxarray.open_rasterio populates data.attrs with key-value pairs extracted from the TIFF tags. For different bands/layers, these attribute dictionaries could have conflicting keys or values. It may be advisable to preserve this metadata in some circumstances; we simply choose to discard it in this context to avoid potential conflicts. The minimal dictionary of attributes in the final data structure will have description and units as its only keys.

Having built up a list of DataArrays in the list stack, we can assemble a three-dimensional DataArray using xarray.concat.

stack = xr.concat(stack, dim='band')

The xarray.concat function accepts a sequence of xarray.DataArray objects with conforming dimensions and concatenates them along a specified dimension. For this example, we stack two-dimensional rasters that correspond to different bands or layers; that’s why we use the option dim='band' in the call to xarray.concat. Later, we’ll stack two-dimensional rasters along a time axis instead (this involves slightly different code to ensure correct labelling & alignment).

Let’s examine stack through its repr in this Jupyter notebook.

stack

Notice that stack has a CRS associated with it that was parsed by rioxarray.open_rasterio.

stack.rio.crs

This process is very useful for analysis (assuming that there is enough memory available to store the entire collection of rasters). Later, we’ll use this approach numerous times to manage collections of rasters of conforming dimensions. The stack can then be used for producing a dynamic visualization with a slider or, alternatively, for producing a static plot.