Skip to article frontmatterSkip to article content

Array Manipulation with Xarray

There are numerous ways to work with geospatial data and so choosing tooling can be difficult. The principal library we’ll be using is Xarray for its DataArray and Dataset data structures and associated utilities as well as NumPy and Pandas for manipulating homogeneous numerical arrays and tabular data respectively.

from warnings import filterwarnings
filterwarnings('ignore')
from pathlib import Path
import numpy as np, pandas as pd, xarray as xr
import rioxarray as rio

The principal data structure in Xarray is the DataArray that provides support for labelled multi-dimensional arrays. Project Pythia provides a broad introduction to this package. We’ll focus mainly on the specific parts of the Xarray API that we’ll use for our particular geospatial analyses.

Let’s load an example xarray.DataArray data structure from a file whose location is given by LOCAL_PATH.

LOCAL_PATH = Path.cwd().parent / 'assets' / 'OPERA_L3_DIST-ALERT-HLS_T10TEM_20220815T185931Z_20220817T153514Z_S2A_30_v0.1_VEG-ANOM-MAX.tif'
data = rio.open_rasterio(LOCAL_PATH)

Examining the rich DataArray repr

When using a Jupyter notebook, the Xarray DataArray data can be examined interactively.

print(f'{type(data)=}\n')
data

Examining DataArray attributes programmatically

Of course, while this graphical view is handy, it is also possible to access various DataArray attributes programmatically. This permits us to write progam logic to manipulate DataArrays conditionally as needed. For instance:

print(data.coords)

The dimensions data.dims are the strings/labels associated with the DataArray axes.

data.dims

We can extract the coordinates as one-dimensional (homogeneous) NumPy arrays using the coords and the .values attributes.

print(data.coords['x'].values)

data.attrs is a dictionary containing other meta-data parsed from the GeoTIFF tags (the “Attributes” in the graphical view). Again, this is why rioxarray is useful; it is possible to write code that loads data from various fileformats into Xarray DataArrays, but this package wraps a lot of the messy code that would, e.g., populate data.attrs.

data.attrs

Using the DataArray rio accessor

As mentioned, rioxarray extends the class xarray.DataArray with an accessor called rio. The rio accessor effectively adds a namespace with a variety of attributes. We can use a Python list comprehension to display those that do not start with an underscore (the so-called “private” and “dunder” methods/attributes).

[name for name in dir(data.rio) if not name.startswith('_')]

The attribute data.rio.crs is important for our purposes; it provides access to the coordinate reference system associated with this raster dataset.

print(type(data.rio.crs))
print(data.rio.crs)

The .rio.crs attribute itself is a data structure of class CRS from the pyproj project. The Python repr for this class returns a string like EPSG:32610; this number refers to the EPSG Geodetic Parameter Dataset.

From Wikipedia:

EPSG Geodetic Parameter Dataset (also EPSG registry) is a public registry of geodetic datums, spatial reference systems, Earth ellipsoids, coordinate transformations and related units of measurement, originated by a member of the European Petroleum Survey Group (EPSG) in 1985. Each entity is assigned an EPSG code between 1024 and 32767, along with a standard machine-readable well-known text (WKT) representation. The dataset is maintained by the IOGP Geomatics Committee.


Manipulating data in a DataArray

This data is stored using a particular Universal Transverse Mercator (UTM) CRS; the coordinate labels would conventionally be easting and northing. However, when plotting, it will convenient to use longitude and latitude instead. We’ll relabel the coordinates to reflect this; that is, the coordinate labelled x will be relabelled as longitude and the coordinate called y will be relabelled latitude.

data = data.rename({'x':'longitude', 'y':'latitude'})
print(data.coords)

Again, even though the numerical values stored in the coordinate arrays don’t strictly make sense as (longitude, latitude) values, we’ll apply these labels now to simplify plotting later.

Xarray DataArray objects enable slicing much like Python lists do. The following two cells both extract the same subarray by two different method calls.

data.isel(longitude=slice(0,2))
data.sel(longitude=[499_995, 500_025])

Rather than using brackets to slice sections of arrays (as in NumPy), for DataArrays, we can use the sel or isel methods to select slices by continuous coordinate values or by integer positions (i.e., “pixel” coordinates) respectively. This is similar to using .loc and .iloc in Pandas to extract entries from a Pandas Series or DataFrame.

If we take a 2D slice from the 3D DataArray data, we can plot it using the .plot accessor (more on this later).

data.isel(band=0).plot();

That plot took some time to render because the array plotted had 3,600×3,6003,600\times3,600 pixels. We can use the Python builtin slice function to extract, say, every 100th pixel in either direction to plot a lower resolution image much faster.

steps = 100
subset = slice(0,None,steps)
view = data.isel(longitude=subset, latitude=subset, band=0)
view.plot();

The plot produced is rather dark (reflecting that most of the entries are zero according to the legend). Notice that the axes are labelled automatically using the coords we renamed before.


Extracting DataArray data to NumPy, Pandas

Notice that a DataArray is a wrapper around a NumPy array. That NumPy array can be retrieved using the .values attribute.

array = data.values
print(f'{type(array)=}')
print(f'{array.shape=}')
print(f'{array.dtype=}')
print(f'{array.nbytes=}')

This raster data is stored as 8-bit unsigned integer data, so one byte for each pixel. A single unsigned 8-bit integer can represent integer values between 0 and 255. In an array with a bit more than thirteen million elements, that means there are many repeated values. We can see by putting the pixel values into a Pandas Series and using the .value_counts method.

s_flat = pd.Series(array.flatten()).value_counts()
s_flat.sort_index()

Most of the entries in this raster array are zero. The numerical values vary between 0 and 100 with the exception of some 1,700 pixels with the value 255. This will make more sense when we discuss the DIST data product specification.


Accumulating & concatenating a sequence of DataArrays

It is often convenient to stack multiple two-dimensional arrays of raster data into a single three-dimensional array. In NumPy, this is typically done with the numpy.concatenate function. There is a similar utility in Xarray—xarray.concat (that is similar in design to the Pandas function pandas.concat). The principal distinction between numpy.concatenate and xarray.concat is that the latter function has to account for labelled coordinates while the former does not; this is important when, e.g., the coordinate axes of two rasters overlap but are not perfectly aligned.

To see how stacking rasters works, we’ll start by making a list of three GeoTIFF files (stored locally), initializing an empty list stack, and then building a list of DataArrays in a loop.

RASTER_FILES = list((Path(FILE_STEM, 'assets').glob('OPERA*VEG*.tif')))

stack = []
for path in RASTER_FILES:
    print(f"Stacking {path.name}..")
    data = rio.open_rasterio(path).rename(dict(x='longitude', y='latitude'))
    band_name = path.stem.split('_')[-1]
    data.coords.update({'band': [band_name]})
    data.attrs = dict(description=f"OPERA DIST product", units=None)
    stack.append(data)

Here are some important observations about the preceding code loop:

Having built up a list of DataArrays in the list stack, we can assemble a three-dimensional DataArray using xarray.concat.

stack = xr.concat(stack, dim='band')

The xarray.concat function accepts a sequence of xarray.DataArray objects with conforming dimensions and concatenates them along a specified dimension. For this example, we stack two-dimensional rasters that correspond to different bands or layers; that’s why we use the option dim='band' in the call to xarray.concat. Later, we’ll stack two-dimensional rasters along a time axis instead (this involves slightly different code to ensure correct labelling & alignment).

Let’s examine stack through its repr in this Jupyter notebook.

stack

Notice that stack has a CRS associated with it that was parsed by rioxarray.open_rasterio.

stack.rio.crs

This process is very useful for analysis (assuming that there is enough memory available to store the entire collection of rasters). Later, we’ll use this approach numerous times to manage collections of rasters of conforming dimensions. The stack can then be used for producing a dynamic visualization with a slider or, alternatively, for producing a static plot.