Robin Wilson
Geography & Environment, University of Southampton
@sciremotesense robin@rtwilson.com
Processing large time series of raster data is difficult.
Example: we have many decades of daily raster data, and want to get things like seasonal mean images, a time series for a single pixel, and the image for a single day.
xarray
The power of pandas, for multidimensional arrays
python (obviously!)
numpy
pandas
matplotlib
GDAL
rasterio
Q: How many people are experienced with numpy?
Q: ...with pandas?
Q: ...with GDAL?
import xarray as xr
PM25 = xr.open_dataarray('/Users/robin/code/MAIACProcessing/All2014.nc')
PM25.shape
(181, 1162, 1240)
PM25.dims
(u'time', u'y', u'x')
seasonal = PM25.groupby('time.season').mean(dim='time')
seasonal.plot.imshow(col='season', robust=True)
<xarray.plot.facetgrid.FacetGrid at 0x10cddea10>
time_series = PM25.isel(x=1000, y=1100).to_pandas().dropna()
time_series
time
2014-01-07     62.0
2014-01-17    121.0
2014-01-25    127.0
2014-02-10    135.0
2014-02-11    186.0
2014-02-18    237.0
2014-02-25    352.0
2014-02-26    293.0
2014-02-27    260.0
2014-03-01    165.0
2014-03-08    221.0
2014-03-09    221.0
2014-03-10    291.0
2014-03-13    552.0
2014-03-16    156.0
2014-03-21    159.0
2014-03-22    160.0
2014-03-23    163.0
2014-03-25    261.0
2014-03-30    362.0
2014-04-04     19.0
2014-04-09     19.0
2014-04-10     19.0
2014-04-11     19.0
2014-04-13     33.0
2014-04-14     19.0
2014-04-16    159.0
2014-04-18     53.0
2014-04-19    118.0
2014-04-24    194.0
2014-04-25    250.0
2014-05-10    154.0
2014-05-13    181.0
2014-05-14     19.0
2014-05-15    117.0
2014-05-17    239.0
2014-05-22    188.0
2014-06-02    189.0
2014-06-05     71.0
2014-06-07    107.0
2014-06-13    115.0
2014-06-14    324.0
2014-06-15     64.0
2014-06-18    132.0
2014-06-20    101.0
2014-06-21    115.0
2014-06-23    200.0
2014-06-27    162.0
dtype: float32
one_day = PM25.sel(time='2014-02-20')
one_day.plot(robust=True)
<matplotlib.collections.QuadMesh at 0x119e41c90>
All of that in just three lines of code:
seasonal = PM25.groupby('time.season').mean(dim='time')
time_series = PM25.isel(x=1000, y=1100).to_pandas().dropna()
one_day = PM25.sel(time='2014-02-20')
Later: dask and dask.distributed
xarray.DataArray is a fancy, labelled version of a numpy.ndarray.
xarray.Dataset is a collection of multiple DataArrays which share dimensions.
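A quick illustrative sketch of a Dataset holding two variables on the same grid (the variable names here are made up, not from the talk):

import numpy as np
import xarray as xr

# Two variables defined on the same 2D grid
ndvi = xr.DataArray(np.random.rand(3, 4), dims=('y', 'x'))
lst = xr.DataArray(np.random.rand(3, 4), dims=('y', 'x'))

# A Dataset is a dict-like collection of DataArrays sharing dimensions
ds = xr.Dataset({'ndvi': ndvi, 'lst': lst})
ds['ndvi']   # get one variable back out as a DataArray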
arr = np.random.rand(3, 4, 2)
xr.DataArray(arr)
<xarray.DataArray (dim_0: 3, dim_1: 4, dim_2: 2)>
array([[[ 0.530635,  0.049118],
        [ 0.709483,  0.162677],
        [ 0.838245,  0.956551],
        [ 0.724748,  0.834736]],

       [[ 0.743153,  0.307386],
        [ 0.823829,  0.570499],
        [ 0.629053,  0.018669],
        [ 0.896682,  0.702414]],

       [[ 0.167959,  0.295538],
        [ 0.394742,  0.947428],
        [ 0.700852,  0.312621],
        [ 0.276099,  0.253416]]])
Dimensions without coordinates: dim_0, dim_1, dim_2
xr.DataArray(arr, dims=('x', 'y', 'time'))
<xarray.DataArray (x: 3, y: 4, time: 2)>
array([[[ 0.530635,  0.049118],
        [ 0.709483,  0.162677],
        [ 0.838245,  0.956551],
        [ 0.724748,  0.834736]],

       [[ 0.743153,  0.307386],
        [ 0.823829,  0.570499],
        [ 0.629053,  0.018669],
        [ 0.896682,  0.702414]],

       [[ 0.167959,  0.295538],
        [ 0.394742,  0.947428],
        [ 0.700852,  0.312621],
        [ 0.276099,  0.253416]]])
Dimensions without coordinates: x, y, time
import datetime

da = xr.DataArray(arr,
                  dims=('x', 'y', 'time'),
                  coords={'x': [10, 20, 30],
                          'y': [0.3, 0.7, 1.3, 1.5],
                          'time': [datetime.datetime(2016, 3, 5),
                                   datetime.datetime(2016, 4, 7)]})
da
<xarray.DataArray (x: 3, y: 4, time: 2)>
array([[[ 0.530635,  0.049118],
        [ 0.709483,  0.162677],
        [ 0.838245,  0.956551],
        [ 0.724748,  0.834736]],

       [[ 0.743153,  0.307386],
        [ 0.823829,  0.570499],
        [ 0.629053,  0.018669],
        [ 0.896682,  0.702414]],

       [[ 0.167959,  0.295538],
        [ 0.394742,  0.947428],
        [ 0.700852,  0.312621],
        [ 0.276099,  0.253416]]])
Coordinates:
  * y        (y) float64 0.3 0.7 1.3 1.5
  * x        (x) int64 10 20 30
  * time     (time) datetime64[ns] 2016-03-05 2016-04-07
da.sel(time='2016-03-05')
<xarray.DataArray (x: 3, y: 4)>
array([[ 0.530635,  0.709483,  0.838245,  0.724748],
       [ 0.743153,  0.823829,  0.629053,  0.896682],
       [ 0.167959,  0.394742,  0.700852,  0.276099]])
Coordinates:
  * y        (y) float64 0.3 0.7 1.3 1.5
  * x        (x) int64 10 20 30
    time     datetime64[ns] 2016-03-05
da.isel(time=1)
<xarray.DataArray (x: 3, y: 4)>
array([[ 0.049118,  0.162677,  0.956551,  0.834736],
       [ 0.307386,  0.570499,  0.018669,  0.702414],
       [ 0.295538,  0.947428,  0.312621,  0.253416]])
Coordinates:
  * y        (y) float64 0.3 0.7 1.3 1.5
  * x        (x) int64 10 20 30
    time     datetime64[ns] 2016-04-07
da.sel(x=slice(0, 20))
<xarray.DataArray (x: 2, y: 4, time: 2)>
array([[[ 0.530635,  0.049118],
        [ 0.709483,  0.162677],
        [ 0.838245,  0.956551],
        [ 0.724748,  0.834736]],

       [[ 0.743153,  0.307386],
        [ 0.823829,  0.570499],
        [ 0.629053,  0.018669],
        [ 0.896682,  0.702414]]])
Coordinates:
  * y        (y) float64 0.3 0.7 1.3 1.5
  * x        (x) int64 10 20
  * time     (time) datetime64[ns] 2016-03-05 2016-04-07
da.mean(dim='time')
<xarray.DataArray (x: 3, y: 4)>
array([[ 0.289876,  0.43608 ,  0.897398,  0.779742],
       [ 0.525269,  0.697164,  0.323861,  0.799548],
       [ 0.231749,  0.671085,  0.506736,  0.264758]])
Coordinates:
  * y        (y) float64 0.3 0.7 1.3 1.5
  * x        (x) int64 10 20 30
da.mean(dim=['x', 'y'])
<xarray.DataArray (time: 2)>
array([ 0.619623,  0.450921])
Coordinates:
  * time     (time) datetime64[ns] 2016-03-05 2016-04-07
PM25.sel(time='2014').groupby('time.month').std(dim='time')
<xarray.DataArray u'data' (month: 6, y: 1162, x: 1240)>
array([[[        nan,         nan, ...,         nan,         nan],
        [        nan,         nan, ...,         nan,         nan],
        ...,
        [        nan,         nan, ...,         nan,         nan],
        [        nan,         nan, ...,         nan,         nan]],

       [[        nan,         nan, ...,         nan,         nan],
        [  0.      ,         nan, ...,         nan,         nan],
        ...,
        [        nan,         nan, ...,         nan,         nan],
        [        nan,         nan, ...,         nan,         nan]],

       ...,

       [[ 23.976839,   0.      , ...,         nan,         nan],
        [  1.      ,   2.5     , ...,         nan,         nan],
        ...,
        [        nan,         nan, ...,         nan,         nan],
        [        nan,         nan, ...,         nan,         nan]],

       [[146.487762, 125.048988, ...,         nan,         nan],
        [146.279037, 148.149338, ...,         nan,         nan],
        ...,
        [        nan,         nan, ...,         nan,         nan],
        [        nan,         nan, ...,         nan,         nan]]], dtype=float32)
Coordinates:
  * y        (y) float64 1.429e+06 1.428e+06 1.427e+06 1.426e+06 1.424e+06 ...
  * x        (x) float64 -9.476e+05 -9.464e+05 -9.451e+05 -9.439e+05 ...
  * month    (month) int64 1 2 3 4 5 6
dask and dask.distributed
dask creates a computational graph of your processing steps, and then executes it as efficiently as possible.
This includes only loading the data that is actually needed, and only processing things once.
# Open several NetCDF files as one dataset, split into dask chunks of 10 time steps
data = xr.open_mfdataset(['DaskTest1.nc', 'DaskTest2.nc'], chunks={'time': 10})['data']

# These operations are lazy: they just build up a task graph
avg = data.mean(dim='time')
seasonal = data.groupby('time.season').mean(dim='time')
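Nothing has actually been computed at this point. A short sketch (continuing from the code above) of how evaluation is only triggered when the values are needed:

# avg and seasonal are still lazy dask graphs; calling .compute() (or plotting)
# triggers the real work, loading and processing each chunk as required
avg_values = avg.compute()
seasonal.plot.imshow(col='season', robust=True)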
dask.distributed
All of these different chunks, and separate processing chains, can be run on separate computers.
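A minimal sketch of how that might look (the scheduler address below is a placeholder, not from the talk); once a client is created, the dask-backed xarray computations are sent to the distributed workers:

from dask.distributed import Client

# Connect to a running dask scheduler (placeholder address);
# Client() with no arguments would start a local cluster instead
client = Client('tcp://scheduler-address:8786')

# The same lazy xarray/dask code as before now runs across the workers
seasonal = data.groupby('time.season').mean(dim='time').compute()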
xarray
xarray can read various raster formats directly, most notably NetCDF (as used above).
However, xarray can't currently read other standard raster formats, like GeoTIFF.
rasterio
A nice, Pythonic interface to GDAL, making it really easy to read almost any raster file into Python.
import rasterio

# Open a GeoTIFF and read the first band as a numpy array
with rasterio.open('SPOT_ROI.tif') as src:
    data = src.read(1)

print(data)
[[51 51 51 ..., 32 32 29]
 [51 51 51 ..., 29 30 27]
 [54 54 52 ..., 30 29 28]
 ...,
 [33 34 34 ..., 31 35 35]
 [34 34 34 ..., 32 34 34]
 [34 34 34 ..., 31 33 38]]
All we need to do is write some functions to read data with rasterio and create a DataArray from it.
But it's a bit more difficult than that...
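A minimal sketch of what such a rasterio_to_xarray helper might look like (an illustration, not the exact function used below; it assumes a single band and an axis-aligned, north-up grid):

import numpy as np
import rasterio
import xarray as xr

def rasterio_to_xarray(filename):
    """Read a single-band raster into a DataArray with x/y coordinates."""
    with rasterio.open(filename) as src:
        data = src.read(1).astype(np.float64)

        # Replace the file's nodata value (if any) with NaN
        if src.nodata is not None:
            data = np.where(data == src.nodata, np.nan, data)

        # Pixel-centre coordinates from the affine transform
        # (assumes no rotation terms)
        nrows, ncols = data.shape
        t = src.transform
        x = t.c + t.a * (np.arange(ncols) + 0.5)
        y = t.f + t.e * (np.arange(nrows) + 0.5)

        # Keep the CRS and geotransform as attributes so the array
        # can be written back out to a raster file later
        return xr.DataArray(data,
                            dims=('y', 'x'),
                            coords={'y': y, 'x': x},
                            attrs={'crs': src.crs.to_string() if src.crs else None,
                                   'affine': src.transform.to_gdal()})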
data = rasterio_to_xarray('SPOT_ROI.tif')
data
rasterio_to_xarray.py:29: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  data = np.where(data == src.nodata, np.nan, data)
<xarray.DataArray (y: 1644, x: 1435)>
array([[ 51.,  51.,  51., ...,  32.,  32.,  29.],
       [ 51.,  51.,  51., ...,  29.,  30.,  27.],
       [ 54.,  54.,  52., ...,  30.,  29.,  28.],
       ...,
       [ 33.,  34.,  34., ...,  31.,  35.,  35.],
       [ 34.,  34.,  34., ...,  32.,  34.,  34.],
       [ 34.,  34.,  34., ...,  31.,  33.,  38.]])
Coordinates:
  * y        (y) float64 1.484e+05 1.484e+05 1.484e+05 1.484e+05 1.484e+05 ...
  * x        (x) float64 4.301e+05 4.301e+05 4.302e+05 4.302e+05 4.302e+05 ...
Attributes:
    crs: +ellps=airy +k=0.999601 +lat_0=49 +lon_0=-2 +no_defs +proj=tmerc +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +x_0=400000 +y_0=-100000
    affine: (430130.2396, 10.0, 0.0, 148415.8755, 0.0, -10.0)
Take a large number of files, one from each orbit of a satellite, and put them into one large array with a time dimension.
# Read each file into a DataArray, then stack them along a new time dimension
list_of_data_arrays = [file_to_da(filename) for filename in files]
combined = xr.concat(list_of_data_arrays, dim='time')
combined.shape
(10, 1162, 1240)
combined.coords
Coordinates:
  * y        (y) float64 1.429e+06 1.428e+06 1.427e+06 1.426e+06 1.424e+06 ...
  * x        (x) float64 -9.476e+05 -9.464e+05 -9.451e+05 -9.439e+05 ...
  * time     (time) datetime64[ns] 2003-04-16T14:40:00 2002-03-19T11:20:00 ...
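The file_to_da helper used above isn't shown in the talk. A rough sketch of what it might do, assuming the acquisition time can be parsed from a made-up filename pattern and reusing the rasterio_to_xarray sketch from earlier:

import datetime

def file_to_da(filename):
    """Read one raster file and tag it with its acquisition time."""
    da = rasterio_to_xarray(filename)

    # Hypothetical filename pattern such as 'orbit_2003-04-16T1440.tif';
    # in practice the timestamp might come from the file's metadata instead
    stamp = filename.split('_')[-1].replace('.tif', '')
    da.coords['time'] = datetime.datetime.strptime(stamp, '%Y-%m-%dT%H%M')

    # xr.concat(..., dim='time') then stacks these scalar time coordinates
    # into a proper time dimension
    return da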
# keep_attrs=True preserves the crs/affine attributes so the result can be written back out
mean = combined.mean(dim='time', keep_attrs=True)
xarray_to_rasterio(mean, 'Mean.tif')
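Similarly, xarray_to_rasterio isn't shown. A minimal sketch of writing a 2D DataArray back to a GeoTIFF, assuming the crs and affine attributes attached by the reader sketch above (and preserved here by keep_attrs=True):

import numpy as np
import rasterio
from affine import Affine

def xarray_to_rasterio(da, filename):
    """Write a 2D DataArray with 'crs' and 'affine' attributes to a GeoTIFF."""
    data = da.values.astype(np.float32)

    with rasterio.open(filename, 'w',
                       driver='GTiff',
                       height=data.shape[0],
                       width=data.shape[1],
                       count=1,
                       dtype=data.dtype,
                       crs=da.attrs.get('crs'),
                       transform=Affine.from_gdal(*da.attrs['affine']),
                       nodata=np.nan) as dst:
        dst.write(data, 1)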
Can xarray do standard geographic processing methods?
A DataArray is just a fancy numpy array: use .values to get the underlying numpy array.
rasterstats is a Python module for doing zonal statistics.
We can use the same approach as the rasterstats module, but modified to work with DataArray objects, calculating mean, median, std etc.
(There are more efficient ways to do this if no polygons overlap.)
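As an illustration, a sketch of that approach (the polygon input format and the affine attribute are assumptions carried over from the sketches above, not the talk's actual code):

import numpy as np
import rasterio.features
from affine import Affine

def zonal_stats(da, polygons):
    """rasterstats-style zonal statistics on a 2D DataArray.

    polygons: GeoJSON-like geometries in the same CRS as the DataArray.
    Each polygon is rasterised separately, so overlapping polygons are
    handled correctly (at the cost of some efficiency).
    """
    transform = Affine.from_gdal(*da.attrs['affine'])
    results = []
    for geom in polygons:
        # Boolean mask of the pixels falling inside this polygon
        mask = rasterio.features.geometry_mask([geom],
                                               out_shape=da.shape,
                                               transform=transform,
                                               invert=True)
        values = da.values[mask]
        values = values[~np.isnan(values)]
        results.append({'mean': values.mean(),
                        'median': np.median(values),
                        'std': values.std()})
    return results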
xarray is under rapid development, and functionality like this may make it into xarray itself soon-ish.
dask is rapidly improving too.