Robin Wilson
Geography & Environment, University of Southampton
@sciremotesense robin@rtwilson.com
Processing large time series of raster data is difficult
Example:
We have many decades of daily raster data, and want to get:
XArray¶The power of pandas for multidimensional arrays
python (obviously!)numpypandasmatplotlibGDALrasterioQ: How many people are experienced with numpy?
Q: ...with pandas?
Q: ...with GDAL?
import xarray as xr
PM25 = xr.open_dataarray('/Users/robin/code/MAIACProcessing/All2014.nc')
PM25.shape
(181, 1162, 1240)
PM25.dims
(u'time', u'y', u'x')
seasonal = PM25.groupby('time.season').mean(dim='time')
seasonal.plot.imshow(col='season', robust=True)
<xarray.plot.facetgrid.FacetGrid at 0x10cddea10>
time_series = PM25.isel(x=1000, y=1100).to_pandas().dropna()
time_series
time 2014-01-07 62.0 2014-01-17 121.0 2014-01-25 127.0 2014-02-10 135.0 2014-02-11 186.0 2014-02-18 237.0 2014-02-25 352.0 2014-02-26 293.0 2014-02-27 260.0 2014-03-01 165.0 2014-03-08 221.0 2014-03-09 221.0 2014-03-10 291.0 2014-03-13 552.0 2014-03-16 156.0 2014-03-21 159.0 2014-03-22 160.0 2014-03-23 163.0 2014-03-25 261.0 2014-03-30 362.0 2014-04-04 19.0 2014-04-09 19.0 2014-04-10 19.0 2014-04-11 19.0 2014-04-13 33.0 2014-04-14 19.0 2014-04-16 159.0 2014-04-18 53.0 2014-04-19 118.0 2014-04-24 194.0 2014-04-25 250.0 2014-05-10 154.0 2014-05-13 181.0 2014-05-14 19.0 2014-05-15 117.0 2014-05-17 239.0 2014-05-22 188.0 2014-06-02 189.0 2014-06-05 71.0 2014-06-07 107.0 2014-06-13 115.0 2014-06-14 324.0 2014-06-15 64.0 2014-06-18 132.0 2014-06-20 101.0 2014-06-21 115.0 2014-06-23 200.0 2014-06-27 162.0 dtype: float32
one_day = PM25.sel(time='2014-02-20')
one_day.plot(robust=True)
<matplotlib.collections.QuadMesh at 0x119e41c90>
seasonal = PM25.groupby('time.season').mean(dim='time')
time_series = PM25.isel(x=1000, y=1100).to_pandas().dropna()
one_day = PM25.sel(time='2014-02-20')
dask and dask.distributedxarray.DataArray is a fancy, labelled version of a numpy.ndarray
xarray.Dataset is a collection of multiple DataArrays which share dimensions
arr = np.random.rand(3, 4, 2)
xr.DataArray(arr)
<xarray.DataArray (dim_0: 3, dim_1: 4, dim_2: 2)>
array([[[ 0.530635, 0.049118],
[ 0.709483, 0.162677],
[ 0.838245, 0.956551],
[ 0.724748, 0.834736]],
[[ 0.743153, 0.307386],
[ 0.823829, 0.570499],
[ 0.629053, 0.018669],
[ 0.896682, 0.702414]],
[[ 0.167959, 0.295538],
[ 0.394742, 0.947428],
[ 0.700852, 0.312621],
[ 0.276099, 0.253416]]])
Dimensions without coordinates: dim_0, dim_1, dim_2
xr.DataArray(arr, dims=('x', 'y', 'time'))
<xarray.DataArray (x: 3, y: 4, time: 2)>
array([[[ 0.530635, 0.049118],
[ 0.709483, 0.162677],
[ 0.838245, 0.956551],
[ 0.724748, 0.834736]],
[[ 0.743153, 0.307386],
[ 0.823829, 0.570499],
[ 0.629053, 0.018669],
[ 0.896682, 0.702414]],
[[ 0.167959, 0.295538],
[ 0.394742, 0.947428],
[ 0.700852, 0.312621],
[ 0.276099, 0.253416]]])
Dimensions without coordinates: x, y, time
da = xr.DataArray(arr,
dims=('x', 'y', 'time'),
coords={'x': [10, 20, 30],
'y': [0.3, 0.7, 1.3, 1.5],
'time': [datetime.datetime(2016, 3, 5),
datetime.datetime(2016, 4, 7)]})
da
<xarray.DataArray (x: 3, y: 4, time: 2)>
array([[[ 0.530635, 0.049118],
[ 0.709483, 0.162677],
[ 0.838245, 0.956551],
[ 0.724748, 0.834736]],
[[ 0.743153, 0.307386],
[ 0.823829, 0.570499],
[ 0.629053, 0.018669],
[ 0.896682, 0.702414]],
[[ 0.167959, 0.295538],
[ 0.394742, 0.947428],
[ 0.700852, 0.312621],
[ 0.276099, 0.253416]]])
Coordinates:
* y (y) float64 0.3 0.7 1.3 1.5
* x (x) int64 10 20 30
* time (time) datetime64[ns] 2016-03-05 2016-04-07
da.sel(time='2016-03-05')
<xarray.DataArray (x: 3, y: 4)>
array([[ 0.530635, 0.709483, 0.838245, 0.724748],
[ 0.743153, 0.823829, 0.629053, 0.896682],
[ 0.167959, 0.394742, 0.700852, 0.276099]])
Coordinates:
* y (y) float64 0.3 0.7 1.3 1.5
* x (x) int64 10 20 30
time datetime64[ns] 2016-03-05
da.isel(time=1)
<xarray.DataArray (x: 3, y: 4)>
array([[ 0.049118, 0.162677, 0.956551, 0.834736],
[ 0.307386, 0.570499, 0.018669, 0.702414],
[ 0.295538, 0.947428, 0.312621, 0.253416]])
Coordinates:
* y (y) float64 0.3 0.7 1.3 1.5
* x (x) int64 10 20 30
time datetime64[ns] 2016-04-07
da.sel(x=slice(0, 20))
<xarray.DataArray (x: 2, y: 4, time: 2)>
array([[[ 0.530635, 0.049118],
[ 0.709483, 0.162677],
[ 0.838245, 0.956551],
[ 0.724748, 0.834736]],
[[ 0.743153, 0.307386],
[ 0.823829, 0.570499],
[ 0.629053, 0.018669],
[ 0.896682, 0.702414]]])
Coordinates:
* y (y) float64 0.3 0.7 1.3 1.5
* x (x) int64 10 20
* time (time) datetime64[ns] 2016-03-05 2016-04-07
da.mean(dim='time')
<xarray.DataArray (x: 3, y: 4)>
array([[ 0.289876, 0.43608 , 0.897398, 0.779742],
[ 0.525269, 0.697164, 0.323861, 0.799548],
[ 0.231749, 0.671085, 0.506736, 0.264758]])
Coordinates:
* y (y) float64 0.3 0.7 1.3 1.5
* x (x) int64 10 20 30
da.mean(dim=['x', 'y'])
<xarray.DataArray (time: 2)> array([ 0.619623, 0.450921]) Coordinates: * time (time) datetime64[ns] 2016-03-05 2016-04-07
PM25.sel(time='2014').groupby('time.month').std(dim='time')
<xarray.DataArray u'data' (month: 6, y: 1162, x: 1240)>
array([[[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan],
...,
[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan]],
[[ nan, nan, ..., nan, nan],
[ 0. , nan, ..., nan, nan],
...,
[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan]],
...,
[[ 23.976839, 0. , ..., nan, nan],
[ 1. , 2.5 , ..., nan, nan],
...,
[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan]],
[[ 146.487762, 125.048988, ..., nan, nan],
[ 146.279037, 148.149338, ..., nan, nan],
...,
[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan]]], dtype=float32)
Coordinates:
* y (y) float64 1.429e+06 1.428e+06 1.427e+06 1.426e+06 1.424e+06 ...
* x (x) float64 -9.476e+05 -9.464e+05 -9.451e+05 -9.439e+05 ...
* month (month) int64 1 2 3 4 5 6
dask and dask.distributed¶dask creates a computational graph of your processing steps, and then executes it as efficiently as possible.
This includes only loading data that is actually needed and only processing things once.
data = xr.open_mfdataset(['DaskTest1.nc', 'DaskTest2.nc'], chunks={'time':10})['data']
avg = data.mean(dim='time')
seasonal = data.groupby('time.season').mean(dim='time')
dask.distributed¶All of these different chunks, and separate processing chains can be run on separate computers.
xarray¶xarray can read various raster formats directly:
However, xarray can't currently read other standard raster formats like:
rasterio¶A nice, Pythonic interface to GDAL - making it really easy to read almost any raster file into Python
import rasterio
with rasterio.open('SPOT_ROI.tif') as src:
data = src.read(1)
print(data)
[[51 51 51 ..., 32 32 29] [51 51 51 ..., 29 30 27] [54 54 52 ..., 30 29 28] ..., [33 34 34 ..., 31 35 35] [34 34 34 ..., 32 34 34] [34 34 34 ..., 31 33 38]]
All we need to do is write some functions to read from rasterio and create a DataArray
But it's a bit more difficult than that...
data = rasterio_to_xarray('SPOT_ROI.tif')
data
rasterio_to_xarray.py:29: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future. data = np.where(data == src.nodata, np.nan, data)
<xarray.DataArray (y: 1644, x: 1435)>
array([[ 51., 51., 51., ..., 32., 32., 29.],
[ 51., 51., 51., ..., 29., 30., 27.],
[ 54., 54., 52., ..., 30., 29., 28.],
...,
[ 33., 34., 34., ..., 31., 35., 35.],
[ 34., 34., 34., ..., 32., 34., 34.],
[ 34., 34., 34., ..., 31., 33., 38.]])
Coordinates:
* y (y) float64 1.484e+05 1.484e+05 1.484e+05 1.484e+05 1.484e+05 ...
* x (x) float64 4.301e+05 4.301e+05 4.302e+05 4.302e+05 4.302e+05 ...
Attributes:
crs: +ellps=airy +k=0.999601 +lat_0=49 +lon_0=-2 +no_defs +proj=tmerc +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +x_0=400000 +y_0=-100000
affine: (430130.2396, 10.0, 0.0, 148415.8755, 0.0, -10.0)
Take a large number of files, one from each orbit of a satellite, and put them into one large array with a time dimension.
list_of_data_arrays = [file_to_da(filename) for filename in files]
combined = xr.concat(list_of_data_arrays, dim='time')
combined.shape
(10, 1162, 1240)
combined.coords
Coordinates: * y (y) float64 1.429e+06 1.428e+06 1.427e+06 1.426e+06 1.424e+06 ... * x (x) float64 -9.476e+05 -9.464e+05 -9.451e+05 -9.439e+05 ... * time (time) datetime64[ns] 2003-04-16T14:40:00 2002-03-19T11:20:00 ...
mean = combined.mean(dim='time', keep_attrs=True)
xarray_to_rasterio(mean, 'Mean.tif')
Can xarray do standard geographic processing methods?
DataArray is just a fancy numpy array.values to get the numpy arrayrasterstats is a Python module for doing zonal statistics
We can use the same approach as the Python module, but modified to work with DataArray objects
mean, median, std etc)(There are more efficient ways to do this if no polygons overlap)
xarray is under rapid developmentxarray soon-ishdask is rapidly improving too