Getting started

Introduction

{mlr3spatial} adds [mlr3::DataBackend]s for spatial classes ([terra::SpatRaster], [raster::brick], [stars::stars]). The package is capable of making predictions to objects of these classes via predict_raster(). The return is a spatial object in the class specified in argument format.

Essentially, {mlr3spatial} takes of the burden of converting spatial objects into a plain data.table and then coercing the predicted values back into the spatial object while making sure to not loose the spatial reference.

There are some more goodies in the bag though:

Thanks to mlr3’s ability to predict in parallel with any learner, {mlr3spatial} predictions can also make use of future-based parallelization and speed up the predictions of spatial objects. Often enough, spatial predictions are quite large (in the millions of values) and efficient parallelization can save some time here. See the vignette on “Benchmarking parallel predictions” for details.
predict_raster() can be executed in chunks (argument chunksize), making it possible to execute predictions on large raster files on consumer grade machines. This chunked execution comes with some overhead compared to execution in one block since the prediction heuristic is executed multiple times and the results need to be merged in the end.

In the following, we showcase a step-by-step example how to handle a multi-layer raster object from package {stars}.

Use Case - Landsat7 data as {stars} object

Data Preparation

library("mlr3")
library("mlr3spatial")

First, the TIFF files is read via stars::read_stars() and put into a DataBackendRaster. The DataBackend is then used to create a regression task with the response being band1.

tif = system.file("tif/L7_ETMs.tif", package = "stars")
stack = stars::read_stars(tif)

backend = as_data_backend(stack)
task = as_task_regr(backend, target = "band1")

print(task)
#> <TaskRegr:backend> (122848 x 6)
#> * Target: band1
#> * Properties: -
#> * Features (5):
#>   - dbl (5): band2, band3, band4, band5, band6

For large raster files with millions of values it helps to predict in parallel. To enable this, set learner$parallel_predict = TRUE and initiate a parallel plan via {future}. Since this is only an example, parallelization is not enabled here. Here we will use a simple regression tree as an example learner. In practice you might want to use a different learner - you can find an overview of available learners here.

learner = lrn("regr.rpart")
set.seed(42)
row_ids = sample(1:task$nrow, 500)
learner$train(task, row_ids = row_ids)
#> INFO  [19:35:32.382] 253 
#> INFO  [19:35:32.431] 253

print(learner)
#> <LearnerRegrRpart:regr.rpart>: Regression Tree
#> * Model: rpart
#> * Parameters: xval=0
#> * Packages: mlr3, rpart
#> * Predict Type: response
#> * Feature types: logical, integer, numeric, factor, ordered
#> * Properties: importance, missings, selected_features, weights

Prediction

For prediction predict_spatial() is used. It will return a raster file which contains the predictions. Users can select which R spatial format the returned raster should have.

In the following, we will compare the way to conduct the prediction using {mlr3spatial} with the “native” way of fitting an e1071::svm() model and predicting with terra::predict().

mlr3spatial

ras = predict_spatial(task, learner, format = "stars")
#> INFO  [19:35:32.729] Start raster prediction 
#> INFO  [19:35:32.734] Prediction is executed with a chunksize of 200, 1 chunk(s) in total, 122848 values per chunk 
#> INFO  [19:35:32.791] 220 
#> INFO  [19:35:32.989] 219 
#> INFO  [19:35:32.995] Chunk 1 of 1 finished 
#> INFO  [19:35:33.006] Finished raster prediction in 0 seconds
names(ras) = "cadmium"

print(ras)
#> stars object with 2 dimensions and 1 attribute
#> attribute(s):
#>             Min.  1st Qu.   Median     Mean 3rd Qu.     Max.
#> cadmium  62.3629 70.30233 77.01695 79.05135 89.2809 118.1429
#> dimension(s):
#>   from  to  offset delta                     refsys point values x/y
#> x    1 349  288776  28.5 SIRGAS 2000 / UTM zone 25S FALSE   NULL [x]
#> y    1 352 9120761 -28.5 SIRGAS 2000 / UTM zone 25S FALSE   NULL [y]

stars

Since the layers are merged in a {stars} object, one first need to split them up and convert them into a regular data.table. Next, the column names need to be adjusted to match the ones of the training data. Afterwards, the data.frame generic of predict() can be called. Finally, the predictions need to be injected into a stars object again.

(All of these steps are happening internally in {mlr3spatial}).

rpart_learner = rpart::rpart(band1 ~ ., data = task$data(rows = row_ids))
#> INFO  [19:35:33.171] 216
stars_stack = as.data.table(split(stack, "band"))
stars_stack[, c("x", "y", "X1")] = NULL
colnames(stars_stack) = task$feature_names

stars_pred = predict(rpart_learner, stars_stack)

# subset stars object to one band only
stars_pred_ras = stack[, , , 1]
# rename the layer name
names(stars_pred_ras) = "pred"
# assign predictions
stars_pred_ras$pred = stars_pred

print(stars_pred_ras)
#> stars object with 3 dimensions and 1 attribute
#> attribute(s):
#>          Min.  1st Qu.   Median     Mean 3rd Qu.     Max.
#> pred  62.3629 70.30233 77.01695 79.05135 89.2809 118.1429
#> dimension(s):
#>      from  to  offset delta                     refsys point values x/y
#> x       1 349  288776  28.5 SIRGAS 2000 / UTM zone 25S FALSE   NULL [x]
#> y       1 352 9120761 -28.5 SIRGAS 2000 / UTM zone 25S FALSE   NULL [y]
#> band    1   1      NA    NA                         NA    NA   NULL

Output consistency

Now that we have executed two predictions, we would like to verify that these are actually identical.

all.equal(as.numeric(stars_pred_ras$pred), as.numeric(ras$cadmium))
#> [1] TRUE

Visualization

Finally we can plot the predictions. The color vector is extract from the viridis color palette via dput(viridis::viridis_pal()(5)).

plot(ras, col = c("#440154FF", "#443A83FF", "#31688EFF", "#21908CFF", "#35B779FF", "#8FD744FF", "#FDE725FF"))