Imputing forest attributes with FOSTER

Martin Queinnec

2021-03-29

Introduction

The goal of this vignette is to illustrate how FOSTER can be used to impute ALS-derived forest variables (response variables Y) to a larger area covered by multispectral satellite imagery and topographic data (predictor variables X). We can usually describe an imputation problem by defining two sets of observations: the reference and the target observations. At reference observations, both Y and X variables are defined while only X variables are available at targets. Ultimately, targets are the area where we want to impute response variables.

FOSTER has been designed around the following workflow:

FOSTER workflow

Using FOSTER

Load packages

In order to use the functions of FOSTER directly, we need to attach the package using library(). Otherwise, the functions need to be called explicitly from the foster namespace using foster::function_name(). The raster package is used regularly throughout the workflow to read data with the functions raster, stack or brick (see below). It is therefore recommended to attach the raster package as well.

library(foster)
library(raster)

Main functions implemented in FOSTER

Data preparation

  • matchExtent: match the extent of a raster to the extent of a reference. Cells of the reference having a specific value can be masked in the output raster object.
  • matchResolution: successively project and resample a raster to the coordinate system and spatial resolution of a reference raster. The input layer keeps its original extent instead of inheriting the reference's.
  • focalMultiBand: apply a spatial filter (function) in the neighborhood of each cell.
  • edges: assign NA values to cells located in the neighborhood of other cells having NA values. This can be used, for example, to avoid sampling cells located close to borders.
  • tile: split a raster into smaller tiles. Can be used to reduce the amount of data processed at a time and avoid memory issues.

Stratified random sampling

  • getSample: perform a k-means clustering of a raster and randomly sample cells within each cluster proportionally to the presence of those clusters across the entire raster
  • getSampleValues: extract the values of a raster at sample points

Calculate spectral indices and time series-based metrics

These functions support both raster and point features (ESRI Shapefiles), allowing predictor variables to be calculated either wall-to-wall or only at sampled observations (see the sketch after this list)

  • calcIndices: calculate a set of spectral indices from multispectral data
  • temporalMetrics: summarize variables time series in a few metrics (e.g. mean, median, slope, IQR)
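
As a minimal sketch of the point-feature case (spectral_samples is a hypothetical SpatialPointsDataFrame of Landsat band values, e.g. extracted with getSampleValues; red and nir are assumed to index its attribute columns):

#Sketch only: 'spectral_samples' is a hypothetical SpatialPointsDataFrame of
#Landsat band values; red/nir are assumed to point to its attribute columns
NDVI_samples <- calcIndices(spectral_samples,
                            indices = "NDVI",
                            red = 3,
                            nir = 4)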

Train a k-NN model and assess its accuracy

  • partition: split samples into training and testing sets
  • trainNN: train a k-NN model from the training set and use the trained model to impute the Y variables on the testing set. Also returns the observed and imputed response variables of the testing set, from which model accuracy can be assessed
  • accuracy: compute accuracy metrics from observed and predicted variables
  • scatter: create a scatterplot of observed against predicted variables
  • varImp: return the importance of each predictor variable if random forest is used to calculate the nearest neighbors

Impute response variables at targets

  • predictTrgs: impute response variables from a trained k-NN model and predictor variables at targets.

Data types

Two types of data are encountered when using FOSTER: raster data for wall-to-wall variables and vectors for variables extracted at sample points (point features).

Reading data from disk

Raster data

If the raster contains a single layer, it can be read with raster() to create a RasterLayer object. If the raster has multiple layers, stack() (RasterStack) or brick() (RasterBrick) should be used. RasterBrick objects are usually more efficient to process but can only point to a single file, while a RasterStack can be assembled from layers coming from different file sources. RasterStack objects can also be created from multiple RasterLayer objects. The functions raster, stack and brick take the filename of the raster (full path to the file or relative path from your current working directory) as an argument. Please refer to the documentation of these functions to learn more about the various options and file types supported by the raster package.

#Read single layer raster
raster_layer <- raster("path/to/raster_layer.tif")
#Read multi-layer raster
raster_stack <- stack("path/to/raster_stack.tif")
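
A RasterStack can also be assembled from several RasterLayer objects already in memory (the paths below are placeholders):

#Combine RasterLayer objects into a RasterStack (placeholder paths)
raster_layer_2 <- raster("path/to/another_layer.tif")
raster_stack_2 <- stack(raster_layer, raster_layer_2)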

Vector data

In order to read a shapefile from disk, we use readOGR from the rgdal package, providing the directory where the file is located in dsn and the name of the layer without the .shp extension. Please refer to the documentation of rgdal for supported file types and other options. The data will be stored in a SpatialPointsDataFrame object.

dem_samples <- rgdal::readOGR(dsn = "path/to/directory", 
                              layer = "layer_name")

Write data to disk

Raster data

We can choose to process data in memory or write the output to disk. When dealing with small raster objects, it is safe to process and save everything in memory. However, it is advised to write data to disk when processing large datasets. Writing raster data to disk can be done with the function raster::writeRaster, which takes as arguments at least the name of the raster object and its full output filename (including the path and, optionally, the extension type). Whenever a FOSTER function returning a raster object is called, the filename can be provided directly in the function call and writeRaster will automatically be used to save the output to disk. The functions also usually support ... arguments where additional parameters controlling writeRaster can be provided (e.g. overwrite to overwrite existing files, format to specify the output file type if not given in filename, bylayer to write each layer of the raster object individually).

#Example to write output of calcIndices to disk using different options
ind <- calcIndices(x, indices = c("NDVI", "TCG", "TCW"), red = 3, nir = 4, filename = "full/path/to/filename.tif")
ind <- calcIndices(x, indices = c("NDVI", "TCG", "TCW"), red = 3, nir = 4, filename = "full/path/to/filename", format = "GTiff", overwrite = TRUE, bylayer = TRUE)

Whenever filename is kept at its default value "" (empty character), the output raster will be processed and saved in memory if possible. If the file is too large to be processed in memory and no filename is provided, the raster package will automatically write the output to a temporary folder. The location of the temporary folder and the maximum size that can be processed in memory can be found (among other options) using rasterOptions(). Global options of the raster package can be changed by calling rasterOptions() with the option given as a named argument (rasterOptions(optionName = optionValue)). It is for example recommended to change the default temp directory so that it can easily be accessed and cleared if necessary.

rasterOptions(tmpdir = "path/to/tempdir")

Vector data

The functions getSample, getSampleValues, calcIndices and temporalMetrics can return SpatialPointsDataFrame objects. These objects are usually relatively small and can easily be processed in memory. However, as for raster data, it is possible to provide the name of the output under the filename argument (full path to the output; the file extension is not necessary). Only ESRI Shapefiles are written by FOSTER, hence any other file extension provided in filename will be replaced by .shp.
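
As a hedged sketch, the output of getSampleValues could be written directly to a shapefile through the filename argument (the object names and output path below are placeholders):

#Sketch: extract sample values and write them to an ESRI Shapefile
#("path/to/output/samples" is a placeholder; FOSTER saves the output as .shp)
samples_values <- getSampleValues(raster_stack, samples,
                                  filename = "path/to/output/samples")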

Optimize computing performance

The functions calcIndices, temporalMetrics and predictTrgs support parallel processing. To enable parallel processing, set par = TRUE and specify the number of parallel threads with threads. When parallel processing is performed, the input data is divided into chunks and each cluster processes one chunk at a time. For calcIndices and temporalMetrics, you can specify the number of chunks each cluster will process with the argument m (the raster will be divided into m x threads blocks).
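
As an illustrative sketch (x stands for any multispectral raster; argument values are illustrative only):

#Sketch: compute NDVI on 4 parallel threads, each thread processing
#2 chunks at a time (the raster is divided into m x threads = 8 blocks)
ind_par <- calcIndices(x,
                       indices = "NDVI",
                       red = 3,
                       nir = 4,
                       par = TRUE,
                       threads = 4,
                       m = 2)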

For predictTrgs, controlling how data is processed is slightly different. Memory issues can occur when processing too much data at once, because large matrices need to be stored in memory. The argument nrows specifies the number of rows processed at a time (or per cluster when using parallel processing). By default nrows = 200, which may not be the optimal value depending on the size of the dataset and the memory available on the computer. In general, increasing nrows speeds up computing but also increases the risk of running into memory issues. To choose the best value of nrows, it is suggested to make a few test runs and monitor the memory usage of the computer to see whether nrows should be increased or decreased.
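
As a sketch (assuming a trained k-NN model kNN_model and a predictor raster X_vars, as created later in this vignette), a more conservative predictTrgs call could be:

#Sketch: impute targets 100 rows at a time on 4 parallel threads
#(nrows and threads are illustrative and should be tuned to available memory)
Y_imputed <- predictTrgs(model = kNN_model,
                         x = X_vars,
                         nrows = 100,
                         par = TRUE,
                         threads = 4)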

Description of the example

We illustrate how FOSTER can be used by imputing two ALS-derived variables: the 95th percentile of first returns height (elev_p95) and canopy cover above mean height (cover). elev_p95 and cover have been calculated on a 20 m x 20 m grid from an ALS point cloud. For this example, we assume that only three 500 m x 4 km ALS strips are available. The goal is to impute these two ALS metrics on a 4 km x 4 km area only partially covered by the ALS strips.

We will use the following predictor variables:

  • temporal summary metrics (median, IQR and Theil-Sen slope) of spectral indices (NDVI, TCB, TCG and TCW) calculated from annual Landsat surface reflectance composites (2006 - 2008)
  • elevation (DEM) and terrain slope derived from the ASTER GDEM

The imputation in FOSTER relies on a Random Forest k-NN model (the measure of nearness is based on the Random Forest proximity matrix) and is implemented with the yaImpute package.

Input data

ALS-derived forest attributes maps

We load elev_p95 and cover from .tif files and stack them into a RasterStack object Y_vars using stack() from the raster package.

elev_p95 <- raster(system.file("extdata/vignette/ALS_metrics/ALS_metrics_p95.tif",package="foster"))
cover <- raster(system.file("extdata/vignette/ALS_metrics/ALS_metrics_cov_mean.tif",package="foster"))
Y_vars <- stack(elev_p95,cover)

#Set layers names
names(Y_vars) <- c("p95","cover")
Y_vars
#> class      : RasterStack 
#> dimensions : 200, 201, 40200, 2  (nrow, ncol, ncell, nlayers)
#> resolution : 20, 20  (x, y)
#> extent     : 584960, 588980, 5811720, 5815720  (xmin, xmax, ymin, ymax)
#> crs        : +proj=utm +zone=10 +datum=NAD83 +units=m +no_defs 
#> names      :      p95,    cover 
#> min values : 0.300000, 5.038168 
#> max values :  43.9250,  91.0518
plot(Y_vars)

Multispectral data - Annual Landsat composites

Multispectral data are derived from a 3-year time series (2006 - 2008) of Landsat surface reflectance composite images (30 m x 30 m resolution). We load the data from 2006 as an example:

spectral_2006 <- stack(system.file("extdata/vignette/Landsat_BAP/Landsat_BAP_2006.tif",package="foster"))
names(spectral_2006) <- c("blue","green","red","nir","swir1","swir2")

spectral_2006
#> class      : RasterStack 
#> dimensions : 133, 134, 17822, 6  (nrow, ncol, ncell, nlayers)
#> resolution : 30, 30  (x, y)
#> extent     : 584955, 588975, 5811729, 5815719  (xmin, xmax, ymin, ymax)
#> crs        : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs 
#> names      : blue, green,  red,  nir, swir1, swir2 
#> min values :    7,    89,   59,   42,    13,    18 
#> max values :  988,  1133, 1421, 4364,  3285,  2258
plot(spectral_2006, col = grey.colors(255))

Topographic data

Elevation (DEM) and terrain slope (DEM_slope) data are derived from the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) global Digital Elevation Model (GDEM, v.2). Both DEM and DEM_slope were resampled to a 30 m spatial resolution and aligned with the multispectral data grid.

dem <- raster(system.file("extdata/vignette/topo/DEM.tif",package="foster"))
dem_slope <- raster(system.file("extdata/vignette/topo/DEM_slope.tif",package="foster"))

plot(dem)

plot(dem_slope)

Mask of forested areas

A mask of forested areas was derived from a landcover dataset of 30 m spatial resolution and aligned with multispectral data grid. Cells have the value 1 if forested and NA otherwise.

mask_forest <- raster(system.file("extdata/vignette/landcover/VLCE_forest_2008.tif",package="foster"))
plot(mask_forest)

Data preparation

In this example, the ALS metrics (20 m x 20 m) and the predictor variables (30 m x 30 m) have different spatial resolutions.

In order to integrate the ALS metrics with the predictor variables, the first step is to resample Y_vars to match the spatial resolution, CRS and origin of the Landsat data.

Y_vars_resampled <- matchResolution(x = Y_vars,
                                    ref = spectral_2006,
                                    method='bilinear')
#> Warning in matchResolution.Raster(x = Y_vars, ref = spectral_2006, method = "bilinear"): x and ref don't have the same CRS. x is projected to ref CRS before
#>             resampling
Y_vars_resampled
#> class      : RasterBrick 
#> dimensions : 133, 134, 17822, 2  (nrow, ncol, ncell, nlayers)
#> resolution : 30, 30  (x, y)
#> extent     : 584955, 588975, 5811729, 5815719  (xmin, xmax, ymin, ymax)
#> crs        : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs 
#> source     : memory
#> names      :      p95,    cover 
#> min values : 0.342500, 6.296798 
#> max values : 41.32672, 89.81183

The response variables now have a spatial resolution of 30 m x 30 m and are aligned with the Landsat image grid.

Smoothing response and predictor variables can help reduce noise and potential spatial misalignment between the Landsat and ALS data, and improve estimation accuracy. In this example, we smooth the data by assigning to each cell the mean of its 3 x 3 neighborhood. The function focalMultiBand requires a weight matrix that defines the size of the window and the weights applied to neighboring cells. Here we use a 3 x 3 weight matrix with all weights set to 1. Next, we provide the function applied to the neighboring cell values (here mean). Using na.rm = TRUE allows the mean to be calculated even if NA values occur in the neighborhood. Using pad = TRUE with padValue = NA adds rows and columns of NA values around the borders of Y_vars_resampled in order to keep the original raster extent. Finally, keepNA = TRUE assigns NA values back to cells that had an NA value in Y_vars_resampled but were given a non-NA value by the spatial filter. Multispectral data will be smoothed later on, after calculating spectral indices.

filt <- matrix(1,nrow=3,ncol=3)
Y_vars_smooth <- focalMultiBand(Y_vars_resampled,
                                w=filt,
                                fun=mean,
                                pad=TRUE,
                                padValue=NA, 
                                na.rm=TRUE, 
                                keepNA = TRUE)
plot(Y_vars_smooth)

From now on, we will focus only on cells that are classified as forest by the land cover map. In order to mask ALS metric cells that are not forested, we need to make sure that the ALS metrics and the land cover map have the same extent. This is taken care of internally by the function matchExtent.

Y_vars_mask <- matchExtent(Y_vars_smooth,
                           mask_forest,
                           mask=TRUE,
                           maskValue = NA)
plot(Y_vars_mask)


# We do the same with the DEM and slope

dem_mask <- matchExtent(dem, 
                        mask_forest, 
                        mask = TRUE)

dem_slope_mask <- matchExtent(dem_slope, 
                        mask_forest, 
                        mask = TRUE)

Temporal summary metrics of spectral indices time series

Spectral indices time series

The function calcIndices is used to calculate spectral indices from multispectral data. The spectral indices calculation is based on the RStoolbox package. The list of supported spectral indices and their formulas is described in the documentation of the spectralIndices function of the RStoolbox package. The layer indices corresponding to the blue, green, red, nir and swir bands need to be provided, depending on the indices being calculated.

Tasseled Cap brightness (TCB), greenness (TCG) and wetness (TCW) can also be calculated. The calculation of Tasseled Cap indices is based on the function tasseledCap from the RStoolbox package. The name of the satellite that acquired the data has to be provided under sat, and the bands have to be provided in a specific order, as explained in the documentation of RStoolbox::tasseledCap (bands 1, 2, 3, 4, 5, 7 for Landsat5TM).

For this example, we calculate the Tasseled Cap Brightness (TCB), Greenness (TCG), Wetness (TCW) and NDVI. NDVI requires red and nir bands only.

indices_list <- c("NDVI","TCB","TCW","TCG")
# Example for one year
VI_2006 <- calcIndices(spectral_2006,
                   indices = indices_list, 
                   sat="Landsat5TM", 
                   red=3, 
                   nir=4)
plot(VI_2006[["TCG"]])

To process multiple years at once, all RasterStack objects (one per year) can be placed in a list.

# List all filenames in time-series order: 2006 to 2008
spectral_ts_files <- list(system.file("extdata/vignette/Landsat_BAP/Landsat_BAP_2006.tif",package = "foster"), 
                    system.file("extdata/vignette/Landsat_BAP/Landsat_BAP_2007.tif",package = "foster"), 
                    system.file("extdata/vignette/Landsat_BAP/Landsat_BAP_2008.tif",package = "foster"))
  
# Open all Landsat images and place them in a list
spectral_ts <- lapply(spectral_ts_files, stack)
spectral_ts
#> [[1]]
#> class      : RasterStack 
#> dimensions : 133, 134, 17822, 6  (nrow, ncol, ncell, nlayers)
#> resolution : 30, 30  (x, y)
#> extent     : 584955, 588975, 5811729, 5815719  (xmin, xmax, ymin, ymax)
#> crs        : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs 
#> names      : Landsat_BAP_2006.1, Landsat_BAP_2006.2, Landsat_BAP_2006.3, Landsat_BAP_2006.4, Landsat_BAP_2006.5, Landsat_BAP_2006.6 
#> min values :                  7,                 89,                 59,                 42,                 13,                 18 
#> max values :                988,               1133,               1421,               4364,               3285,               2258 
#> 
#> 
#> [[2]]
#> class      : RasterStack 
#> dimensions : 133, 134, 17822, 6  (nrow, ncol, ncell, nlayers)
#> resolution : 30, 30  (x, y)
#> extent     : 584955, 588975, 5811729, 5815719  (xmin, xmax, ymin, ymax)
#> crs        : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs 
#> names      : Landsat_BAP_2007.1, Landsat_BAP_2007.2, Landsat_BAP_2007.3, Landsat_BAP_2007.4, Landsat_BAP_2007.5, Landsat_BAP_2007.6 
#> min values :                 33,                 93,                 64,                125,                 20,                 17 
#> max values :                685,               1065,               1211,               4494,               3108,               2135 
#> 
#> 
#> [[3]]
#> class      : RasterStack 
#> dimensions : 133, 134, 17822, 6  (nrow, ncol, ncell, nlayers)
#> resolution : 30, 30  (x, y)
#> extent     : 584955, 588975, 5811729, 5815719  (xmin, xmax, ymin, ymax)
#> crs        : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs 
#> names      : Landsat_BAP_2008.1, Landsat_BAP_2008.2, Landsat_BAP_2008.3, Landsat_BAP_2008.4, Landsat_BAP_2008.5, Landsat_BAP_2008.6 
#> min values :                 19,                 98,                 39,                131,                 17,                 16 
#> max values :                629,                965,               1082,               4251,               2845,               1954
VI_ts <- calcIndices(spectral_ts,
                     indices = indices_list, 
                     sat="Landsat5TM", 
                     red=3, 
                     nir=4)

# The output is a list with VI for each year
VI_ts
#> [[1]]
#> class      : RasterStack 
#> dimensions : 133, 134, 17822, 4  (nrow, ncol, ncell, nlayers)
#> resolution : 30, 30  (x, y)
#> extent     : 584955, 588975, 5811729, 5815719  (xmin, xmax, ymin, ymax)
#> crs        : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs 
#> names      :         NDVI,          TCB,          TCW,          TCG 
#> min values :   -0.3777778,  158.6998000,   23.4080000,  -76.8098000 
#> max values :    0.9267045, 4436.6082000, 2004.1029000, 2850.4314000 
#> 
#> 
#> [[2]]
#> class      : RasterStack 
#> dimensions : 133, 134, 17822, 4  (nrow, ncol, ncell, nlayers)
#> resolution : 30, 30  (x, y)
#> extent     : 584955, 588975, 5811729, 5815719  (xmin, xmax, ymin, ymax)
#> crs        : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs 
#> names      :        NDVI,         TCB,         TCW,         TCG 
#> min values :   0.1261261, 198.2568000,  47.2045000,  -7.9697000 
#> max values :    0.900304, 4265.392400, 2022.956900, 2845.887100 
#> 
#> 
#> [[3]]
#> class      : RasterStack 
#> dimensions : 133, 134, 17822, 4  (nrow, ncol, ncell, nlayers)
#> resolution : 30, 30  (x, y)
#> extent     : 584955, 588975, 5811729, 5815719  (xmin, xmax, ymin, ymax)
#> crs        : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs 
#> names      :         NDVI,          TCB,          TCW,          TCG 
#> min values :    0.1155116,  206.6494000,   21.4235000,    2.0134000 
#> max values :    0.9105449, 3826.3472000, 1813.1592000, 2978.8278000

We can visualize the vegetation indices of the first year of the time series

plot(VI_ts[[1]])

Once the indices have been calculated, we can smooth them with a 3 x 3 mean filter, as was done with the resampled ALS metrics maps.

VI_ts_smooth <- focalMultiBand(VI_ts,
                             w=filt,
                             fun=mean,
                             na.rm=TRUE,
                             pad = TRUE,
                             keepNA = TRUE)

We can visualize the smoothed vegetation indices of the first year of the time series

plot(VI_ts_smooth[[1]])

Spectral indices temporal summary metrics

We will now calculate temporal summary metrics of the vegetation indices time series.

The function temporalMetrics requires the name of a function that returns summary metrics of a numeric vector (e.g. mean, standard deviation). The default function used by temporalMetrics returns the median, IQR and Theil-Sen slope. However, the user can also define another function that returns metrics specific to their needs. For illustration, we create the function funSummary, which returns the same metrics as the default function. Ideally, gap-free composite multispectral images should be used, but when creating your own function you can handle cases where NA values occur in the time series (by setting the na.rm argument to TRUE or FALSE).

funSummary <- function(x){
  
  c(
    median = median(x,na.rm=TRUE),
    IQR = IQR(x,na.rm=TRUE),
    slope = theilSen(x)
  )
}

We can now calculate the temporal summary metrics of the 3-year vegetation indices time series (here applied to VI_ts; the smoothed VI_ts_smooth created above can be used in exactly the same way).

VI_ts_metrics <- temporalMetrics(VI_ts,
                               metrics="funSummary")
#> Working on NDVI
#> Working on TCB
#> Working on TCW
#> Working on TCG

The output is a RasterStack where each band is a temporal summary metric of a vegetation index. In this example, there are 12 bands (4 vegetation indices, 3 temporal summary metrics).

VI_ts_metrics
#> class      : RasterStack 
#> dimensions : 133, 134, 17822, 12  (nrow, ncol, ncell, nlayers)
#> resolution : 30, 30  (x, y)
#> extent     : 584955, 588975, 5811729, 5815719  (xmin, xmax, ymin, ymax)
#> crs        : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs 
#> names      :   NDVI_median,      NDVI_IQR,    NDVI_slope,    TCB_median,       TCB_IQR,     TCB_slope,    TCW_median,       TCW_IQR,     TCW_slope,    TCG_median,       TCG_IQR,     TCG_slope 
#> min values :  1.261261e-01,  1.286144e-05, -2.372356e-01,  1.982568e+02,  5.820000e-02, -1.034494e+03,  4.720450e+01,  8.410000e-02, -3.770112e+02, -7.969700e+00,  5.355000e-02, -3.133347e+02 
#> max values :     0.9029524,     0.3609146,     0.3609146,  4175.8897000,  1349.0888500,  1100.0643000,  1909.1103000,   635.7759000,   530.8367500,  2794.7694000,   634.9761000,   634.9761000

plot(VI_ts_metrics[["NDVI_median"]])

Stratified random sampling

Sample points need to be extracted from the ALS metrics maps in order to train and test the imputation model. To avoid selecting cells on forest edges or close to the ALS extent boundaries, we use edges to assign NA values to all cells located in a 3 x 3 neighborhood of any cell having an NA value.

Y_vars_edges <- edges(Y_vars_mask,
                      w=3)
plot(Y_vars_edges)

In FOSTER, stratification is performed using the k-means algorithm, which classifies the data into clusters based on the proximity between observations. Random sampling is then performed within each stratum, with the number of sampled points in each stratum proportional to the occurrence of that stratum across the classified raster. For this example, we want to extract nSamples = 230 sample points from nClasses = 5 strata, separated by a minimum distance of at least mindist = 75 meters. Due to the small study area of this example, we have to select a relatively low number of samples and set a low mindist requirement. However, it is advised to increase the number of sample points and use a larger mindist to reduce spatial autocorrelation within the samples. We use norm = TRUE to normalize the variables prior to k-means clustering. Since we use xy = TRUE, the output is a SpatialPointsDataFrame with the x and y coordinates added as fields.

The k-means clustering is performed internally with the RStoolbox package.

nSamples = 230
nClasses = 5
mindist = 75

set.seed(1234) #For example reproducibility
sample_strata <- getSample(Y_vars_edges, 
                     layers = c("p95","cover"), 
                     n = nSamples, 
                     strata = nClasses,
                     mindist = mindist, 
                     norm = TRUE,
                     xy = TRUE)
#> cluster 2: 69 samples to select
#> cluster 5: 53 samples to select
#> cluster 3: 47 samples to select
#> cluster 4: 42 samples to select
#> cluster 1: 19 samples to select

The returned sample_strata object is a list containing the sampled points (SpatialPointsDataFrame), the map of k-means strata (RasterLayer) and the k-means model.

# Sampled points
sampleLoc <- sample_strata$sample

# Map of strata
strata_map <- sample_strata$clusterMap

# k-means model
kmeans_model <- sample_strata$model

We can plot the sampled points on top of the map of strata:

plot(strata_map)
plot(sampleLoc,add=TRUE)

Train a k-NN model

Once all predictor and response variables have been pre-processed and the reference observations sampled, we have all the data needed to train a k-NN model using trainNN.

We start by putting all the response variables and predictor variables in a RasterStack

# Predictor variables
X_vars <- stack(VI_ts_metrics, 
                dem_mask, 
                dem_slope_mask)

# Response variables

Y_vars <- Y_vars_mask

Then, we extract the values of the predictor and response variables at the sampled reference observations


# Extract values at sample
X_vars_sample <- getSampleValues(X_vars, sampleLoc)
Y_vars_sample <- getSampleValues(Y_vars, sampleLoc)

X_vars_sample
#> class       : SpatialPointsDataFrame 
#> features    : 230 
#> extent      : 585000, 588930, 5812104, 5814894  (xmin, xmax, ymin, ymax)
#> crs         : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs 
#> variables   : 14
#> names       :       NDVI_median,             NDVI_IQR,          NDVI_slope, TCB_median,          TCB_IQR,  TCB_slope, TCW_median,          TCW_IQR, TCW_slope, TCG_median,           TCG_IQR,  TCG_slope,              DEM,         DEM_slope 
#> min values  : 0.657878602084611, 0.000112942552103035, -0.0435291975380104,   926.3602, 1.45570000000009, -360.48505,   424.9076, 1.03430000000003, -143.5368,   662.8405, 0.528500000000008, -234.09925, 1001.80657958984, 0.379797607660294 
#> max values  : 0.884040995607614,   0.0894371726547331,  0.0820109732728729,  3188.0992,        416.61495,   283.7946,  1447.1688,        172.70425,  138.6612,  2572.5088,         382.62955,  382.62955,             1256,  19.1656856536865
Y_vars_sample
#> class       : SpatialPointsDataFrame 
#> features    : 230 
#> extent      : 585000, 588930, 5812104, 5814894  (xmin, xmax, ymin, ymax)
#> crs         : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs 
#> variables   : 2
#> names       :              p95,            cover 
#> min values  : 2.94165027590505, 37.4980213333663 
#> max values  : 34.6181279186167, 85.8514768214529

In order to assess the accuracy of the k-NN imputation, we will split the sampled points into training and validation sets. The function partition can be used for that purpose. Three methods are implemented: "random holdout", where a randomly selected percentage of the data is left out for testing; "group holdout", where the data is first grouped by quantiles and a percentage of the data within each group is left out for testing; and "kfold", where k folds, each containing the same percentage of data left out for testing, are created (cross-validation).

In this example, we perform a k-fold cross validation with 5 folds. We also split the observations based on the k-means strata obtained from stratified random sampling.

#Create data partition
set.seed(1234) #for example reproducibility
train_idx <- partition(sampleLoc$cluster,
                       type="kfold", 
                       kfold = 5,
                       returnTrain = TRUE)

train_idx
#> $Fold1
#>   [1]   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
#>  [19]  20  21  22  25  26  27  28  29  30  32  33  34  35  36  37  39  40  41
#>  [37]  42  45  47  48  49  50  52  54  56  59  60  61  62  63  65  66  67  69
#>  [55]  70  71  77  79  81  82  83  84  85  86  87  88  89  90  91  92  93  95
#>  [73]  96  97  98  99 100 101 102 103 104 106 107 108 109 110 111 112 113 114
#>  [91] 116 117 118 119 120 122 123 124 125 126 127 128 130 131 132 135 136 137
#> [109] 138 139 140 141 142 144 145 147 148 149 150 151 152 154 155 156 157 158
#> [127] 160 161 162 163 164 165 168 169 170 171 172 174 175 176 177 179 180 181
#> [145] 182 183 185 186 188 189 190 193 194 195 196 197 199 200 202 203 204 205
#> [163] 206 207 208 209 210 211 213 214 215 216 218 219 220 221 222 223 224 225
#> [181] 226 227 228 229 230
#> 
#> $Fold2
#>   [1]   1   3   4   7   8   9  12  13  14  19  20  21  22  23  24  26  27  28
#>  [19]  29  30  31  32  34  35  36  37  38  39  40  42  43  44  45  46  47  48
#>  [37]  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66
#>  [55]  68  69  72  73  74  75  76  77  78  80  82  83  84  85  87  88  89  90
#>  [73]  91  92  93  94  95  96  97  99 100 101 102 104 105 106 107 108 110 112
#>  [91] 113 115 116 117 118 119 120 121 122 125 126 127 128 129 130 131 132 133
#> [109] 134 135 136 137 138 139 141 142 143 144 145 146 147 148 149 151 152 153
#> [127] 155 157 158 159 161 162 163 165 166 167 169 170 172 173 175 176 178 179
#> [145] 180 181 182 183 184 185 186 187 188 189 190 191 192 193 195 196 197 198
#> [163] 201 202 203 204 206 207 208 211 212 213 214 215 216 217 218 219 220 221
#> [181] 224 225 229 230
#> 
#> $Fold3
#>   [1]   1   2   3   4   5   6   8   9  10  11  12  14  15  16  17  18  20  23
#>  [19]  24  25  27  29  31  32  33  34  36  38  41  42  43  44  45  46  47  48
#>  [37]  49  51  52  53  55  56  57  58  59  61  62  63  64  66  67  68  69  70
#>  [55]  71  72  73  74  75  76  77  78  79  80  81  83  84  85  86  87  88  89
#>  [73]  90  91  93  94  96  98  99 101 102 103 104 105 108 109 110 111 112 113
#>  [91] 114 115 119 120 121 122 123 124 125 126 127 129 130 131 132 133 134 135
#> [109] 136 137 138 139 140 141 142 143 145 146 150 151 152 153 154 156 157 158
#> [127] 159 160 163 164 166 167 168 169 170 171 173 174 175 177 178 180 181 182
#> [145] 184 185 187 189 191 192 193 194 195 196 197 198 199 200 201 202 204 205
#> [163] 206 207 209 210 211 212 213 215 216 217 218 219 220 221 222 223 225 226
#> [181] 227 228 229 230
#> 
#> $Fold4
#>   [1]   1   2   3   4   5   6   7   8   9  10  11  13  14  15  16  17  18  19
#>  [19]  21  22  23  24  25  26  28  29  30  31  33  35  36  37  38  39  40  41
#>  [37]  43  44  46  47  48  49  50  51  52  53  54  55  57  58  60  63  64  65
#>  [55]  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  86
#>  [73]  89  91  92  93  94  95  96  97  98 100 102 103 105 106 107 108 109 110
#>  [91] 111 113 114 115 116 117 118 121 122 123 124 127 128 129 132 133 134 135
#> [109] 138 140 141 143 144 145 146 147 148 149 150 151 153 154 155 156 157 159
#> [127] 160 161 162 163 164 165 166 167 168 169 171 172 173 174 175 176 177 178
#> [145] 179 180 181 182 183 184 186 187 188 190 191 192 194 195 196 197 198 199
#> [163] 200 201 203 204 205 208 209 210 212 214 217 219 221 222 223 224 225 226
#> [181] 227 228 230
#> 
#> $Fold5
#>   [1]   1   2   5   6   7  10  11  12  13  15  16  17  18  19  20  21  22  23
#>  [19]  24  25  26  27  28  30  31  32  33  34  35  37  38  39  40  41  42  43
#>  [37]  44  45  46  50  51  53  54  55  56  57  58  59  60  61  62  64  65  66
#>  [55]  67  68  70  71  72  73  74  75  76  78  79  80  81  82  84  85  86  87
#>  [73]  88  90  92  94  95  97  98  99 100 101 103 104 105 106 107 109 111 112
#>  [91] 114 115 116 117 118 119 120 121 123 124 125 126 128 129 130 131 133 134
#> [109] 136 137 139 140 142 143 144 146 147 148 149 150 152 153 154 155 156 158
#> [127] 159 160 161 162 164 165 166 167 168 170 171 172 173 174 176 177 178 179
#> [145] 183 184 185 186 187 188 189 190 191 192 193 194 198 199 200 201 202 203
#> [163] 205 206 207 208 209 210 211 212 213 214 215 216 217 218 220 222 223 224
#> [181] 226 227 228 229

We can then train a random forest (RF) based k-NN model by specifying the method randomForest and, optionally, setting the number of trees (ntree) and the number of predictor variables evaluated at each node of the RF trees (mtry). We consider only the nearest neighbor by setting k = 1. If we supply the indices of the training samples (inTrain), trainNN returns a data.frame with the observed and imputed response variables of the observations that are not in the training set.

set.seed(1234) #for example reproducibility
kNN <- trainNN(x = X_vars_sample,
              y=Y_vars_sample,
              inTrain = train_idx, 
              k = 1, 
              method = "randomForest",
              ntree = 200)
#> Loading required namespace: randomForest

The output of trainNN is a list with an element model, which is the trained k-NN model, and an element preds, which is a data frame with the predictions and observations of the sets used for validation.

kNN_model <- kNN$model

kNN_preds <- kNN$preds
head(kNN_preds, 10)
#>      ID variable      obs    preds Fold
#> 1   116    cover 77.49748 71.48406    1
#> 2   116      p95 21.20986 15.99713    1
#> 3   118    cover 55.25654 56.25958    1
#> 4   118      p95 28.13454 24.08258    1
#> 5  1231    cover 39.20789 37.49802    1
#> 6  1231      p95  2.94165  3.04840    1
#> 7   145    cover 77.56484 72.18119    1
#> 8   145      p95 28.84438 25.91934    1
#> 9   160    cover 68.39973 63.37382    1
#> 10  160      p95 20.95739 19.23610    1

Accuracy assessment

We can use the function accuracy to compute a few accuracy summary metrics

accuracy(obs = kNN_preds$obs, 
         preds = kNN_preds$preds, 
         vars = kNN_preds$variable, 
         folds = kNN_preds$Fold)
#> Registered S3 method overwritten by 'cli':
#>   method     from         
#>   print.boxx spatstat.geom
#> # A tibble: 12 x 8
#> # Groups:   folds [6]
#>    folds vars     R2  RMSE     bias RMSE_rel bias_rel count
#>    <chr> <fct> <dbl> <dbl>    <dbl>    <dbl>    <dbl> <int>
#>  1 1     p95   0.488  4.96 -2.76        23.7 -13.2       45
#>  2 1     cover 0.584  7.00 -1.21        10.7  -1.85      45
#>  3 2     p95   0.438  5.85 -0.146       28.9  -0.721     46
#>  4 2     cover 0.361  8.21  0.00709     12.9   0.0112    46
#>  5 3     p95   0.294  5.48  0.510       27.9   2.59      46
#>  6 3     cover 0.329  8.36 -1.21        13.2  -1.91      46
#>  7 4     p95   0.663  4.82 -0.762       24.9  -3.94      47
#>  8 4     cover 0.542  7.57 -1.09        12.3  -1.77      47
#>  9 5     p95   0.357  5.76  0.0207      29.0   0.104     46
#> 10 5     cover 0.535  7.77  0.0547      11.9   0.0838    46
#> 11 all   p95   0.448  5.38 -0.627       26.9  -3.03     230
#> 12 all   cover 0.470  7.78 -0.690       12.2  -1.08     230

The function scatter can be used to create scatterplots (ggplot2 objects) of the predicted against observed values of the testing set.

plots_scatter <- scatter(obs = kNN_preds$obs, 
         preds = kNN_preds$preds, 
         vars = kNN_preds$variable)

plots_scatter$cover


plots_scatter$p95

We can also get the most important variables of the RF-based kNN model and plot them in a boxplot or in a heatmap

imp <- varImp(kNN_model,scaled=FALSE,plot=TRUE,plotType="boxplot")
imp$plot

imp <- varImp(kNN$model,scaled=TRUE,plot=TRUE,plotType="grid")
imp$plot

Impute response variables at targets

Once the k-NN model has been trained and all predictor variables have been calculated at the targets, we can impute the response variables at targets using predictTrgs. This function requires the trained model and a raster where each layer is one of the model's predictor variables. The layer names should exactly match the names of the predictor variables used during training. The function returns one raster layer per response variable and, optionally, the nearest neighbor ID and distance of each target cell (the ID refers to the rownames of the Y_vars_sample observations used to train the k-NN model).

Y_imputed <- predictTrgs(model=kNN_model,
                         x = X_vars, 
                         nnID = TRUE, 
                         nnDist = TRUE)
Y_imputed
#> class      : RasterBrick 
#> dimensions : 133, 134, 17822, 4  (nrow, ncol, ncell, nlayers)
#> resolution : 30, 30  (x, y)
#> extent     : 584955, 588975, 5811729, 5815719  (xmin, xmax, ymin, ymax)
#> crs        : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs 
#> source     : C:/Users/queinnec.stu/AppData/Local/Temp/Rtmp8GcHpl/raster/r_tmp_2021-03-29_154702_29836_76704.grd 
#> names      :        p95,      cover,      nnID1,    nnDist1 
#> min values :    2.94165,   37.49802,   10.00000,    0.00000 
#> max values :   34.61813,   85.85148, 5661.00000,    0.90500

A raster object with 4 layers is returned: the first layer corresponds to the imputed p95, the second layer to the imputed cover, and the third and fourth layers to the ID of the nearest reference observation from which the response variables were imputed and the corresponding nearest neighbor distance.

plot(Y_imputed$p95)

plot(Y_imputed$cover)

plot(Y_imputed$nnID1,col=rainbow(length(unique(Y_imputed$nnID1))))

plot(Y_imputed$nnDist1)