Variable Selection — varSel • SDMtune

The function performs a data-driven variable selection. Starting from the provided model, it iterates through all the variables in order of decreasing importance (permutation importance or Maxent percent contribution). If a variable is correlated with other variables (according to the given method and threshold), the function performs a jackknife test and, among the correlated variables, removes the one whose removal yields the best-performing model (according to the given metric computed on the training dataset). The process is repeated until none of the remaining variables are highly correlated.
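The selection loop described above can be sketched in base R. This is only an illustration under stated assumptions, not SDMtune's implementation: a logistic glm stands in for the species distribution model, training AUC is the metric, and the helpers fit_model, auc, train_auc, perm_importance and select_vars, as well as the toy variables a, a2 and b, are hypothetical names. The sketch also treats the variable under inspection as a removal candidate together with its correlated partners, which is an assumption.

```r
set.seed(1)
n <- 400
a <- rnorm(n)
b <- rnorm(n)
# Toy predictors: a2 is built to be strongly correlated with a
vars <- data.frame(a = a, a2 = a + rnorm(n, sd = 0.2), b = b)
y <- rbinom(n, 1, plogis(1.5 * a - b))

# Hypothetical stand-in for model training
fit_model <- function(data, y) {
  glm(y ~ ., data = data.frame(y = y, data), family = binomial)
}

# AUC from the rank (Mann-Whitney) formulation
auc <- function(scores, labels) {
  r <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

train_auc <- function(fit, data, y) auc(predict(fit, newdata = data), y)

# Mean drop in training AUC after shuffling one variable, over permut runs
perm_importance <- function(fit, data, y, permut = 10) {
  base <- train_auc(fit, data, y)
  sapply(names(data), function(v) {
    drops <- replicate(permut, {
      shuffled <- data
      shuffled[[v]] <- sample(shuffled[[v]])
      base - train_auc(fit, shuffled, y)
    })
    mean(drops)
  })
}

select_vars <- function(data, y, cor_th = 0.7, permut = 10) {
  keep <- names(data)
  repeat {
    fit <- fit_model(data[keep], y)
    imp <- sort(perm_importance(fit, data[keep], y, permut), decreasing = TRUE)
    cm <- abs(cor(data[keep], method = "spearman"))
    removed <- FALSE
    for (v in names(imp)) {
      # Variables highly correlated with v (v itself is always included)
      group <- keep[cm[v, keep] >= cor_th]
      if (length(group) > 1) {
        # Jackknife: refit without each candidate and score on training data
        jk <- sapply(group, function(g) {
          rest <- setdiff(keep, g)
          train_auc(fit_model(data[rest], y), data[rest], y)
        })
        # Drop the candidate whose removal gives the best model
        keep <- setdiff(keep, names(which.max(jk)))
        removed <- TRUE
        break
      }
    }
    if (!removed) break
  }
  keep
}

selected <- select_vars(vars, y)
print(selected)
```

On this toy data the loop should keep b and exactly one of a and a2, since only that pair exceeds the threshold.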

Usage

varSel(
  model,
  metric,
  bg4cor,
  test = NULL,
  env = NULL,
  method = "spearman",
  cor_th = 0.7,
  permut = 10,
  use_pc = FALSE,
  interactive = TRUE,
  progress = TRUE,
  verbose = TRUE
)

Arguments

model: SDMmodel or SDMmodelCV object.
metric: character. The metric used to evaluate the models, possible values are: "auc", "tss" and "aicc".
bg4cor: SWD object. Background locations used to test the correlation between environmental variables.
test: SWD object. Test dataset used to evaluate the model; not used with the "aicc" metric or with SDMmodelCV objects.
env: rast (a SpatRaster created with terra::rast) containing the environmental variables; used only with "aicc".
method: character. The method used to compute the correlation matrix.
cor_th: numeric. The correlation threshold used to select highly correlated variables.
permut: integer. Number of permutations used to compute the permutation importance.
use_pc: logical, use percent contribution. If TRUE and the model is trained using the Maxent method, the algorithm uses the percent contribution computed by Maxent software to score the variable importance.
interactive: logical. If FALSE the interactive chart is not created.
progress: logical. If TRUE shows a progress bar.
verbose: logical. If TRUE prints informative messages.
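For readers unfamiliar with two of the metric values, the base R sketch below computes an AUC and a maximum TSS from toy scores. These are generic textbook definitions; SDMtune's own implementations may differ in detail, and the helper names auc and max_tss and the simulated scores are hypothetical.

```r
# Toy scores: presences (1) tend to score higher than background (0)
set.seed(7)
labels <- c(rep(1, 50), rep(0, 200))
scores <- c(rnorm(50, mean = 1), rnorm(200, mean = 0))

# AUC from the rank (Mann-Whitney) formulation
auc <- function(scores, labels) {
  r <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# TSS = sensitivity + specificity - 1, maximised over the observed thresholds
max_tss <- function(scores, labels) {
  thresholds <- sort(unique(scores))
  tss <- sapply(thresholds, function(th) {
    pred <- scores >= th
    sens <- mean(pred[labels == 1])
    spec <- mean(!pred[labels == 0])
    sens + spec - 1
  })
  max(tss)
}

print(auc(scores, labels))
print(max_tss(scores, labels))
```

Both values grow as the model separates presences from background more cleanly; a perfectly separating score gives 1 for each.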

Value

The SDMmodel or SDMmodelCV object trained using the selected variables.

Details

An interactive chart showing in real-time the steps performed by the algorithm is displayed in the Viewer pane.

To find highly correlated variables the following formula is used: $$|\text{coeff}| \ge \text{cor\_th}$$
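As a rough illustration of the thresholding step, the base R sketch below flags variable pairs whose absolute Spearman coefficient reaches cor_th. The toy data frame bg_values and its columns x1, x2 and x3 are made up for the example; this is not SDMtune's internal code.

```r
# Toy data: x2 is built to be strongly related to x1, x3 is independent noise
set.seed(42)
x1 <- rnorm(500)
bg_values <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(500, sd = 0.3),
  x3 = rnorm(500)
)

cor_th <- 0.7

# Spearman correlation matrix, mirroring method = "spearman"
cor_mat <- cor(bg_values, method = "spearman")

# Flag pairs whose absolute coefficient reaches the threshold;
# the upper triangle keeps each pair once and skips the diagonal
high <- abs(cor_mat) >= cor_th & upper.tri(cor_mat)
idx <- which(high, arr.ind = TRUE)

correlated_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  coeff = cor_mat[idx]
)
print(correlated_pairs)
```

Only the x1 and x2 pair should be reported here, because x3 is generated independently of the other two.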

Author

Sergio Vignali

Examples

# Acquire environmental variables
files <- list.files(path = file.path(system.file(package = "dismo"), "ex"),
                    pattern = "grd",
                    full.names = TRUE)

predictors <- terra::rast(files)

# Prepare presence and background locations
p_coords <- virtualSp$presence
bg_coords <- virtualSp$background

# Create SWD object
data <- prepareSWD(species = "Virtual species",
                   p = p_coords,
                   a = bg_coords,
                   env = predictors,
                   categorical = "biome")
#> ℹ Extracting predictor information for presence locations
#> ✔ Extracting predictor information for presence locations [20ms]
#> 
#> ℹ Extracting predictor information for absence/background locations
#> ✔ Extracting predictor information for absence/background locations [44ms]
#> 

# Split presence locations into training (80%) and testing (20%) datasets
datasets <- trainValTest(data,
                         test = 0.2,
                         only_presence = TRUE)
train <- datasets[[1]]
test <- datasets[[2]]

# Train a model
model <- train(method = "Maxnet",
               data = train,
               fc = "l")

# Prepare background locations to test the correlation between environmental
# variables; this usually gives a warning message because fewer than 10000
# points can be randomly sampled
bg_coords <- terra::spatSample(predictors,
                               size = 9000,
                               method = "random",
                               na.rm = TRUE,
                               xy = TRUE,
                               values = FALSE)

bg <- prepareSWD(species = "Virtual species",
                 a = bg_coords,
                 env = predictors,
                 categorical = "biome")
#> ℹ Extracting predictor information for absence/background locations
#> ✔ Extracting predictor information for absence/background locations [63ms]
#> 

if (FALSE) { # \dontrun{
# Remove variables with correlation higher than 0.7, using the AUC as the
# evaluation metric; in this example the variable importance is computed as
# permutation importance
vs <- varSel(model,
             metric = "auc",
             bg4cor = bg,
             test = test,
             cor_th = 0.7,
             permut = 1)
vs

# Remove variables with correlation higher than 0.7, using the TSS as the
# evaluation metric; in this example the variable importance is the Maxent
# percent contribution, so a model is first trained with the Maxent method
model <- train(method = "Maxent",
               data = train,
               fc = "l")

vs <- varSel(model,
             metric = "tss",
             bg4cor = bg,
             test = test,
             cor_th = 0.7,
             use_pc = TRUE)
vs

# Remove variables with correlation higher than 0.7, using the AICc as the
# evaluation metric; in this example the variable importance is the Maxent
# percent contribution
vs <- varSel(model,
             metric = "aicc",
             bg4cor = bg,
             cor_th = 0.7,
             use_pc = TRUE,
             env = predictors)
vs
} # }