In the previous articles you have learned the core functions of SDMtune for training and evaluating a model and displaying its output. In this article you will learn how to perform data-driven variable selection.
We extract 10000 background locations using the function randomPoints() included in the dismo package (we set the seed to have reproducible results). Afterwards we create an SWD object using the prepareSWD() function:
set.seed(25)
bg <- dismo::randomPoints(predictors, 10000)
#> Warning in dismo::randomPoints(predictors, 10000): generated random points =
#> 0.9775 times requested number
bg <- prepareSWD(species = "Bgs", a = bg, env = predictors, categorical = "biome")
#> Extracting predictor information for absence/background locations...
#> Info: 9 absence/background locations are NA for some environmental variables, they are discarded!
The environmental variables we downloaded have a coarse resolution, so the function could extract slightly fewer than 10000 random locations (see the warning message).
With the function plotCor() you can create a heat map showing the degree of correlation among the environmental variables:
plotCor(bg, method = "spearman", cor_th = 0.7)
You can select a different correlation method or set a different correlation threshold. Another useful function is corVar(): instead of creating a heat map, it prints the pairs of correlated variables according to the given method and correlation threshold:
corVar(bg, method = "spearman", cor_th = 0.7)
#>    Var1  Var2      value
#> 1  bio1  bio6  0.9513541
#> 2 bio12 bio16  0.9447559
#> 3  bio6  bio7 -0.8734498
#> 4  bio1  bio8  0.8459649
#> 5 bio16  bio6  0.7471269
#> 6  bio6  bio8  0.7286723
#> 7  bio1  bio7 -0.7119135
#> 8 bio16  bio7 -0.7027568
#> 9  bio1 bio16  0.7023585
As you can see, there are a few pairs of variables with a correlation coefficient greater than 0.7 in absolute value.
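Since corVar() returns a regular data frame with the columns shown above, you can process its output with base R. The following sketch (an illustration, not part of the SDMtune workflow) extracts the names of all variables involved in at least one highly correlated pair:

```r
# List the variables that appear in at least one pair with
# |Spearman correlation| > 0.7; these are candidates for removal.
cor_pairs <- corVar(bg, method = "spearman", cor_th = 0.7)
correlated_vars <- unique(c(as.character(cor_pairs$Var1),
                            as.character(cor_pairs$Var2)))
correlated_vars
```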
There are cases in which a model ranks some environmental variables with a very low contribution, and you may want to remove some of them to reduce the model complexity. SDMtune offers two different strategies, both implemented in the function reduceVar(). We will use the maxnet_model trained with all the variables. Let's first check the permutation importance (we use only one permutation to save time):
varImp(maxnet_model, permut = 1)
#>   Variable Permutation_importance
#> 1     bio1                   56.1
#> 2     bio8                   19.5
#> 3    bio17                    6.3
#> 4    biome                    4.7
#> 5    bio12                    3.5
#> 6     bio5                    3.5
#> 7     bio7                    3.2
#> 8     bio6                    1.6
#> 9    bio16                    1.5
We will use the function reduceVar() only for demonstration purposes. In the first example we want to remove all the environmental variables that have a permutation importance lower than 6%, regardless of whether the model performance decreases. The function removes the lowest-ranked environmental variable, trains a new model and computes a new rank. The process is repeated until all the remaining environmental variables have an importance greater than 6%:
cat("Testing AUC before: ", auc(maxnet_model, test = test))
#> Testing AUC before:  0.8505888
reduced_variables_model <- reduceVar(maxnet_model, th = 6, metric = "auc",
                                     test = test, permut = 1)
#> Removed variables: bio16, bio6, bio7, bio5, bio17, biome
cat("Testing AUC after: ", auc(reduced_variables_model, test = test))
#> Testing AUC after:  0.8520787
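The step-wise strategy above can be sketched as a simple loop. This is a simplified illustration, not the actual reduceVar() implementation; drop_variable() is a hypothetical helper standing in for retraining the model without the given variable:

```r
# Simplified sketch of the first reduceVar() strategy: keep dropping the
# lowest-ranked variable until every remaining one reaches the threshold.
model <- maxnet_model
repeat {
  imp <- varImp(model, permut = 1)               # rank the variables
  worst <- imp[which.min(imp$Permutation_importance), ]
  if (worst$Permutation_importance >= 6) break   # all variables pass
  model <- drop_variable(model, worst$Variable)  # hypothetical helper
}
```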
In the second example we want to remove the environmental variables with a permutation importance lower than 15%, but only if removing them does not decrease the model performance according to the given metric. In this case the function performs a leave-one-out Jackknife test and removes the environmental variables in a step-wise fashion as described in the previous example, but only if the model performance doesn't drop:
cat("Testing AUC before: ", auc(maxnet_model, test = test))
#> Testing AUC before:  0.8505888
reduced_variables_model <- reduceVar(maxnet_model, th = 15, metric = "auc",
                                     test = test, permut = 1, use_jk = TRUE)
#> Removed variables: bio16, bio6, bio5, bio12, biome, bio8
cat("Testing AUC after: ", auc(reduced_variables_model, test = test))
#> Testing AUC after:  0.8533188
As you can see, in this case several variables have been removed and the AUC computed on the testing dataset didn't decrease.
Try it yourself: reduce the number of variables using the model trained with cross validation and the TSS as evaluation metric. Highlight the following cell for the solution:
# You need to pass TRUE to the test argument
selected_variables_model <- reduceVar(cv_model, th = 6, metric = "tss",
                                      test = TRUE, permut = 1)
In this article you have learned:

- how to explore the correlation among environmental variables with plotCor() and corVar();
- how to remove environmental variables with low importance using reduceVar(), with or without checking that the model performance doesn't drop.
In the next article you will learn how to tune the model hyperparameters.