In this post, I work through a case study of species distribution modeling, reproducing the analysis of Elith et al. 2008 [1]. For this case study, we model the short-finned eel (Anguilla australis) species using Boosted Regression Trees. This analysis is derived from an assignment for EDS232: Machine Learning in Environmental Science, part of the curriculum for UCSB's Master of Environmental Data Science program.
Boosted regression trees, also known as gradient boosting machines, are machine learning techniques that combine concepts of regression trees (relating responses to predictors using recursive binary splits) and iterative boosting (an adaptive method for combining simple models to improve predictive performance). In this case study, decision trees are built sequentially using extreme gradient boosting to classify the presence or absence of short-finned eel at a given location, with covariates including temperature, slope, rainy days, and more. Elith et al. use the {gbm} package in R, while this analysis follows a {tidymodels} approach.
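To make the boosting idea concrete, here is a minimal sketch (not part of the original analysis) of gradient boosting for a simple regression problem, assuming the {rpart} package: each shallow tree is fit to the residuals of the current ensemble, and its predictions are added in with a small learning rate.
code
# minimal gradient-boosting sketch with squared-error loss (illustrative only)
library(rpart)

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
dat <- data.frame(x = x, y = y)

learn_rate <- 0.1
pred <- rep(mean(y), nrow(dat))  # start from a constant model

for (m in 1:50) {
  dat$resid <- y - pred                                 # residuals of the current ensemble
  tree <- rpart(resid ~ x, data = dat,
                control = rpart.control(maxdepth = 2))  # a shallow "weak" learner
  pred <- pred + learn_rate * predict(tree, dat)        # shrunken additive update
}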
Data
Our data for this analysis, "eel.model.data.csv", were retrieved from the supplemental information of Elith et al. 2008. These data include the following variables:
Figure 1: Table 1 from Elith et al. 2008, displaying the variables included in the analysis.
code
# load required libraries
library(tidyverse)
library(rsample)
library(glmnet)
library(tidymodels)
library(ggplot2)
library(sjPlot)
library(pROC)
library(RColorBrewer)

# read in raw data (done in a hidden chunk in the original; path will vary)
eel_data_raw <- readr::read_csv("eel.model.data.csv")

eel_data <- eel_data_raw |>
  janitor::clean_names()

# remove site number from modeling data
eel_model <- eel_data[, -1]

# convert presence of angaus, downstream obstructions, and fishing methods to factors
eel_model$angaus <- as.factor(eel_model$angaus)
eel_model$ds_dam <- as.factor(eel_model$ds_dam)
eel_model$method <- as.factor(eel_model$method)

tab_df(eel_data[1:5, ], title = "Table 1")
Table 1

| site | angaus | seg_sum_t | seg_t_seas | seg_low_flow | ds_dist | ds_max_slope | us_avg_t | us_rain_days | us_slope | us_native | ds_dam | method | loc_sed |
|------|--------|-----------|------------|--------------|---------|--------------|----------|--------------|----------|-----------|--------|----------|---------|
| 1 | 0 | 16.00 | -0.10 | 1.04 | 50.20 | 0.57 | 0.09 | 2.47 | 9.80 | 0.81 | 0 | electric | 4.80 |
| 2 | 1 | 18.70 | 1.51 | 1.00 | 132.53 | 1.15 | 0.20 | 1.15 | 8.30 | 0.34 | 0 | electric | 2.00 |
| 3 | 0 | 18.30 | 0.37 | 1.00 | 107.44 | 0.57 | 0.49 | 0.85 | 0.40 | 0.00 | 0 | spo | 1.00 |
| 4 | 0 | 16.70 | -3.80 | 1.00 | 166.82 | 1.72 | 0.90 | 0.21 | 0.40 | 0.22 | 1 | electric | 4.00 |
| 5 | 1 | 17.20 | 0.33 | 1.00 | 3.95 | 1.15 | -1.20 | 1.98 | 21.90 | 0.96 | 0 | electric | 4.70 |
Split and Resample
We split the data into training and test sets, stratifying by the outcome (angaus) so that both sets preserve the class balance of presences and absences, which makes performance estimates more reliable. We then resample the training set with 10-fold cross-validation, also stratified by the outcome.
code
# set a seed for reproducibility
set.seed(123)

# stratified sampling with the {rsample} package
eel_split <- initial_split(data = eel_model, prop = 0.7, strata = angaus)
eel_train <- training(eel_split)
eel_test <- testing(eel_split)

# 10-fold cross validation, stratified by our outcome variable, angaus
cv_folds <- eel_train |>
  vfold_cv(v = 10, strata = angaus)
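As a quick sanity check (not in the original post), we can confirm that stratification kept the outcome proportions similar across the two splits:
code
# compare class proportions in the training and test sets (optional check)
eel_train |> count(angaus) |> mutate(prop = n / sum(n))
eel_test |> count(angaus) |> mutate(prop = n / sum(n))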
Preprocess
Next, we create a recipe to prepare the data for the XGBoost model. For this analysis, we are interested in predicting the binary outcome variable angaus, which indicates presence or absence of the eel species Anguilla australis.
code
# create a recipe
boost_rec <- recipe(angaus ~ ., data = eel_train) |>
  step_normalize(all_numeric()) |>
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) |>
  prep()

# bake to check recipe
baked_train <- bake(boost_rec, eel_train)
baked_test <- bake(boost_rec, eel_test)
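To verify the preprocessing did what we expect (one-hot columns for the nominal predictors, normalized numeric predictors), a quick peek at the baked training data can help; an optional check, not in the original post:
code
# inspect the processed predictors (optional check)
glimpse(baked_train)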
Tuning XGBoost
Tune Learning Rate
To begin, we tune only the learn_rate parameter.
We create a model specification using the {xgboost} engine for estimation, marking only learn_rate for tuning.
code
# create a model specification
learn_spec <- boost_tree(learn_rate = tune()) |>
  set_engine('xgboost') |>
  set_mode('classification')
Next, we build a grid to tune our model by using a range of learning rate parameter values.
code
# set a grid to tune hyperparameter values
learn_grid <- expand.grid(learn_rate = seq(0.0001, 0.3, length.out = 30))
Then, we define a new workflow and tune the model using the learning rate grid.
code
# define a new workflow
learn_wf <- workflow() |>
  add_model(learn_spec) |>
  add_recipe(boost_rec)

doParallel::registerDoParallel()
set.seed(123)

# tune the model
learn_tune <- learn_wf |>
  tune_grid(cv_folds, grid = learn_grid)

# show the performance of the best models
show_best(learn_tune, metric = 'roc_auc') |>
  tab_df(title = 'Table 2',
         digits = 4,
         footnote = 'Top performing models and their associated estimates for various learning rate parameter values.',
         show.footnote = TRUE)
Table 2

| learn_rate | .metric | .estimator | mean | n | std_err | .config |
|------------|---------|------------|--------|----|---------|----------------------|
| 0.2690 | roc_auc | binary | 0.8524 | 10 | 0.0143 | Preprocessor1_Model27 |
| 0.1862 | roc_auc | binary | 0.8451 | 10 | 0.0086 | Preprocessor1_Model19 |
| 0.2586 | roc_auc | binary | 0.8444 | 10 | 0.0120 | Preprocessor1_Model26 |
| 0.2380 | roc_auc | binary | 0.8442 | 10 | 0.0128 | Preprocessor1_Model24 |
| 0.1759 | roc_auc | binary | 0.8434 | 10 | 0.0129 | Preprocessor1_Model18 |

Top performing models and their associated estimates for various learning rate parameter values.
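The tuning results can also be visualized across the grid; a minimal sketch using {tune}'s autoplot() method (an optional check, not in the original post):
code
# visualize ROC AUC across the learning rate grid (optional check)
autoplot(learn_tune, metric = 'roc_auc')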
code
# save the best model parameters to be used in future tuning
best_learnrate <- as.numeric(show_best(learn_tune, metric = 'roc_auc')[1, 1])
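As an aside, {tune} also provides select_best(), which returns the top parameter combination as a one-row tibble; a sketch of the equivalent extraction (an alternative to the indexing approach above, not the original code):
code
# equivalent extraction with select_best() (alternative sketch)
best_params <- select_best(learn_tune, metric = 'roc_auc')
best_learnrate <- best_params$learn_rate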
Tune Tree Parameters
Following the tuning of the learning rate, we create a new specification that fixes learn_rate at the optimized value from the previous step. Now we shift our focus to tuning the tree parameters.
code
# create a new specification with set learning rate
tree_spec <- boost_tree(learn_rate = best_learnrate,
                        min_n = tune(),
                        tree_depth = tune(),
                        loss_reduction = tune(),
                        trees = 3000) |>
  set_mode('classification') |>
  set_engine('xgboost')
Again, we create a tuning grid, this time utilizing grid_max_entropy() to get a representative sampling of the parameter space.
code
# specify parameters for the tuning grid
tree_params <- dials::parameters(min_n(), tree_depth(), loss_reduction())

# set the tuning grid
tree_grid <- grid_max_entropy(tree_params, size = 20)

# define a new workflow
tree_wf <- workflow() |>
  add_model(tree_spec) |>
  add_recipe(boost_rec)

set.seed(123)
doParallel::registerDoParallel()

# tune the model
tree_tuned <- tree_wf |>
  tune_grid(cv_folds, grid = tree_grid)

# show the performance of the best models
show_best(tree_tuned, metric = 'roc_auc') |>
  tab_df(title = 'Table 3',
         digits = 4,
         footnote = 'Top performing models and their associated estimates for various tree parameter values.',
         show.footnote = TRUE)
Table 3

| min_n | tree_depth | loss_reduction | .metric | .estimator | mean | n | std_err | .config |
|-------|------------|----------------|---------|------------|--------|----|---------|----------------------|
| 3 | 4 | 0.0010 | roc_auc | binary | 0.8572 | 10 | 0.0154 | Preprocessor1_Model12 |
| 6 | 11 | 0.0000 | roc_auc | binary | 0.8544 | 10 | 0.0150 | Preprocessor1_Model10 |
| 4 | 2 | 0.0000 | roc_auc | binary | 0.8528 | 10 | 0.0143 | Preprocessor1_Model14 |
| 7 | 9 | 4.9468 | roc_auc | binary | 0.8492 | 10 | 0.0130 | Preprocessor1_Model04 |
| 4 | 15 | 0.0000 | roc_auc | binary | 0.8472 | 10 | 0.0141 | Preprocessor1_Model11 |

Top performing models and their associated estimates for various tree parameter values.
code
# save the best model parameters to be used for future tuning
best_minn <- as.numeric(show_best(tree_tuned, metric = 'roc_auc')[1, 1])
best_treedepth <- as.numeric(show_best(tree_tuned, metric = 'roc_auc')[1, 2])
best_lossreduction <- as.numeric(show_best(tree_tuned, metric = 'roc_auc')[1, 3])
Tune Stochastic Parameters
We create one final specification, setting the learning rate and tree parameters to the optimal values found in the previous tuning iterations. Our final tuning is of the stochastic parameters: mtry, the number of predictors randomly sampled at each split, and sample_size, the proportion of the training data sampled for each tree.
code
# create a new specification with set learning rate and tree parameters
stoch_tune <- boost_tree(learn_rate = best_learnrate,
                         min_n = best_minn,
                         tree_depth = best_treedepth,
                         loss_reduction = best_lossreduction,
                         trees = 3000,
                         mtry = tune(),
                         sample_size = tune()) |>
  set_mode('classification') |>
  set_engine('xgboost')
We set up a tuning grid, again utilizing grid_max_entropy().
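One wrinkle: unlike the other parameters, mtry()'s upper bound depends on the number of columns in the data, so it must be finalized against the training set before gridding; a quick illustrative check of what finalize() produces (not in the original post):
code
# finalize() fills in mtry's data-dependent upper bound (illustrative check)
finalize(mtry(), eel_train)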
code
# set parameters for the tuning grid
stoch_params <- parameters(finalize(mtry(), eel_train),
                           sample_size = sample_prop())

# set the tuning grid
stoch_grid <- grid_max_entropy(stoch_params, size = 20)

# define a new workflow
stoch_wf <- workflow() |>
  add_model(stoch_tune) |>
  add_recipe(boost_rec)

set.seed(123)
doParallel::registerDoParallel()

# tune the model
stoch_tuned <- stoch_wf |>
  tune_grid(cv_folds, grid = stoch_grid)

# show the performance of the best models
show_best(stoch_tuned, metric = 'roc_auc') |>
  tab_df(title = 'Table 4',
         digits = 4,
         footnote = 'Top performing models and their associated estimates for various stochastic parameter values.',
         show.footnote = TRUE)
Table 4

| mtry | sample_size | .metric | .estimator | mean | n | std_err | .config |
|------|-------------|---------|------------|--------|----|---------|----------------------|
| 3 | 0.9907 | roc_auc | binary | 0.8618 | 10 | 0.0136 | Preprocessor1_Model04 |
| 8 | 0.9930 | roc_auc | binary | 0.8586 | 10 | 0.0129 | Preprocessor1_Model19 |
| 8 | 0.7407 | roc_auc | binary | 0.8574 | 10 | 0.0151 | Preprocessor1_Model18 |
| 2 | 0.8529 | roc_auc | binary | 0.8559 | 10 | 0.0128 | Preprocessor1_Model10 |
| 10 | 0.8646 | roc_auc | binary | 0.8545 | 10 | 0.0137 | Preprocessor1_Model06 |

Top performing models and their associated estimates for various stochastic parameter values.
code
# save the best model parameters to be used for model finalization
best_mtry <- as.numeric(show_best(stoch_tuned, metric = 'roc_auc')[1, 1])
best_samplesize <- as.numeric(show_best(stoch_tuned, metric = 'roc_auc')[1, 2])
Now that we have tuned all of our relevant hyperparameters, we assemble a final workflow and do a final fit.
code
# create a final specification with all set optimized parameters
final_model <- boost_tree(learn_rate = best_learnrate,
                          min_n = best_minn,
                          tree_depth = best_treedepth,
                          loss_reduction = best_lossreduction,
                          trees = 3000,
                          mtry = best_mtry,
                          sample_size = best_samplesize) |>
  set_mode('classification') |>
  set_engine('xgboost')

# define a final workflow
final_wf <- workflow() |>
  add_model(final_model) |>
  add_recipe(boost_rec)

# fit training data on final workflow
final_fit <- final_wf |>
  fit(eel_train)

# fit to the train/test split and extract test-set predictions
final_eel_fit <- last_fit(final_model, angaus ~ ., eel_split)
final_pred <- as.data.frame(final_eel_fit$.predictions)

tab_df(head(final_pred),
       title = 'Table 5',
       digits = 4,
       footnote = 'Predictions of Angaus presence on test data.',
       show.footnote = TRUE)
Table 5

| .pred_0 | .pred_1 | .row | .pred_class | angaus | .config |
|---------|---------|------|-------------|--------|---------------------|
| 0.9999 | 0.0001 | 1 | 0 | 0 | Preprocessor1_Model1 |
| 0.0085 | 0.9915 | 2 | 1 | 1 | Preprocessor1_Model1 |
| 0.9810 | 0.0190 | 3 | 0 | 0 | Preprocessor1_Model1 |
| 0.9468 | 0.0532 | 4 | 0 | 0 | Preprocessor1_Model1 |
| 0.9995 | 0.0005 | 9 | 0 | 0 | Preprocessor1_Model1 |
| 0.2945 | 0.7055 | 10 | 1 | 1 | Preprocessor1_Model1 |

Predictions of Angaus presence on test data.
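A brief aside: {tune} also exposes accessors for last_fit() objects, which may read more cleanly than indexing into the result columns; a sketch equivalent to the extraction above:
code
# equivalent accessor for the test-set predictions (alternative sketch)
final_pred <- collect_predictions(final_eel_fit)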
code
# bind predictions and original data
eel_test_bind <- cbind(eel_test, final_eel_fit$.predictions)

# remove duplicate column
eel_test_bind <- eel_test_bind[, -1]

# compute a confusion matrix
confusion_matrix <- eel_test_bind |>
  yardstick::conf_mat(truth = angaus, estimate = .pred_class)

autoplot(confusion_matrix, type = "heatmap") +
  scale_fill_gradient(low = "#C7E9FB", high = "#084594") +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.title = element_text(size = 14),
        panel.background = element_rect(fill = "#F8F8F8"),
        plot.background = element_rect(fill = "#F8F8F8")) +
  labs(title = "Figure 2: Confusion matrix of predictions on test data.")
Figure 2: Confusion matrix of predictions on test data.
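Beyond the heatmap, {yardstick} can summarize a confusion matrix into a table of derived metrics; a possible follow-up, not in the original post:
code
# sensitivity, specificity, and related metrics from the confusion matrix (optional sketch)
summary(confusion_matrix)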
code
# store accuracy metrics
final_metrics <- final_eel_fit$.metrics

tab_df(final_metrics,
       title = 'Table 6',
       digits = 4,
       footnote = 'Accuracy and Area Under the Receiver Operator Curve (ROC) of the final fit.',
       show.footnote = TRUE)
Table 6

| .metric | .estimator | .estimate | .config |
|----------|------------|-----------|---------------------|
| accuracy | binary | 0.8106 | Preprocessor1_Model1 |
| roc_auc | binary | 0.8192 | Preprocessor1_Model1 |

Accuracy and Area Under the Receiver Operator Curve (ROC) of the final fit.
The final model has an accuracy of 0.81 on the test set, and the ROC area under the curve is 0.82.
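The area under the curve summarizes the full ROC curve, which can itself be plotted from the test-set probabilities; a minimal sketch using {yardstick}, assuming the default event level (i.e. that .pred_0 corresponds to the first factor level):
code
# ROC curve for the test-set predictions (optional sketch, not in the original post)
eel_test_bind |>
  yardstick::roc_curve(truth = angaus, .pred_0) |>
  autoplot()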
Fit the model to evaluation data and compare performance
code
# read in evaluation data
eval_data <- read_csv('eel.eval.data.csv', show_col_types = FALSE) |>
  janitor::clean_names()

# convert presence of angaus, downstream obstructions, and fishing methods to factors
eval_data$angaus <- as.factor(eval_data$angaus_obs)
eval_data$ds_dam <- as.factor(eval_data$ds_dam)
eval_data$method <- as.factor(eval_data$method)

# fit final model to the evaluation dataset
# class predictions
eval_classpred <- final_fit |>
  predict(eval_data)

# probability predictions
eval_probpred <- final_fit |>
  predict(eval_data, type = 'prob')

eval_df <- cbind(eval_classpred, eval_probpred, eval_data)

# accuracy measure
accuracy <- accuracy(eval_df, truth = angaus, estimate = .pred_class)

# roc_auc measure
# roc <- yardstick::roc_auc(eval_df, truth = angaus, estimate = .pred_0)
# metrics <- rbind(accuracy, roc)
# tab_df(metrics,
#        title = 'Table 7',
#        digits = 4,
#        footnote = 'Accuracy and Area Under the Receiver Operator Curve (ROC) of the model fit to evaluation data.',
#        show.footnote = TRUE)
How does this model perform on the evaluation data?
The final model, fit to the evaluation data, has an accuracy of 0.82, and the ROC area under the curve is 0.85. For comparison, the model produced by Elith et al. had an ROC area under the curve of 0.858.
References
[1] Elith, J., Leathwick, J.R. and Hastie, T. (2008), A working guide to boosted regression trees. Journal of Animal Ecology, 77: 802-813. https://doi.org/10.1111/j.1365-2656.2008.01390.x