3.2 Algorithm comparison
The variation of the performance metrics across the ten simulation runs
is illustrated by the box plots in Figure 4. The small range of each box
plot indicates little sensitivity to pseudo-absence generation across
the ten runs for both seasons. The mean values of the six performance
metrics (accuracy, kappa, sensitivity, specificity, F1 score and TSS)
calculated from the 10-fold cross-validation were high (mean range:
0.81-0.99) for the 14 algorithms for both seasons (Fig. 4), indicating
good predictive performance. Based on these six metrics, the best model
for both seasons was the Random Forest (RF), with values ranging from
0.93 to 0.99.
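All six metrics can be derived from the binary confusion matrix of each
cross-validation fold. The following Python function is a minimal,
illustrative sketch (not the authors' code), assuming presence is coded
as 1 and absence as 0:

```python
import numpy as np

def performance_metrics(y_true, y_pred):
    """Six SDM evaluation metrics from binary presence (1) / absence (0) labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # correctly predicted presences
    tn = np.sum((y_true == 0) & (y_pred == 0))  # correctly predicted absences
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)           # true positive rate
    specificity = tn / (tn + fp)           # true negative rate
    f1 = 2 * tp / (2 * tp + fp + fn)
    tss = sensitivity + specificity - 1    # true skill statistic

    # Cohen's kappa: observed agreement corrected for chance agreement
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (accuracy - p_chance) / (1 - p_chance)

    return {"accuracy": accuracy, "kappa": kappa, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1, "tss": tss}
```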
When comparing the tuned RF and the stacking method for the dry season,
the tuned RF approach had slightly but significantly better performance
metrics than the stacking method for accuracy (mean: 0.978 vs. 0.970),
specificity (mean: 0.980 vs. 0.970), F1 score (mean: 0.978 vs. 0.970)
and TSS (mean: 0.957 vs. 0.941) (Kruskal-Wallis test, p < 0.05;
Fig. 5a,d,e,f). However, no significant difference was observed between
the two methods for the wet season (Kruskal-Wallis test, p > 0.05;
Fig. 5). Given these small differences, both methods were used to
generate predictions of the whales’ potential distribution for the wet
and dry seasons separately.
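The per-metric comparison above can be reproduced with a Kruskal-Wallis
test on the per-run values of any metric. Below is a minimal Python
sketch using scipy.stats.kruskal; the ten accuracy values per method are
hypothetical placeholders, not the study's data:

```python
from scipy.stats import kruskal

# Hypothetical per-run accuracy values for the two approaches
# (ten simulation runs each); real values come from the CV results.
rf_tuned_acc = [0.975, 0.980, 0.978, 0.979, 0.977,
                0.981, 0.976, 0.978, 0.980, 0.977]
stacking_acc = [0.968, 0.972, 0.969, 0.971, 0.970,
                0.973, 0.968, 0.971, 0.970, 0.969]

stat, p_value = kruskal(rf_tuned_acc, stacking_acc)
print(f"H = {stat:.3f}, p = {p_value:.4f}")
# p < 0.05 would indicate a significant difference between the two methods
```

Because the Kruskal-Wallis test compares rank distributions rather than
means, it does not assume normality, which suits bounded metrics such as
accuracy or TSS computed over a small number of runs.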