Fig. 2: Random forest model quality assessed with increasing numbers of specimens per species. For each number of specimens, 100 data sets were created by random sampling. The OOB error (y-axis) decreases with increasing number of specimens (x-axis) and begins to plateau. Around 10 specimens per species are therefore generally recommended to obtain a high-quality model.
Standardization of data processing
Individual data processing steps can strongly affect classification results. The effect of varying the processing parameters was evaluated using the RF OOB error as an indicator: for each resulting data set, an RF model was trained and the OOB error recorded (supplementary figure 1). Whereas altering the number of baseline subtraction iterations generally had little impact on the RF OOB error, changing HWS and SNR had greater effects (supplementary figure 1). The GAM shows that the OOB error is significantly influenced by HWS (Table 1, p-value: 0.007) and SNR (Table 1, p-value: 0.001). A combination of 22 baseline estimation iterations, an HWS of 7 and an SNR of 3 resulted in the lowest OOB error (0.032). These settings were used for further analyses.
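The selection procedure described above can be illustrated with a minimal sketch, which is not the implementation used in this study. It assumes a caller-supplied function that processes the spectra with a given parameter combination, trains an RF model and returns its OOB error. The names `grid_search_oob` and `placeholder_oob_error` are hypothetical, and the placeholder returns arbitrary illustrative values rather than measured errors.

```python
import itertools
from typing import Callable, Dict, Iterable, Tuple

# One parameter combination: (baseline iterations, HWS, SNR).
Settings = Tuple[int, int, int]


def grid_search_oob(
    baseline_iterations: Iterable[int],
    hws_values: Iterable[int],
    snr_values: Iterable[int],
    oob_error_for: Callable[[int, int, int], float],
) -> Tuple[Settings, float, Dict[Settings, float]]:
    """Evaluate every parameter combination and return the one with the lowest OOB error."""
    results: Dict[Settings, float] = {}
    for settings in itertools.product(baseline_iterations, hws_values, snr_values):
        # oob_error_for is assumed to reprocess the spectra with these
        # settings, train an RF model and return its OOB error.
        results[settings] = oob_error_for(*settings)
    if not results:
        raise ValueError("the parameter grid is empty")
    best = min(results, key=results.get)
    return best, results[best], results


def placeholder_oob_error(iterations: int, hws: int, snr: int) -> float:
    """Hypothetical stand-in for the real evaluation; the values are arbitrary."""
    return (
        0.05
        + 0.0001 * abs(iterations - 15)
        + 0.002 * abs(hws - 10)
        + 0.004 * abs(snr - 5)
    )


# Example call on an arbitrary grid (not the grid used in the study).
best, best_error, all_results = grid_search_oob(
    baseline_iterations=[5, 15, 25],
    hws_values=[5, 10, 20],
    snr_values=[2, 5, 8],
    oob_error_for=placeholder_oob_error,
)
print(best, best_error)
```

Passing the evaluation as a function keeps the search independent of the spectral processing and RF libraries, so `placeholder_oob_error` can be replaced by any routine that returns an OOB error for a given combination of settings.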
Table 1: Results of the GAM analyses to detect the most important variable for data processing optimization.