Fig. 1: Results of the sample preparation test. All graphs show peak
intensity on the y-axis. In A, the x-axis represents the tissue:matrix
ratio in µg per µl. In B, C and D, m/z values (mass-to-charge ratio)
are depicted on the x-axis. A) Maximum intensities as a measure of
quality for the different sample to HCCA matrix ratios assessed for four
species. Additionally, for Cancer pagurus a dilution series
(brown) was carried out. B) Good quality spectrum at a ratio of 3.12 µg
tissue per µl matrix. C) Lower quality spectrum at 0.39 µg
µl⁻¹ showing a high baseline. D) Lower quality
spectrum at 25 µg µl⁻¹ showing stronger noise.
Optimizing the Random Forest model for classification
To apply RF as a classification method, we evaluated how strongly the
number of specimens per species influences model error. Repeated
(n = 100) random sampling of two to eleven specimens was carried out for
species with at least 11 specimens in the data set (n = 20). These data
were then used to create RF models, and the OOB error was assessed as a
quality criterion. Increasing
the number of specimens per species resulted in a decrease of OOB error
(Fig. 2). With only two specimens per species, the OOB error ranged from
0 to 0.375 with a mean error of 0.18 (SD = 0.073). With eleven specimens
per species, the error ranged from 0.005 to 0.036 with a mean error of
0.019 (SD = 0.008). The decrease in OOB error approached saturation
for n > 10. For further analyses, we chose n = 6 because the results
show a strong decrease in both OOB-error variability and maximum OOB
error at this point.
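
The resampling scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names `subsample_per_species` and `oob_error_curve` are hypothetical, the model fitting is left to a user-supplied callable `error_fn` that receives the selected specimen indices and returns an OOB error, and the sample standard deviation is assumed for the SD values.

```python
import random
import statistics
from collections import defaultdict


def subsample_per_species(labels, k, rng):
    """Return indices of k randomly drawn specimens for every species."""
    by_species = defaultdict(list)
    for idx, species in enumerate(labels):
        by_species[species].append(idx)
    chosen = []
    for species in sorted(by_species):
        members = by_species[species]
        if len(members) < k:
            raise ValueError(f"{species} has fewer than {k} specimens")
        # Draw without replacement within the species.
        chosen.extend(rng.sample(members, k))
    return chosen


def oob_error_curve(labels, error_fn, sizes=range(2, 12), repeats=100, seed=0):
    """Summarise the error over repeated subsamples for each sample size.

    error_fn is a hypothetical callable: it takes a list of specimen
    indices, fits a model on those specimens, and returns its OOB error.
    """
    rng = random.Random(seed)
    summary = {}
    for k in sizes:
        errors = [
            error_fn(subsample_per_species(labels, k, rng))
            for _ in range(repeats)
        ]
        summary[k] = {
            "min": min(errors),
            "max": max(errors),
            "mean": statistics.mean(errors),
            "sd": statistics.stdev(errors),  # sample SD (assumption)
        }
    return summary
```

In practice, `error_fn` would wrap an RF implementation. With scikit-learn, for example, one could fit `RandomForestClassifier(oob_score=True)` on the selected spectra and return `1 - model.oob_score_`. Comparing the `max` and `sd` entries across sample sizes is one way to identify the point at which adding specimens yields little further benefit.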