Rates of over-splitting and over-merging
For all markers, whatever the clustering threshold examined (values ≥ 0.70 for Coll01, ≥ 0.80 for Olig01 and ≥ 0.90 for the other markers), the percentage of MOTUs containing a single species was higher than 50%, and that of MOTUs containing a single genus was higher than or close to 70% (Figure 4). Overall, for the generalist and intermediate markers, these two percentages increased steadily with the clustering threshold, and for the specific markers, they approached 100% at high thresholds. Unsurprisingly, the two percentages tended to be lower for the generalist markers than for the specific markers at a given threshold, indicating that the former are more prone to over-merging. Fung02 was a notable exception, since about 87% and 97% of MOTUs contained a single species and a single genus, respectively, at the 0.97 threshold, which is a frequently adopted clustering threshold for fungal ITS sequences. These values were comparable to those observed for the specific markers, for which > 85% and > 98% of MOTUs contained a single species and a single genus, respectively, for thresholds ≥ 0.95.
For all markers, the percentages of species and genera gathered in a single MOTU both decreased at a similar rate as the clustering threshold increased, generally with a sharp drop at high thresholds (≥ 0.98; Figure 4). However, the pattern of MOTU splitting was less characteristic of generalist vs. specific markers. For some markers (Euka02, Sper01, Arth02, Inse01), the percentage of species or genera gathered in a single MOTU remained higher than or close to 50% up to high thresholds (0.98). In contrast, for Bact02, Fung02, Coll01 and Olig01, these percentages dropped quickly as the clustering threshold increased, indicating that these markers are susceptible to over-splitting.
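The two families of percentages reported above can be sketched as follows. This is a minimal illustration only, assuming that each reference sequence carries a MOTU identifier and a taxon label (species or genus); the function name `motu_purity_and_cohesion` and the toy records are hypothetical and are not taken from the analysis pipeline.

```python
from collections import defaultdict


def motu_purity_and_cohesion(assignments):
    """Summarise a clustering at one threshold.

    assignments: iterable of (motu_id, taxon) pairs, one per sequence
    (hypothetical input format). Returns a tuple:
      - percentage of MOTUs containing a single taxon
        (low values suggest over-merging),
      - percentage of taxa gathered in a single MOTU
        (low values suggest over-splitting).
    """
    taxa_per_motu = defaultdict(set)
    motus_per_taxon = defaultdict(set)
    for motu, taxon in assignments:
        taxa_per_motu[motu].add(taxon)
        motus_per_taxon[taxon].add(motu)
    if not taxa_per_motu:
        raise ValueError("no assignments provided")

    n_pure = sum(len(taxa) == 1 for taxa in taxa_per_motu.values())
    n_whole = sum(len(motus) == 1 for motus in motus_per_taxon.values())
    pct_single_taxon_motus = 100.0 * n_pure / len(taxa_per_motu)
    pct_single_motu_taxa = 100.0 * n_whole / len(motus_per_taxon)
    return pct_single_taxon_motus, pct_single_motu_taxa


# Toy records, invented for illustration (not data from this study).
toy = [
    ("m1", "sp_A"), ("m1", "sp_A"),  # sp_A is split over m1 and m2
    ("m2", "sp_A"),
    ("m3", "sp_B"), ("m3", "sp_C"),  # m3 merges sp_B and sp_C
    ("m4", "sp_D"), ("m5", "sp_D"),  # sp_D is split over m4 and m5
]
purity, cohesion = motu_purity_and_cohesion(toy)
print(purity, cohesion)
```

Running the same function with genus labels in place of species labels would give the genus-level percentages.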
For all markers, the number of clusters generally increased steadily with the clustering threshold up to 0.97-0.98 (Figure 4), then rose sharply up to a threshold of 1 (although this rise was less obvious for Euka02 and Olig01). For example, for Bact02, the number of clusters more than doubled between 0.97 (2862 clusters) and 1 (6461 clusters).
Our results showed clear patterns for over-merging and over-splitting rates, with over-splitting quickly increasing and over-merging quickly decreasing at high clustering thresholds (Figure 5). For several markers, the summed error showed a relatively clear minimum at specific clustering thresholds (Figure 5): 0.96-0.99 for Bact02 and Euka02, 0.97-0.99 for Arth02, 0.94-0.96 for Inse01, and 0.96-0.98 for Sper01. The minimum was much less evident for Fung02, Coll01 and Olig01, which showed relatively similar summed error rates over a broad range of clustering thresholds (Fung02: 0.91-0.98; Coll01: 0.82-0.96, with multiple minima; Olig01: 0.84-0.96, with multiple minima).
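The search for a summed-error minimum can be illustrated with a short sketch. It assumes that the summed error is the plain sum of the over-merging and over-splitting rates at each threshold, and it reports every threshold whose summed error lies within a chosen tolerance of the minimum, which is one possible way to express a flat or broad minimum. The function name, the `tolerance` parameter and the toy rates (in percent) are hypothetical and do not reproduce the values shown in Figure 5.

```python
def summed_error_minimum(rates, tolerance=0):
    """Locate the thresholds that minimise the summed error.

    rates: dict mapping clustering threshold -> (over_merging, over_splitting),
    both expressed in the same unit (here, percent). Assumes the summed
    error is the simple sum of the two rates.
    Returns (summed_error_per_threshold, thresholds_within_tolerance).
    """
    if not rates:
        raise ValueError("no rates provided")
    summed = {t: merging + splitting for t, (merging, splitting) in rates.items()}
    best = min(summed.values())
    near_minimum = sorted(t for t, err in summed.items() if err <= best + tolerance)
    return summed, near_minimum


# Toy rates in percent, invented for illustration (not data from this study).
toy_rates = {
    0.94: (30, 5),
    0.95: (22, 7),
    0.96: (15, 10),
    0.97: (10, 14),
    0.98: (6, 25),
    0.99: (3, 45),
}
summed, strict = summed_error_minimum(toy_rates)
_, broad = summed_error_minimum(toy_rates, tolerance=2)
print(strict, broad)
```

A marker with a sharp minimum would return a short list of thresholds even with a generous tolerance, whereas a marker with a flat summed-error curve would return a long one.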