Rates of over-splitting and over-merging
For all markers and at every clustering threshold examined (values ≥
0.70 for Coll01, ≥ 0.80 for Olig01 and ≥ 0.90 for the other markers),
more than 50% of MOTUs contained a single species, and close to or more
than 70% of MOTUs contained a single genus (Figure 4). Overall, for the
generalist and intermediate markers, these two percentages increased
steadily with the clustering threshold, and for the specific markers,
they approached 100% at high thresholds. Unsurprisingly, the two
percentages tended to be lower for the generalist markers than for the
specific markers at a given threshold, indicating that the former are
more sensitive to over-merging. Fung02 was a notable exception, since
about 87% and 97% of MOTUs contained one single species and one single
genus, respectively, at the 0.97 threshold, which is a frequently
adopted clustering threshold for fungal ITS sequences. These values were
comparable to those observed for the specific markers, for which
> 85% and > 98% of MOTUs contained a single species and a single genus,
respectively, at thresholds ≥ 0.95.
For all markers, the percentages of species and of genera gathered in a
single MOTU decreased at similar rates as the clustering threshold
increased, generally with a sharp drop at high thresholds (≥ 0.98;
Figure 4). However, the pattern of MOTU splitting did not clearly
distinguish generalist from specific markers. For some markers
(Euka02, Sper01, Arth02, Inse01), the percentage of species or genera
gathered in a single MOTU remained above or close to 50% up to high
thresholds (0.98). In contrast, for Bact02, Fung02, Coll01 and Olig01,
these percentages dropped quickly as the clustering threshold
increased, indicating that these markers are susceptible to
over-splitting.
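To make the two families of percentages concrete, the following is a minimal Python sketch of how they could be computed from a table assigning each reference sequence to a taxon and to a MOTU. The function name `purity_and_cohesion`, the input layout and the toy data are illustrative assumptions, not the pipeline used in this study.

```python
from collections import defaultdict

def purity_and_cohesion(assignments):
    """Return two percentages for one marker at one clustering threshold.

    assignments: iterable of (sequence_id, taxon, motu_id) tuples, where
    taxon is a species (or genus) name. This layout is an assumption made
    for illustration only.

    Returns (pct_single_taxon_motus, pct_taxa_in_one_motu):
      - percentage of MOTUs containing a single taxon (low values suggest
        over-merging);
      - percentage of taxa gathered in a single MOTU (low values suggest
        over-splitting).
    """
    taxa_per_motu = defaultdict(set)
    motus_per_taxon = defaultdict(set)
    for _seq_id, taxon, motu in assignments:
        taxa_per_motu[motu].add(taxon)
        motus_per_taxon[taxon].add(motu)

    n_pure = sum(len(taxa) == 1 for taxa in taxa_per_motu.values())
    n_whole = sum(len(motus) == 1 for motus in motus_per_taxon.values())
    return (100 * n_pure / len(taxa_per_motu),
            100 * n_whole / len(motus_per_taxon))

# Hypothetical toy data: m1 merges two species, sp_C is split over m2 and m3.
toy = [
    ("s1", "sp_A", "m1"), ("s2", "sp_A", "m1"), ("s3", "sp_B", "m1"),
    ("s4", "sp_C", "m2"), ("s5", "sp_C", "m3"), ("s6", "sp_D", "m4"),
]
print(purity_and_cohesion(toy))
```

In the toy data, three of the four MOTUs contain a single species and three of the four species fall in a single MOTU, so both percentages are 75%. Running the same function with genus names in place of species names gives the genus-level values.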
For all markers, the number of clusters generally increased steadily
with the clustering threshold up to 0.97-0.98 (Figure 4) and then rose
sharply up to a threshold of 1, although this rise was less pronounced
for Euka02 and Olig01. For example, for Bact02, the number of clusters more than
doubled between 0.97 (2862 clusters) and 1 (6461 clusters).
Our results showed clear patterns for over-merging and over-splitting
rates, with over-splitting quickly increasing and over-merging quickly
decreasing at high clustering thresholds (Figure 5). For several
markers, the summed error showed a relatively clear minimum at specific
clustering thresholds (Figure 5): 0.96-0.99 for Bact02 and Euka02,
0.97-0.99 for Arth02, 0.94-0.96 for Inse01, and 0.96-0.98 for Sper01.
The minimum was much less evident for Fung02, Coll01 and Olig01, as
these markers showed relatively similar summed error rates over a broad
range of clustering thresholds (Fung02: 0.91-0.98; Coll01: 0.82-0.96,
with multiple minima; Olig01: 0.84-0.96, with multiple minima).
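The search for a minimum of the summed error can be sketched as follows. This sketch assumes that the over-merging rate is the complement of the percentage of single-species MOTUs and that the over-splitting rate is the complement of the percentage of species gathered in a single MOTU; these definitions, the function names and the numbers below are hypothetical and are not the values shown in Figure 5.

```python
def summed_error(pct_single_species_motus, pct_species_in_one_motu):
    """Summed error in percentage points, under the assumed definitions:
    over-merging   = 100 - % of MOTUs containing a single species
    over-splitting = 100 - % of species gathered in a single MOTU
    """
    over_merging = 100 - pct_single_species_motus
    over_splitting = 100 - pct_species_in_one_motu
    return over_merging + over_splitting

def best_thresholds(curves, tol=0):
    """Return the thresholds whose summed error is within tol percentage
    points of the minimum.

    curves: dict mapping a clustering threshold to the pair
    (pct_single_species_motus, pct_species_in_one_motu).
    """
    errors = {t: summed_error(*pcts) for t, pcts in curves.items()}
    lowest = min(errors.values())
    return sorted(t for t, err in errors.items() if err <= lowest + tol)

# Hypothetical curve for one marker (made-up numbers, for illustration only).
toy_curves = {
    0.90: (60, 95),
    0.95: (80, 90),
    0.97: (90, 85),
    0.99: (97, 60),
}
print(best_thresholds(toy_curves))         # single best threshold
print(best_thresholds(toy_curves, tol=5))  # near-optimal range
```

A non-zero `tol` returns a range of near-optimal thresholds rather than a single value, which is closer to how flat minima such as those of Fung02, Coll01 and Olig01 would need to be reported.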