Clustering thresholds determined from probability distributions
of within-species and within-genus sequence similarities
The probability distributions of within-species and within-genus
sequence similarities differed markedly between the generalist and the
specific markers (Figure 2). For the five markers
targeting a phylum or broader taxonomic groups (Bact02, Euka02, Fung02,
Sper01, and Arth02), the distributions of within-species and
within-genus similarities were rather similar, both showing a mode at
very high similarity values (Figure 2). Fung02 showed a slightly
different pattern, as the within-genus similarities had a very broad
distribution. Conversely, for the more specific markers, the
distributions of sequence similarities were very different, with two
clearly distinct peaks. Within-species similarities remained very high
(mostly above 0.95), while within-genus similarities generally showed
lower values (mode around 0.90 for Inse01, and below 0.80 for Olig01 and
Coll01).
For all markers, criterion i (avoid over-splitting) yielded the
lowest thresholds (Figure 3, Table S3), with very low values for Coll01
and Olig01. Conversely, criterion ii (avoid over-merging) yielded
extremely high values, except for Coll01. For all generalist markers,
avoiding over-merging would require setting clustering thresholds at
0.99 or higher. For Coll01, criterion ii resulted in a rather low
threshold (0.765), because many within-genus comparisons showed very low
similarity values.
Criteria iii-a and iii-b, which seek a balance between
over-merging and over-splitting, yielded somewhat contrasting results
across markers. For the three specific markers (Coll01, Inse01, and
Olig01), the within-genus and within-species similarities showed clearly
distinct peaks (Figure 2). As a consequence, the intersection between
the two curves could effectively represent the point minimizing both
over-merging and over-splitting (see discussion), and the midpoint
between the modes identified similar threshold values. In
contrast, for the generalist markers, the within-species and
within-genus similarities overlapped strongly and had similar modes,
and the density distributions intersected at values lower than
both modes. The midpoint between the modes still identified
threshold values intermediate between the peaks of within-species and
within-genus similarities.
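
The two balance criteria described above can be illustrated with a minimal sketch. It is not the analysis code used for this study: the similarity values are synthetic (drawn to mimic a specific marker with two well-separated peaks), the Gaussian kernel density estimator, its bandwidth, the evaluation grid, and all function names are assumptions made for illustration only.

```python
import math
import random

def gaussian_kde(samples, bandwidth):
    """Return a function evaluating a Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)
    return density

def mode_of(density, grid):
    """Grid point at which the estimated density is highest."""
    return max(grid, key=density)

def midpoint_threshold(dens_species, dens_genus, grid):
    """Midpoint between the within-species and within-genus modes."""
    return 0.5 * (mode_of(dens_species, grid) + mode_of(dens_genus, grid))

def intersection_threshold(dens_species, dens_genus, grid):
    """First crossing of the two densities located between their modes.

    Returns None when the curves do not cross between the modes.
    """
    lo, hi = sorted((mode_of(dens_genus, grid), mode_of(dens_species, grid)))
    between = [x for x in grid if lo <= x <= hi]
    for a, b in zip(between, between[1:]):
        da = dens_species(a) - dens_genus(a)
        db = dens_species(b) - dens_genus(b)
        if da == 0.0:
            return a
        if da * db < 0.0:
            # Linear interpolation of the sign change between a and b.
            return a + (b - a) * da / (da - db)
    return None

# Synthetic similarities (assumed values, not data from the study).
rng = random.Random(0)
clip = lambda v: min(1.0, max(0.0, v))
within_species = [clip(rng.gauss(0.97, 0.01)) for _ in range(300)]
within_genus = [clip(rng.gauss(0.80, 0.04)) for _ in range(300)]

grid = [0.5 + 0.002 * i for i in range(251)]        # 0.500 .. 1.000
dens_sp = gaussian_kde(within_species, bandwidth=0.02)
dens_ge = gaussian_kde(within_genus, bandwidth=0.02)

mode_sp = mode_of(dens_sp, grid)
mode_ge = mode_of(dens_ge, grid)
midpoint = midpoint_threshold(dens_sp, dens_ge, grid)
crossing = intersection_threshold(dens_sp, dens_ge, grid)

print(f"modes: species={mode_sp:.3f}, genus={mode_ge:.3f}")
print(f"midpoint threshold: {midpoint:.3f}")
print(f"intersection threshold: {crossing:.3f}")
```

In this sketch the intersection is searched only between the two modes, so the function returns None when the curves cross elsewhere, for instance below both modes as observed here for the generalist markers. The midpoint, by contrast, is always defined.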