Clustering thresholds determined from probability distributions of within-species and within-genus sequence similarities
The probability distributions of within-species and within-genus sequence similarities differed markedly between the generalist and the specific markers (Figure 2). For the five markers targeting a phylum or a broader taxonomic group (Bact02, Euka02, Fung02, Sper01, and Arth02), the distributions of within-species and within-genus similarities were rather similar, both showing a mode at very high similarity values (Figure 2). Fung02 showed a slightly different pattern, as its within-genus similarities had a very broad distribution. Conversely, for the more specific markers, the two distributions were very different and showed two clearly distinct peaks. Within-species similarities remained very high (mostly above 0.95), while within-genus similarities were generally lower (mode around 0.90 for Inse01, and below 0.80 for Olig01 and Coll01).
For all markers, criterion i (avoid over-splitting) yielded the lowest thresholds (Figure 3, Table S3), with particularly low values for Coll01 and Olig01. Conversely, criterion ii (avoid over-merging) yielded extremely high values, except for Coll01. For all generalist markers, avoiding over-merging would require setting clustering thresholds at 0.99 or higher. For Coll01, criterion ii resulted in a rather low threshold (0.765), because many within-genus comparisons showed very low similarity values.
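As an illustration of why the two criteria pull in opposite directions, the sketch below derives both thresholds from quantiles of the similarity distributions. This section does not restate how criteria i and ii are computed, so the quantile-based reading, the function names, and the 10% tolerance are all assumptions made for this example, not the procedure used in the study.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty sequence, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def over_splitting_threshold(within_species, tolerated=0.10):
    # ASSUMPTION: accept that a fraction `tolerated` of same-species pairs
    # falls below the threshold (and would therefore be split).
    return quantile(within_species, tolerated)


def over_merging_threshold(within_genus, tolerated=0.10):
    # ASSUMPTION: accept that a fraction `tolerated` of same-genus,
    # different-species pairs lies above the threshold (and would be merged).
    return quantile(within_genus, 1.0 - tolerated)


# Made-up similarity values, for illustration only.
within_species = [0.90, 0.95, 0.96, 0.97, 0.98, 0.99, 0.99, 1.0, 1.0, 1.0, 1.0]
within_genus = [0.70, 0.75, 0.80, 0.82, 0.85, 0.88, 0.90, 0.92, 0.95, 0.98, 0.99]

print(over_splitting_threshold(within_species))
print(over_merging_threshold(within_genus))
```

Under this reading, the over-splitting threshold is driven by the lower tail of the within-species similarities and the over-merging threshold by the upper tail of the within-genus similarities, so the first is low and the second high whenever the two distributions overlap at high similarity values.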
Criteria iii-a and iii-b, which seek a balance between over-merging and over-splitting, yielded somewhat contrasting results across markers. For the three specific markers (Coll01, Inse01, and Olig01), the within-genus and within-species similarities showed clearly distinct peaks (Figure 2). As a consequence, the intersection between the two curves could effectively represent the point minimizing both over-merging and over-splitting (see Discussion), and the midpoint between the modes identified rather similar threshold values. In contrast, for the generalist markers, the within-species and within-genus similarities overlapped strongly and had similar modes, so that the density distributions actually intersected at values lower than both modes. For these markers, the midpoint between the modes still identified threshold values intermediate between the peaks of within-species and within-genus similarities.
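The two balance points described above (the intersection of the density curves and the midpoint between their modes) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Gaussian kernel, the bandwidth of 0.01, the grid resolution, the function names, and the simulated similarity values are all assumptions introduced for the example.

```python
import math
import random


def kde_on_grid(samples, grid, bandwidth):
    """Gaussian kernel density estimate of `samples` at each grid point."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]


def balance_thresholds(within_species, within_genus, bandwidth=0.01, steps=500):
    """Return the modes, their midpoint, and all density-curve crossings."""
    grid = [i / steps for i in range(steps + 1)]  # similarities from 0 to 1
    d_sp = kde_on_grid(within_species, grid, bandwidth)
    d_ge = kde_on_grid(within_genus, grid, bandwidth)

    # Mode of each curve: grid point with the highest estimated density.
    mode_sp = grid[d_sp.index(max(d_sp))]
    mode_ge = grid[d_ge.index(max(d_ge))]

    # Crossings: sign changes of (species density - genus density).
    # Grid points where both densities are negligible are skipped so that
    # numerical noise in the far tails is not reported as an intersection.
    floor = 1e-6 * max(max(d_sp), max(d_ge))
    crossings = []
    for i in range(1, len(grid)):
        if max(d_sp[i], d_ge[i]) < floor:
            continue
        before = d_sp[i - 1] - d_ge[i - 1]
        after = d_sp[i] - d_ge[i]
        if (before < 0) != (after < 0):
            crossings.append((grid[i - 1] + grid[i]) / 2.0)

    return {
        "mode_species": mode_sp,
        "mode_genus": mode_ge,
        "midpoint": (mode_sp + mode_ge) / 2.0,
        "crossings": crossings,
    }


# Simulated data loosely mimicking a specific marker (two distinct peaks).
rng = random.Random(0)
species = [min(1.0, rng.gauss(0.97, 0.01)) for _ in range(300)]
genus = [min(1.0, rng.gauss(0.88, 0.03)) for _ in range(300)]
print(balance_thresholds(species, genus))
```

The function returns every crossing rather than a single value because, as noted for the generalist markers, the curves may intersect outside the interval between the two modes; which crossing is meaningful has to be judged against the positions of the modes.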