Figure captions
Figure 1. Approaches for identifying the most appropriate
clustering thresholds. A) Approaches based on similarities between
sequences belonging to different individuals from the same species (blue
curve), and similarities between sequences belonging to different
species from the same genus (red curve). One can choose to minimize the
risk that different sequences from the same species are split into
different MOTUs (over-splitting risk; e.g. 10% quantile of the
distribution of within-species similarities), the risk that sequences
from different species belonging to the same genus are clustered in the
same MOTU (over-merging risk; e.g. 90% quantile of within-genus
similarities), or one can try to find a balance between the risks of
over-splitting and over-merging (e.g. with the intersection between the
two probability distributions, or the midpoint between the modes of both
distributions). B) Approaches based on rates of over-splitting and
over-merging. One can compare the over-splitting (blue) and the
over-merging (red) rates, and/or one can identify the thresholds
minimizing the sum of these rates (violet).
Figure 2. Probability density distributions of sequence
pairwise similarities within species (blue lines) and within genera (red
lines) for the eight studied markers. For each marker, dotted lines
represent the 10% quantile of the within-species probability
distribution (blue; threshold limiting over-splitting), the 90%
quantile of the within-genus probability distribution (red; threshold
limiting over-merging), the intersection of the within-species and
within-genus probability distributions (green; balance-a) and the
midpoint between modes (black; balance-b).
Figure 3. Different possible clustering thresholds for the
eight studied markers, depending on the selected criterion.
Figure 4. Variation of over-splitting and over-merging rates
across a range of clustering thresholds, for the eight studied markers. The
left y-axes report percentage values; the right y-axes indicate the
number of obtained clusters.
Figure 5. Over-splitting (blue) and over-merging (red) rates,
as well as the summed error rate (i.e. over-splitting rate +
over-merging rate; violet), for the eight studied markers across a range
of clustering thresholds.
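
The threshold criteria named in the captions of Figures 1, 2 and 5 can be illustrated with a minimal Python sketch. It assumes two plain lists of pairwise similarities (within species, within genus); the function names are hypothetical and are not taken from the study. The error rates here are a simple pairwise proxy (share of within-species pairs falling below a threshold, share of within-genus pairs at or above it) and may differ from the clustering-based rates reported in the figures.

```python
from statistics import quantiles


def quantile_thresholds(within_species, within_genus):
    """Return (over-splitting limit, over-merging limit) as similarity values.

    quantiles(..., n=10) yields nine cut points: index 0 is the 10%
    quantile and index 8 is the 90% quantile.
    """
    split_limit = quantiles(within_species, n=10, method="inclusive")[0]
    merge_limit = quantiles(within_genus, n=10, method="inclusive")[8]
    return split_limit, merge_limit


def error_rates(within_species, within_genus, threshold):
    """Pairwise proxy for the over-splitting and over-merging rates (assumption)."""
    # Same-species pairs less similar than the threshold would be split.
    over_split = sum(s < threshold for s in within_species) / len(within_species)
    # Same-genus, different-species pairs at or above the threshold would be merged.
    over_merge = sum(s >= threshold for s in within_genus) / len(within_genus)
    return over_split, over_merge


def min_summed_error(within_species, within_genus, candidates):
    """Return the candidate threshold minimizing over-splitting + over-merging."""
    return min(
        candidates,
        key=lambda t: sum(error_rates(within_species, within_genus, t)),
    )
```

Calling `min_summed_error` over a grid of candidate thresholds corresponds to locating the minimum of the violet curve, under the pairwise simplification stated above.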