Figure captions
Figure 1. Different approaches for identifying the most appropriate clustering thresholds. A) Approaches based on similarities between sequences belonging to different individuals of the same species (blue curve) and similarities between sequences belonging to different species of the same genus (red curve). One can choose to minimize the risk that different sequences from the same species are split into different MOTUs (over-splitting risk; e.g. 10% quantile of the distribution of within-species similarities) or the risk that sequences from different species belonging to the same genus are clustered in the same MOTU (over-merging risk; e.g. 90% quantile of within-genus similarities), or one can try to find a balance between the risks of over-splitting and over-merging (e.g. with the intersection between probability distributions, or the midpoint between the modes of both distributions). B) Approaches based on rates of over-splitting and over-merging. One can compare the over-splitting (blue) and the over-merging (red) rates, and/or one can identify the thresholds minimizing the sum of these rates (violet).
Figure 2. Probability density distributions of pairwise sequence similarities within species (blue lines) and within genera (red lines) for the eight studied markers. For each marker, dotted lines represent the 10% quantile of the within-species probability distribution (blue; threshold limiting over-splitting), the 90% quantile of the within-genus probability distribution (red; threshold limiting over-merging), the intersection of the within-species and within-genus probability distributions (green; balance-a) and the midpoint between modes (black; balance-b).
Figure 3. Different possible clustering thresholds for the eight studied markers, depending on the selected criterion.
Figure 4. Changes in over-splitting and over-merging rates across a range of clustering thresholds, for the eight studied markers. The left y-axes report percentage values; the right y-axes indicate the number of clusters obtained.
Figure 5. Over-splitting (blue) and over-merging (red) rates, as well as the summed error rate (i.e. over-splitting rate + over-merging rate; violet), for the eight studied markers across a range of clustering thresholds.
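As an illustration of the criteria named in the captions of Figures 1, 2 and 5, the quantile-based thresholds and the threshold minimizing the summed error rate could be computed along the following lines. This is a minimal sketch and not the code used to produce the figures: the function names, the linear-interpolation quantile and the input layout (plain lists of pairwise similarities and of error rates) are all assumptions made for the example.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def candidate_thresholds(within_species, within_genus):
    """Quantile-based thresholds from two lists of pairwise similarities.

    The 10% and 90% quantiles follow the examples given in the captions;
    the dictionary keys are hypothetical labels.
    """
    return {
        "limit_over_splitting": quantile(within_species, 0.10),
        "limit_over_merging": quantile(within_genus, 0.90),
    }


def min_summed_error(thresholds, over_splitting, over_merging):
    """Return the threshold with the smallest summed error rate.

    The three lists are assumed to be aligned: element i of each rate
    list was measured at thresholds[i].
    """
    summed = [s + m for s, m in zip(over_splitting, over_merging)]
    best = min(range(len(thresholds)), key=summed.__getitem__)
    return thresholds[best], summed[best]
```

The balance criteria (intersection of the two density curves and midpoint between their modes) additionally require a density estimate and are therefore left out of this sketch.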