Rates of over-merging and over-splitting
For each marker, over-merging and over-splitting rates were evaluated at different clustering thresholds using the within-species dataset described in the paragraph “Markers examined and construction of sequences datasets”. This dataset contains two sequences drawn at random, identical or not, for each of a number of species belonging to the taxonomic group of interest.
For each within-species dataset, clustering was performed using the sumaclust program (Mercier et al. 2013) with the -n option (normalization by alignment length), based on the sequence similarities first calculated using the sumatra program (see above; Mercier et al. 2013). Threshold values (-t option) ranging from 0.90 to 1 in steps of 0.01 were tested for all markers except Coll01 and Olig01, for which wider ranges ([0.70 – 1] and [0.80 – 1], respectively) were selected based on the within-genus and within-species sequence similarity probability distributions determined previously (see Figure 2). Clustered datasets were then explored to calculate five variables at each clustering threshold: 1) the number of clusters; 2) the percentage of MOTUs containing a single species; 3) the percentage of MOTUs containing a single genus; 4) the percentage of species gathered in a single MOTU; 5) the percentage of genera gathered in a single MOTU. Variables 2 and 3 are indicative of appropriate MOTU merging of sequences at the species and genus levels, respectively, while variables 4 and 5 are indicative of appropriate MOTU splitting at the species and genus levels, respectively.
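As an illustration only, the five variables can be computed from a clustering result along the following lines. This is a minimal sketch and not the scripts used in the study: it assumes the sumaclust output has already been parsed into one (MOTU identifier, species, genus) tuple per sequence, and the function name summarize_clustering, the dictionary keys and the toy taxon names are all hypothetical.

```python
from collections import defaultdict


def summarize_clustering(records):
    """Compute the five variables for one clustering threshold.

    records: iterable of (motu_id, species, genus) tuples, one per sequence.
    This input layout is an assumption made for the example.
    """
    species_in_motu = defaultdict(set)   # MOTU -> species it contains
    genera_in_motu = defaultdict(set)    # MOTU -> genera it contains
    motus_of_species = defaultdict(set)  # species -> MOTUs it falls into
    motus_of_genus = defaultdict(set)    # genus -> MOTUs it falls into

    for motu, species, genus in records:
        species_in_motu[motu].add(species)
        genera_in_motu[motu].add(genus)
        motus_of_species[species].add(motu)
        motus_of_genus[genus].add(motu)

    def pct_single(groups):
        # Percentage of keys whose associated set has exactly one member.
        return 100.0 * sum(len(members) == 1 for members in groups.values()) / len(groups)

    return {
        "n_clusters": len(species_in_motu),                            # variable 1
        "pct_motus_single_species": pct_single(species_in_motu),       # variable 2
        "pct_motus_single_genus": pct_single(genera_in_motu),          # variable 3
        "pct_species_single_motu": pct_single(motus_of_species),       # variable 4
        "pct_genera_single_motu": pct_single(motus_of_genus),          # variable 5
    }


# Toy data with made-up taxon names: two sequences per species.
example = [
    ("m1", "Aus bus", "Aus"), ("m1", "Aus bus", "Aus"),  # species kept in one MOTU
    ("m2", "Aus cus", "Aus"), ("m3", "Aus cus", "Aus"),  # species split over two MOTUs
    ("m4", "Dus eus", "Dus"), ("m4", "Dus fus", "Dus"),  # two species merged in one MOTU
]
stats = summarize_clustering(example)
```

In this toy case, four MOTUs are formed; three of them contain a single species (75%), and three of the four species fall into a single MOTU (75%).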
These values were also used to calculate three measures of error. We defined the over-merging rate as 1 minus the proportion of MOTUs containing a single species (variable 2 expressed as a fraction), and the over-splitting rate as 1 minus the proportion of species gathered in a single MOTU (variable 4 expressed as a fraction). The summed error rate was then calculated as the sum of the over-merging and over-splitting rates. Note that this estimate gives equal weight to over-splitting and over-merging.
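The three error measures follow directly from these definitions. The sketch below is illustrative only: it assumes the two input values are given as percentages (0 to 100), and the function name error_rates is hypothetical.

```python
def error_rates(pct_motus_single_species, pct_species_single_motu):
    """Return (over_merging, over_splitting, summed_error) as fractions.

    Both arguments are assumed to be percentages between 0 and 100.
    """
    # Over-merging: share of MOTUs that contain more than one species.
    over_merging = 1.0 - pct_motus_single_species / 100.0
    # Over-splitting: share of species spread over more than one MOTU.
    over_splitting = 1.0 - pct_species_single_motu / 100.0
    # Both errors carry the same weight in the summed rate.
    summed_error = over_merging + over_splitting
    return over_merging, over_splitting, summed_error


# Example: 75% of MOTUs hold a single species, 50% of species sit in a single MOTU.
print(error_rates(75.0, 50.0))  # (0.25, 0.5, 0.75)
```

Because each component lies between 0 and 1, the summed error rate defined this way ranges from 0 to 2.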