Rates of over-merging and over-splitting
For each marker, over-merging and over-splitting rates were evaluated at
different clustering thresholds using the within-species dataset
described in the paragraph “Markers examined and construction of
sequences datasets”. This dataset contains two sequences drawn at
random, identical or not, for each of a number of species belonging to
the taxonomic group of interest.
For each within-species dataset, clustering was performed using the
sumaclust program (Mercier et al. 2013) with the -n option
(normalization by alignment length) based on the sequence similarities
first calculated using the sumatra program (see above; Mercier et
al. 2013). Threshold values (-t option) ranging from 0.90 to 1 in
steps of 0.01 were tested for all markers except Coll01 and Olig01, for
which wider ranges ([0.70 – 1] and [0.80 – 1], respectively)
were selected based on the within-genus and within-species sequence
similarity probability distributions determined previously (see Figure
2). Clustered datasets were then explored to calculate five different
variables at each clustering threshold: 1) the number of clusters; 2)
the percentage of MOTUs containing one single species; 3) the percentage
of MOTUs containing one single genus; 4) the percentage of species
gathered in one single MOTU; 5) the percentage of genera gathered in one
single MOTU. Variables 2 and 3 are indicative of appropriate MOTU
merging of sequences at the species and genus levels, respectively,
while variables 4 and 5 are indicative of appropriate MOTU splitting at
the species and genus levels, respectively.
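As an illustration, the five variables can be computed from a table that assigns each sequence to a MOTU. The sketch below is not the code used in this study: the function name, the record layout (MOTU identifier, species, genus) and the toy taxa are assumptions made for the example, and the parsing of sumaclust output is deliberately left out.

```python
from collections import defaultdict


def motu_variables(records):
    """Compute the five clustering variables for one threshold.

    records: iterable of (motu_id, species, genus) tuples, one per sequence.
    This layout is a hypothetical one chosen for the example.
    """
    species_in_motu = defaultdict(set)   # MOTU -> species it contains
    genera_in_motu = defaultdict(set)    # MOTU -> genera it contains
    motus_of_species = defaultdict(set)  # species -> MOTUs it falls in
    motus_of_genus = defaultdict(set)    # genus -> MOTUs it falls in

    for motu, species, genus in records:
        species_in_motu[motu].add(species)
        genera_in_motu[motu].add(genus)
        motus_of_species[species].add(motu)
        motus_of_genus[genus].add(motu)

    def pct_single(groups):
        # Percentage of keys whose associated set has exactly one member.
        return 100.0 * sum(len(v) == 1 for v in groups.values()) / len(groups)

    return {
        "n_clusters": len(species_in_motu),                       # variable 1
        "pct_motus_single_species": pct_single(species_in_motu),  # variable 2
        "pct_motus_single_genus": pct_single(genera_in_motu),     # variable 3
        "pct_species_single_motu": pct_single(motus_of_species),  # variable 4
        "pct_genera_single_motu": pct_single(motus_of_genus),     # variable 5
    }


# Toy data: two sequences per species, with invented taxon names.
example_records = [
    ("c1", "Aus bus", "Aus"), ("c1", "Aus bus", "Aus"),
    ("c2", "Aus cus", "Aus"), ("c3", "Aus cus", "Aus"),
    ("c4", "Dus eus", "Dus"), ("c4", "Dus eus", "Dus"),
    ("c4", "Fus gus", "Fus"), ("c5", "Fus gus", "Fus"),
]
example_result = motu_variables(example_records)
```

In this toy case, MOTU c4 merges two species from different genera, which lowers variables 2 and 3, while "Aus cus" and "Fus gus" are each split across two MOTUs, which lowers variables 4 and 5.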
These variables were also used to calculate three measures of error. We
defined the over-merging rate as 1 minus the proportion of MOTUs
containing one single species, and the over-splitting rate as 1 minus
the proportion of species gathered in one single MOTU, with both
proportions expressed as fractions between 0 and 1. The summed error
rate was then calculated as the sum of the over-merging and
over-splitting rates. It should be noted that this estimate assigns the
same weight to over-splitting and over-merging.
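The three error measures follow directly from variables 2 and 4. The short sketch below assumes that these variables are supplied as percentages (0 to 100) and converts them to fractions before subtraction; the function name and the input values are hypothetical and serve only to show the arithmetic.

```python
def error_rates(pct_motus_single_species, pct_species_single_motu):
    """Return (over_merging, over_splitting, summed_error) as fractions.

    Inputs are percentages (0-100); this convention is an assumption
    made for the example.
    """
    over_merging = 1.0 - pct_motus_single_species / 100.0
    over_splitting = 1.0 - pct_species_single_motu / 100.0
    # Both error types receive the same weight in the sum.
    summed_error = over_merging + over_splitting
    return over_merging, over_splitting, summed_error


# Hypothetical values: 80% of MOTUs contain a single species and
# 50% of species are gathered in a single MOTU.
over_merging, over_splitting, summed_error = error_rates(80.0, 50.0)
```

Because the two rates are simply added, the summed error rate ranges from 0 (no over-merging and no over-splitting) to 2.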