DISCUSSION
Sequence clustering approaches are routinely used for the identification of MOTUs in metabarcoding studies, and they often resort to methods based on similarity values. Still, selecting a clustering threshold for a given marker more than often relies on common practices and rules of thumb rather than on proper scientific argument. By analyzing extensive sequence data deposited in public databases for a range of generalist and specialist markers, we showed that different threshold values can be selected depending on the marker and on the criterion favored by researchers. All the markers we examined are situated in non-protein coding genes (Table S1), and this has an influence on levels of sequence intraspecific diversity. The 10% quantile of the within-species similarity probability distribution was almost always lower than the 0.97 clustering threshold traditionally used in barcoding for markers targeting protein-coding genes like COI (Hebert et al. 2003), or for microbial MOTU delimitation (Bálint et al. 2016), indicating that some level of over-splitting can occur at this threshold.
Although for all the markers the within-genus similarity values were generally lower than the within-species similarities, the overlap between the two distributions was dependent on the generalist vs. specific nature of the marker. For some specific markers (e.g. Coll01 and Olig01), distinct peaks were visible for the two similarity metrics (Figure 2). Within-species similarities generally were >0.90, while within-genus values were lower, frequently below 0.80. Such a pattern is not unexpected for markers with an excellent taxonomic resolution and designed to identify taxa at the species level. Conversely for the generalist markers, within-species and within-genus similarity probability distributions largely overlapped and the differences between the peaks were minimal. Nevertheless, even for these markers, the density of within-species similarity was consistently higher than that of within-genus similarity at high clustering thresholds, indicating that the probability of observing the corresponding similarity value is higher within species than within genera. In other words, at high clustering thresholds, a MOTU is more likely to represent a species than a genus. This result is confirmed by the fact that the percentage of MOTUs containing a single species is always higher than 50%, whatever the clustering threshold or the marker considered (Figure 4).
The sequences used as a primary source of information in this study were downloaded from EMBL, and our results are thus highly dependent on the quality of the data deposited in this public database. Even though broad-scale analyses suggest that these data are generally reliable (Leray et al. 2019), errors in the sequence itself (e.g. wrong nucleotide, or more complex errors like insertions, deletions, inversions, duplications or pseudogene sequences) and taxonomic mislabeling can occur in public sequence databases, especially for organisms which are difficult to identify based on morphology (Bridge et al. 2003, Bidartondo 2008, Valkiūnas et al. 2008, Mioduchowska et al. 2018). While the first type of error will affect within-species sequence similarity negatively, sometimes substantially, the effect of the second type is more diffuse. For example, in a group like springtails where species delimitation is tricky (Porco et al. 2012), the existence of cryptic species will decrease within-species sequence similarity while increasing over-splitting rates. In a group like Bacteria, type strains are sometimes entered at the species level in the NCBI (EMBL) taxonomy (Federhen 2015), leading to an inflation of within-genus similarity and over-merging rates. In every case though, database errors will make within-species and within-genus similarities distributions more difficult to distinguish and clustering thresholds trickier to identify, thus the over-splitting or over-merging rates reported here could be artificially higher than in reality.
In this work, we came up with a global measure of the error associated with a given clustering threshold, that we called the “summed error”. We calculate it by summing over-splitting and over-merging rates, assuming both have the same cost for biodiversity studies. However, it is possible to assign a differential weight to over-splitting and over-merging. For instance, if the aim is to reach conservative estimated of alpha diversity (i.e. avoid over-splitting), more weight can be assigned to over-splitting rate. Conversely, if the aim is to tease apart closely related species, that differ in their sensitivity to environmental stressors or in threat levels, one may prefer to avoid over-merging, particularly when extensive reference databases are available (Roy et al. 2019, Lopes et al. 2021).
For most of the markers we examined, the summed error approach provided relatively clear results, and identified a range of threshold values that minimized the summed error. For instance, for Euka02, the summed error was relatively low at thresholds between 0.96 and 0.99 (Figure 5), indicating a good trade-off between over-merging and over-splitting. Interestingly, this range of values was also highlighted by the analysis of probability distributions (Figure 3, Table S3). Indeed, 0.96 is the threshold minimizing over-splitting for Euka02 while 0.99 is the balance (midpoint) threshold. The consistency of values obtained with very different approaches supports the robustness of our conclusions.
However, for a few markers, the threshold values minimizing summed error yielded somewhat less clear patterns. For Fung02, the summed error rate was rather constant (36-37%) at all the thresholds between 0.91 and 0.98, while it quickly increased for higher clustering thresholds. For Coll01 and Oligo01, the summed error rate showed multiple minima, some of which at very low clustering thresholds (Figure 5). In principle, increasing the threshold value should determine a monotone decrease of over-merging, and a monotone increase of over-splitting (Figure 1B). However, at low similarity values this was not always the case (Figure 5). This probably occurs because a very large number of sequences have pairwise similarities of 0.80-0.85 for these markers (Figure 2), and this might affect the identification of clusters, with some sequences clustering together e.g. at 0.85 but not at 0.86 similarity values. We also note that these similarity values match the ones corresponding to the intersection between the within-genus and within-species similarities for these markers (Figure 3). It is also possible that, at this level of sequence similarity, there is strong uncertainty between MOTUs representing different hierarchical levels of taxonomy.
Our results provide quantitative data that can help researchers set their optimal clustering thresholds, and understand the consequences of choosing low or high threshold values. If a clear minimum exists for the summed error rate, it probably represents an excellent trade-off between over-merging and over-splitting. In this sense, a threshold value ranging from 0.96 to 0.99 is probably appropriate for both Bact02 and Euka02, while Arth02 should accommodate a slightly higher range (0.98-0.99) and a fixed threshold of 0.97 seems to be more suitable for Sper01. For Inse01, lower threshold values (0.94-0.96) are more judicious. All these values match with those obtained on the basis of within-species and within-genus similarities (Figure 3). However, for Coll01, Oligo01 and Fung02, the summed error rate does not provide clear indications, and within-species and within-genus similarity distributions (e.g. midpoint between modes) might be more informative to set the threshold value (Figures 2 and 3).
The selection of clustering thresholds can have strong effect in the estimates of MOTUs richness (Figure 4), still it is important to remember that it often does not have a tremendous effect on the ecological message conveyed by metabarcoding data. For instance, Clare et al. (2016) examined different clustering thresholds to analyze dietary overlap between skinks and shrews in Mauritius. Although high clustering thresholds yielded a larger number of MOTUs, ecological conclusions remained rather consistent overall. Therefore, provided that appropriate parameters are considered (e.g. alpha diversity measured using Hill’s numbers with q > 0 instead of richness, beta diversity estimates), the interpretation of data can be relatively robust (Clare et al. 2016, Roy et al. 2019, Calderón-Sanou et al. 2020, Mächler et al. 2021). Nevertheless, we discourage the blind application of one single clustering threshold like the classical 0.97, as it can have very different meaning across markers, and can inflate MOTU richness for fast-evolving markers. Instead, we advocate the ad-hoc definition of the most appropriate thresholds, on the basis of research aims, on the potential costs of over-splitting and over-merging, and on the features of the studied markers.