DISCUSSION
Sequence clustering approaches are routinely used for the identification
of MOTUs in metabarcoding studies, and they often resort to methods
based on similarity values. Still, selecting a clustering threshold for
a given marker more than often relies on common practices and rules of
thumb rather than on proper scientific argument. By analyzing extensive
sequence data deposited in public databases for a range of generalist
and specialist markers, we showed that different threshold values can be
selected depending on the marker and on the criterion favored by
researchers. All the markers we examined are situated in non-protein
coding genes (Table S1), and this has an influence on levels of sequence
intraspecific diversity. The 10% quantile of the within-species
similarity probability distribution was almost always lower than the
0.97 clustering threshold traditionally used in barcoding for markers
targeting protein-coding genes like COI (Hebert et al. 2003), or for
microbial MOTU delimitation (Bálint et al. 2016), indicating that some
level of over-splitting can occur at this threshold.
Although for all the markers the within-genus similarity values were
generally lower than the within-species similarities, the overlap
between the two distributions was dependent on the generalist vs.
specific nature of the marker. For some specific markers (e.g. Coll01
and Olig01), distinct peaks were visible for the two similarity metrics
(Figure 2). Within-species similarities generally were
>0.90, while within-genus values were lower, frequently
below 0.80. Such a pattern is not unexpected for markers with an
excellent taxonomic resolution and designed to identify taxa at the
species level. Conversely for the generalist markers, within-species and
within-genus similarity probability distributions largely overlapped and
the differences between the peaks were minimal. Nevertheless, even for
these markers, the density of within-species similarity was consistently
higher than that of within-genus similarity at high clustering
thresholds, indicating that the probability of observing the
corresponding similarity value is higher within species than within
genera. In other words, at high clustering thresholds, a MOTU is more
likely to represent a species than a genus. This result is confirmed by
the fact that the percentage of MOTUs containing a single species is
always higher than 50%, whatever the clustering threshold or the marker
considered (Figure 4).
The sequences used as a primary source of information in this study were
downloaded from EMBL, and our results are thus highly dependent on the
quality of the data deposited in this public database. Even though
broad-scale analyses suggest that these data are generally reliable
(Leray et al. 2019), errors in the sequence itself (e.g. wrong
nucleotide, or more complex errors like insertions, deletions,
inversions, duplications or pseudogene sequences) and taxonomic
mislabeling can occur in public sequence databases, especially for
organisms which are difficult to identify based on morphology (Bridge et
al. 2003, Bidartondo 2008, Valkiūnas et al. 2008, Mioduchowska et al.
2018). While the first type of error will affect within-species sequence
similarity negatively, sometimes substantially, the effect of the second
type is more diffuse. For example, in a group like springtails where
species delimitation is tricky (Porco et al. 2012), the existence of
cryptic species will decrease within-species sequence similarity while
increasing over-splitting rates. In a group like Bacteria, type strains
are sometimes entered at the species level in the NCBI (EMBL) taxonomy
(Federhen 2015), leading to an inflation of within-genus similarity and
over-merging rates. In every case though, database errors will make
within-species and within-genus similarities distributions more
difficult to distinguish and clustering thresholds trickier to identify,
thus the over-splitting or over-merging rates reported here could be
artificially higher than in reality.
In this work, we came up with a global measure of the error associated
with a given clustering threshold, that we called the “summed error”.
We calculate it by summing over-splitting and over-merging rates,
assuming both have the same cost for biodiversity studies. However, it
is possible to assign a differential weight to over-splitting and
over-merging. For instance, if the aim is to reach conservative
estimated of alpha diversity (i.e. avoid over-splitting), more weight
can be assigned to over-splitting rate. Conversely, if the aim is to
tease apart closely related species, that differ in their sensitivity to
environmental stressors or in threat levels, one may prefer to avoid
over-merging, particularly when extensive reference databases are
available (Roy et al. 2019, Lopes et al. 2021).
For most of the markers we examined, the summed error approach provided
relatively clear results, and identified a range of threshold values
that minimized the summed error. For instance, for Euka02, the summed
error was relatively low at thresholds between 0.96 and 0.99 (Figure 5),
indicating a good trade-off between over-merging and over-splitting.
Interestingly, this range of values was also highlighted by the analysis
of probability distributions (Figure 3, Table S3). Indeed, 0.96 is the
threshold minimizing over-splitting for Euka02 while 0.99 is the balance
(midpoint) threshold. The consistency of values obtained with very
different approaches supports the robustness of our conclusions.
However, for a few markers, the threshold values minimizing summed error
yielded somewhat less clear patterns. For Fung02, the summed error rate
was rather constant (36-37%) at all the thresholds between 0.91 and
0.98, while it quickly increased for higher clustering thresholds. For
Coll01 and Oligo01, the summed error rate showed multiple minima, some
of which at very low clustering thresholds (Figure 5). In principle,
increasing the threshold value should determine a monotone decrease of
over-merging, and a monotone increase of over-splitting (Figure 1B).
However, at low similarity values this was not always the case (Figure
5). This probably occurs because a very large number of sequences have
pairwise similarities of 0.80-0.85 for these markers (Figure 2), and
this might affect the identification of clusters, with some sequences
clustering together e.g. at 0.85 but not at 0.86 similarity values. We
also note that these similarity values match the ones corresponding to
the intersection between the within-genus and within-species
similarities for these markers (Figure 3). It is also possible that, at
this level of sequence similarity, there is strong uncertainty between
MOTUs representing different hierarchical levels of taxonomy.
Our results provide quantitative data that can help researchers set
their optimal clustering thresholds, and understand the consequences of
choosing low or high threshold values. If a clear minimum exists for the
summed error rate, it probably represents an excellent trade-off between
over-merging and over-splitting. In this sense, a threshold value
ranging from 0.96 to 0.99 is probably appropriate for both Bact02 and
Euka02, while Arth02 should accommodate a slightly higher range
(0.98-0.99) and a fixed threshold of 0.97 seems to be more suitable for
Sper01. For Inse01, lower threshold values (0.94-0.96) are more
judicious. All these values match with those obtained on the basis of
within-species and within-genus similarities (Figure 3). However, for
Coll01, Oligo01 and Fung02, the summed error rate does not provide clear
indications, and within-species and within-genus similarity
distributions (e.g. midpoint between modes) might be more informative to
set the threshold value (Figures 2 and 3).
The selection of clustering thresholds can have strong effect in the
estimates of MOTUs richness (Figure 4), still it is important to
remember that it often does not have a tremendous effect on the
ecological message conveyed by metabarcoding data. For instance, Clare
et al. (2016) examined different clustering thresholds to analyze
dietary overlap between skinks and shrews in Mauritius. Although high
clustering thresholds yielded a larger number of MOTUs, ecological
conclusions remained rather consistent overall. Therefore, provided that
appropriate parameters are considered (e.g. alpha diversity measured
using Hill’s numbers with q > 0 instead of richness, beta
diversity estimates), the interpretation of data can be relatively
robust (Clare et al. 2016, Roy et al. 2019, Calderón-Sanou et al. 2020,
Mächler et al. 2021). Nevertheless, we discourage the blind application
of one single clustering threshold like the classical 0.97, as it can
have very different meaning across markers, and can inflate MOTU
richness for fast-evolving markers. Instead, we advocate the ad-hoc
definition of the most appropriate thresholds, on the basis of research
aims, on the potential costs of over-splitting and over-merging, and on
the features of the studied markers.