Introduction
DNA metabarcoding studies are typically based on a succession of
experimental steps governed by important methodological choices (Zinger
et al. 2019). These include a) the definition of sampling design and
selection of sampling sites (Dickie et al. 2018), b) the approach used
for the preservation of the starting material (Tatangelo et al. 2014,
Guerrieri et al. 2021), c) the protocol used for DNA extraction
(Taberlet et al. 2012, Eichmiller et al. 2016, Zinger et al. 2016, Lear
et al. 2018, Capo et al. 2021), d) the selection of appropriate primers
to amplify a taxonomically-informative genomic region (Elbrecht et al.
2016, Fahner et al. 2016, Ficetola et al. 2021), e) the strategy adopted
for DNA amplification and high-throughput sequencing of amplicons
(Nichols et al. 2018, Taberlet et al. 2018, Bohmann et al. 2022), f) the
pipeline selected for bioinformatics analyses (Boyer et al. 2016,
Calderón-Sanou et al. 2020, Capo et al. 2021, Couton et al. 2021, Macher
et al. 2021, Mächler et al. 2021), and g) the statistical approach used
to translate metabarcoding data into ecological information (Paliy and
Shankar 2016, Chen and Ficetola 2020). Each of these methodological
choices can heavily influence the reliability and interpretation of
results (Alberdi et al. 2018, Zinger et al. 2019), and there is thus a
critical need for the development, proper assessment and optimization of
methods specially dedicated to DNA metabarcoding.
When analyzing metabarcoding data, bioinformatic pipelines generally
produce a list of detected sequences, that can be assigned to a given
taxon with a more or less precise taxonomic resolution. However, the
number of unique sequences obtained after bioinformatic treatment is
generally much higher than the number of taxa actually present in the
sample (Calderón-Sanou et al. 2020, Mächler et al. 2021). This stems
from multiple reasons including genuine intraspecific diversity of the
selected markers and errors occurring during the amplification or
sequencing steps. Consequently, sequence clustering approaches are often
used to collapse very similar sequences into one single Molecular
Operational Taxonomic Unit (MOTU), which does not necessarily correspond
to a species in the traditional sense (Kopylova et al. 2016, Froslev et
al. 2017, Bhat et al. 2019, Antich et al. 2021). Sequence clustering can
be performed using similarity thresholds, Bayesian approaches, or
through single-linkage (Antich et al. 2021). Approaches based on
similarity thresholds can have excellent performance and they display
several advantages such as flexibility and easy implementation (Kopylova
et al. 2016, Wei et al. 2021). However, two key parameters have to be
determined a priori when performing clustering based on sequence
similarity. The first one is the sequence to be selected as
representative of the cluster. In the case of metabarcoding studies,
keeping the most abundant sequence of the cluster as the cluster
representative is a convenient way of merging sequence variants
generated during the PCR or sequencing steps with the original sequence
they derive from (Mercier et al. 2013). The second parameter is the
similarity threshold (clustering threshold) used to build MOTUS (Clare
et al. 2016, Calderón-Sanou et al. 2020, Wei et al. 2021). Choosing this
threshold is delicate without prior knowledge of the maker and its
intrinsic level of diversity. A too low threshold can collapse different
taxa into the same MOTU (over-merging), while a too high threshold can
create too many MOTUs (over-splitting) compared to the actual diversity
levels (Clare et al. 2016, Roy et al. 2019, Schloss 2021).
Some works suggest that the ecological interpretation of metabarcoding
data can be relatively robust to the threshold selected for sequence
clustering. For instance, Botnen et al. (2018) used thresholds ranging
from 0.87 to 0.99 of sequence similarity to analyze multiple microbial
communities, and they obtained community structures highly coherent
across thresholds. Nevertheless, levels of alpha diversity can be
heavily impacted by the threshold selection. Ideally, the threshold used
for clustering would depend on a trade-off between MOTU over-splitting
and MOTU over-merging. A growing number of markers are currently being
used in metabarcoding studies (Taberlet et al. 2018), with some allowing
broad-scale biodiversity assessment but having limited taxonomic
resolution (e.g. 18S rDNA primers amplifying all eukaryotes; Guardiola
et al. 2015) and others being highly specific to one single class or
even family (e.g. Baamrane et al. 2012, Ficetola et al. 2021).
Biodiversity surveys generally aim to generate a set of MOTUs that are
each associated with a unique taxon, and with all taxa situated at the
same level in the taxonomic tree, to facilitate comparisons. In these
conditions, optimal clustering thresholds probably strongly differ
across markers. One can for example expect high similarity thresholds
for highly conserved markers, and lower clustering thresholds for
markers showing high intraspecific variability (Kunin et al. 2010, Brown
et al. 2015). However, there is limited quantitative assessment of how
optimal clustering thresholds vary across markers (but see Alberdi et
al. 2018).
In this study, we analyzed sequences from a public database (EMBL) to
identify clustering thresholds for different markers and under different
criteria. We considered eight metabarcoding markers (Table 1), ranging
from generalist ones (e.g. a 16S rDNA-based marker targeting Bacteria
and a 18S rDNA-based marker targeting Eukaryota) to more specific
markers (e.g. markers specific of earthworms, insects or springtails).
We evaluated how clustering thresholds can change for each taxonomic
group, depending on the criterion adopted to set the threshold. We used
two alternative strategies to identify thresholds, each time with
different objectives in mind. First, following a procedure similar to
the one adopted in barcoding studies (Meyer and Paulay 2005), we
compared the distribution probabilities of sequence similarities among
different individuals of the same species and among different species of
the same genus to identify thresholds: i ) minimizing the risk
that different sequences of the same species are split in different
MOTUs (i.e. risk of over-splitting); ii ) minimizing the risk that
distinct but related species are clustered in the same MOTU (i.e. risk
of over-merging); iii ) balancing the risk of over-splitting and
over-merging (Figure 1A). Second, we calculated the over-splitting and
over-merging rates of the studied markers for a range of clustering
thresholds, to identify values that minimize the two error rates (Figure
1B). We expect that, if researchers want to minimize over-splitting,
they should select lower clustering thresholds than if they want to
minimize over-merging. Furthermore, we expect higher clustering
threshold values for generalist markers compared to markers targeting
one class or more restricted taxonomic groups, because of the lower
taxonomic resolution and slower evolutionary rate of the former.