Introduction
DNA metabarcoding studies are typically based on a succession of experimental steps governed by important methodological choices (Zinger et al. 2019). These include a) the definition of sampling design and selection of sampling sites (Dickie et al. 2018), b) the approach used for the preservation of the starting material (Tatangelo et al. 2014, Guerrieri et al. 2021), c) the protocol used for DNA extraction (Taberlet et al. 2012, Eichmiller et al. 2016, Zinger et al. 2016, Lear et al. 2018, Capo et al. 2021), d) the selection of appropriate primers to amplify a taxonomically-informative genomic region (Elbrecht et al. 2016, Fahner et al. 2016, Ficetola et al. 2021), e) the strategy adopted for DNA amplification and high-throughput sequencing of amplicons (Nichols et al. 2018, Taberlet et al. 2018, Bohmann et al. 2022), f) the pipeline selected for bioinformatics analyses (Boyer et al. 2016, Calderón-Sanou et al. 2020, Capo et al. 2021, Couton et al. 2021, Macher et al. 2021, Mächler et al. 2021), and g) the statistical approach used to translate metabarcoding data into ecological information (Paliy and Shankar 2016, Chen and Ficetola 2020). Each of these methodological choices can heavily influence the reliability and interpretation of results (Alberdi et al. 2018, Zinger et al. 2019), and there is thus a critical need for the development, proper assessment and optimization of methods specially dedicated to DNA metabarcoding.
When analyzing metabarcoding data, bioinformatic pipelines generally produce a list of detected sequences, that can be assigned to a given taxon with a more or less precise taxonomic resolution. However, the number of unique sequences obtained after bioinformatic treatment is generally much higher than the number of taxa actually present in the sample (Calderón-Sanou et al. 2020, Mächler et al. 2021). This stems from multiple reasons including genuine intraspecific diversity of the selected markers and errors occurring during the amplification or sequencing steps. Consequently, sequence clustering approaches are often used to collapse very similar sequences into one single Molecular Operational Taxonomic Unit (MOTU), which does not necessarily correspond to a species in the traditional sense (Kopylova et al. 2016, Froslev et al. 2017, Bhat et al. 2019, Antich et al. 2021). Sequence clustering can be performed using similarity thresholds, Bayesian approaches, or through single-linkage (Antich et al. 2021). Approaches based on similarity thresholds can have excellent performance and they display several advantages such as flexibility and easy implementation (Kopylova et al. 2016, Wei et al. 2021). However, two key parameters have to be determined a priori when performing clustering based on sequence similarity. The first one is the sequence to be selected as representative of the cluster. In the case of metabarcoding studies, keeping the most abundant sequence of the cluster as the cluster representative is a convenient way of merging sequence variants generated during the PCR or sequencing steps with the original sequence they derive from (Mercier et al. 2013). The second parameter is the similarity threshold (clustering threshold) used to build MOTUS (Clare et al. 2016, Calderón-Sanou et al. 2020, Wei et al. 2021). Choosing this threshold is delicate without prior knowledge of the maker and its intrinsic level of diversity. A too low threshold can collapse different taxa into the same MOTU (over-merging), while a too high threshold can create too many MOTUs (over-splitting) compared to the actual diversity levels (Clare et al. 2016, Roy et al. 2019, Schloss 2021).
Some works suggest that the ecological interpretation of metabarcoding data can be relatively robust to the threshold selected for sequence clustering. For instance, Botnen et al. (2018) used thresholds ranging from 0.87 to 0.99 of sequence similarity to analyze multiple microbial communities, and they obtained community structures highly coherent across thresholds. Nevertheless, levels of alpha diversity can be heavily impacted by the threshold selection. Ideally, the threshold used for clustering would depend on a trade-off between MOTU over-splitting and MOTU over-merging. A growing number of markers are currently being used in metabarcoding studies (Taberlet et al. 2018), with some allowing broad-scale biodiversity assessment but having limited taxonomic resolution (e.g. 18S rDNA primers amplifying all eukaryotes; Guardiola et al. 2015) and others being highly specific to one single class or even family (e.g. Baamrane et al. 2012, Ficetola et al. 2021). Biodiversity surveys generally aim to generate a set of MOTUs that are each associated with a unique taxon, and with all taxa situated at the same level in the taxonomic tree, to facilitate comparisons. In these conditions, optimal clustering thresholds probably strongly differ across markers. One can for example expect high similarity thresholds for highly conserved markers, and lower clustering thresholds for markers showing high intraspecific variability (Kunin et al. 2010, Brown et al. 2015). However, there is limited quantitative assessment of how optimal clustering thresholds vary across markers (but see Alberdi et al. 2018).
In this study, we analyzed sequences from a public database (EMBL) to identify clustering thresholds for different markers and under different criteria. We considered eight metabarcoding markers (Table 1), ranging from generalist ones (e.g. a 16S rDNA-based marker targeting Bacteria and a 18S rDNA-based marker targeting Eukaryota) to more specific markers (e.g. markers specific of earthworms, insects or springtails). We evaluated how clustering thresholds can change for each taxonomic group, depending on the criterion adopted to set the threshold. We used two alternative strategies to identify thresholds, each time with different objectives in mind. First, following a procedure similar to the one adopted in barcoding studies (Meyer and Paulay 2005), we compared the distribution probabilities of sequence similarities among different individuals of the same species and among different species of the same genus to identify thresholds: i ) minimizing the risk that different sequences of the same species are split in different MOTUs (i.e. risk of over-splitting); ii ) minimizing the risk that distinct but related species are clustered in the same MOTU (i.e. risk of over-merging); iii ) balancing the risk of over-splitting and over-merging (Figure 1A). Second, we calculated the over-splitting and over-merging rates of the studied markers for a range of clustering thresholds, to identify values that minimize the two error rates (Figure 1B). We expect that, if researchers want to minimize over-splitting, they should select lower clustering thresholds than if they want to minimize over-merging. Furthermore, we expect higher clustering threshold values for generalist markers compared to markers targeting one class or more restricted taxonomic groups, because of the lower taxonomic resolution and slower evolutionary rate of the former.