Markers examined and construction of sequence datasets
We focused on a set of eight DNA metabarcoding markers (Bact02, Euka02, Fung02, Sper01, Arth02, Coll01, Inse01, Olig01) targeting different taxonomic groups (Table 1). Four of these markers can be considered generalist, i.e. targeting entire superkingdoms or kingdoms: Bact02 targeting Bacteria; Euka02 targeting Eukaryota; Fung02 targeting Fungi; Sper01 targeting Spermatophyta (seed plants). One marker was intermediate (Arth02, targeting Arthropoda, the most species-rich phylum on Earth). Finally, three were more specific, i.e. targeting groups at the class or subclass level: Coll01 targeting Collembola (springtails); Inse01 targeting Insecta; Olig01 targeting Oligochaeta (earthworms).
For each of these markers, a sequence database was built from EMBL release 140 as follows. An in silico PCR was first carried out with the program ecoPCR (Ficetola et al. 2010) using the corresponding primers (Table S1). Three mismatches per primer were allowed (-e option), and the amplicon length, excluding primers, was restricted (-l and -L options) to the expected length interval (Table S1). The amplified sequences were then filtered to keep only those that belonged to the target taxonomic group, had a taxonomic assignment (i.e. a taxid) at both the species and genus levels, and contained no ambiguous nucleotides. This yielded a working dataset, from which we extracted two sub-datasets. The “within-species” dataset was built by keeping only species for which at least two sequences (identical or not) were available; if more than two sequences were available for a given species, we randomly selected two of them. The “within-genus” dataset was built in the same way at the genus level: only genera with at least two available sequences were kept, and if more than two sequences were available for a given genus, we randomly selected two of them. For some markers (Bact02, Euka02, Fung02, Inse01, Sper01), the within-species dataset and sometimes the within-genus dataset still contained a very large number of sequences (>10,000). To limit computation time for these markers, we randomly selected a subset of 5,000 different taxa, each represented by its two sequences, giving a final dataset of 10,000 sequences. Table S2 summarizes the number of sequences in the different datasets.
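The filtering and subsampling steps described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the record layout (dictionaries with the hypothetical keys `species_taxid`, `genus_taxid`, `in_target` and `sequence`), the function names and the fixed random seed are all assumptions made for illustration, and ambiguous nucleotides are taken here to mean any character other than A, C, G or T.

```python
import random
from collections import defaultdict

UNAMBIGUOUS = set("ACGT")


def filter_amplicons(records):
    """Keep in-target records that have species and genus taxids
    and contain only unambiguous nucleotides."""
    return [
        r for r in records
        if r["in_target"]
        and r["species_taxid"] is not None
        and r["genus_taxid"] is not None
        and r["sequence"]
        and set(r["sequence"].upper()) <= UNAMBIGUOUS
    ]


def build_paired_dataset(records, level_key, max_taxa=5000, seed=0):
    """Build a within-taxon dataset with exactly two sequences per taxon.

    level_key is "species_taxid" (within-species dataset) or
    "genus_taxid" (within-genus dataset).
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible

    # Group records by taxon at the requested level.
    by_taxon = defaultdict(list)
    for r in records:
        by_taxon[r[level_key]].append(r)

    # Keep only taxa represented by at least two sequences.
    eligible = sorted(t for t, recs in by_taxon.items() if len(recs) >= 2)

    # Cap the number of taxa, so the dataset holds at most 2 * max_taxa sequences.
    if len(eligible) > max_taxa:
        eligible = rng.sample(eligible, max_taxa)

    # Draw two sequences at random from each retained taxon.
    dataset = []
    for taxon in eligible:
        dataset.extend(rng.sample(by_taxon[taxon], 2))
    return dataset
```

With the default `max_taxa=5000`, a call such as `build_paired_dataset(filter_amplicons(records), "species_taxid")` returns at most 10,000 sequences, which matches the cap described above. Capping the taxa before drawing the two sequences per taxon is one possible ordering; the text does not specify which order was used.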