Markers examined and construction of sequence datasets
We focused on a set of eight DNA metabarcoding markers (Bact02, Euka02,
Fung02, Sper01, Arth02, Coll01, Inse01, Olig01) targeting different
taxonomic groups (Table 1). Four of these markers can be considered
generalist, i.e. targeting entire superkingdoms or kingdoms: Bact02
targeting Bacteria; Euka02 targeting Eukaryota; Fung02 targeting Fungi;
Sper01 targeting Spermatophyta (seed plants). One marker was
intermediate (Arth02; targeting arthropods, i.e. the most species-rich
phylum on Earth). Finally, three were more specific, i.e. targeting
groups from classes to subclasses: Coll01 targeting Collembola
(springtails); Inse01 targeting Insecta; Olig01 targeting Oligochaeta
(earthworms).
For each of these markers, a sequence database was built from EMBL
release 140 as follows. An in silico PCR was first carried out by
running the program ecoPCR (Ficetola et al. 2010) using the
corresponding primers (Table S1). Up to three mismatches per primer were
allowed (-e option), and the amplicon length, excluding primers,
was restricted (-l and -L options) to the expected length interval
(Table S1). The amplified sequences were further filtered by keeping
only those belonging to the target taxonomic group, showing a taxonomic
assignment (i.e. taxid) at the species and genus levels and having no
ambiguous nucleotides. This yielded a working dataset, from
which we extracted two sub-datasets. The “within-species” dataset was
built by keeping only species for which at least two sequences
(identical or not) were available; if more than two sequences were
available for a given species, we randomly selected two sequences for
that species. The “within-genus” dataset was built by keeping only
genera for which at least two sequences were available; if more than
two sequences were available for a given genus, we randomly
selected two sequences for that genus. For some markers (Bact02, Euka02,
Fung02, Inse01, Sper01), the within-species dataset and sometimes the
within-genus dataset still contained a very large number of sequences
(>10,000). To limit computation time for these markers, we
randomly selected a subset of 5,000 taxa (two sequences each), giving a
final dataset of 10,000 sequences. Table S2 summarizes the number of
sequences in the different datasets.
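
The filtering and subsampling steps described above can be illustrated with a minimal Python sketch. It is not the code used in this study: the record layout (dictionaries with the keys `species_taxid`, `genus_taxid`, `sequence` and `in_target`), the function names and the fixed random seed are assumptions made for illustration, and the parsing of ecoPCR output is left out.

```python
import random
from collections import defaultdict

VALID_BASES = set("ACGT")


def filter_records(records):
    """Keep in-target records with species and genus taxids and no ambiguous bases.

    Each record is a dict with hypothetical keys: 'species_taxid',
    'genus_taxid', 'sequence' and 'in_target' (True if the sequence
    belongs to the target taxonomic group).
    """
    return [
        r for r in records
        if r["in_target"]
        and r["species_taxid"] is not None
        and r["genus_taxid"] is not None
        and r["sequence"]
        and set(r["sequence"].upper()) <= VALID_BASES
    ]


def sample_pairs(records, key, max_taxa=5000, seed=0):
    """Return two randomly chosen sequences per taxon.

    'key' is 'species_taxid' (within-species dataset) or 'genus_taxid'
    (within-genus dataset). Taxa with fewer than two sequences are
    dropped. If more than 'max_taxa' taxa remain, a random subset of
    'max_taxa' taxa is kept, so the result holds at most 2 * max_taxa
    sequences.
    """
    rng = random.Random(seed)  # fixed seed: an assumption, for reproducibility

    # Group records by taxon.
    by_taxon = defaultdict(list)
    for r in records:
        by_taxon[r[key]].append(r)

    # Keep only taxa represented by at least two sequences.
    eligible = sorted(t for t, recs in by_taxon.items() if len(recs) >= 2)

    # Cap the number of taxa to limit computation time.
    if len(eligible) > max_taxa:
        eligible = rng.sample(eligible, max_taxa)

    # Draw exactly two sequences per retained taxon.
    return {t: rng.sample(by_taxon[t], 2) for t in eligible}
```

With the default `max_taxa=5000`, a marker with more than 5,000 eligible taxa yields exactly 10,000 sequences, which matches the cap described above; calling `sample_pairs(filter_records(records), "species_taxid")` and `sample_pairs(filter_records(records), "genus_taxid")` would give the two sub-datasets.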