Sequence analyses and taxonomic assignment
To evaluate the current completeness of the online database for the
teleo region of the 12S mitochondrial DNA, an in silico PCR with
3 allowed mismatches using the teleo primer sequences was performed
with ecoPCR ((29) Ficetola et al 2010) on the EMBL database (European
Molecular Biology Laboratory, www.ebi.ac.uk, version 138,
downloaded in January 2019, (30) Baker et al 2000). The generated list
of sequenced species was compared to the checklists of fish species
present in the Bird’s Head of Papua region, kindly provided by
(17) Kulbicki et al. 2013.
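The comparison between the species recovered by the in silico PCR and the regional checklist amounts to a set intersection. The sketch below is only illustrative: the function name, the name normalization and the toy species lists are assumptions, not the data or scripts used in this study.

```python
def checklist_coverage(sequenced_species, checklist_species):
    """Return the checklist species that have a reference sequence,
    and the fraction of the checklist they represent."""
    def normalize(name):
        # Assumed normalization: case-insensitive, single spaces.
        return " ".join(name.strip().lower().split())

    sequenced = {normalize(n) for n in sequenced_species}
    checklist = {normalize(n) for n in checklist_species}
    covered = sequenced & checklist
    fraction = len(covered) / len(checklist) if checklist else 0.0
    return covered, fraction


# Hypothetical toy lists, for illustration only.
sequenced = ["Zanclus cornutus", "Amphiprion ocellaris", "Danio rerio"]
checklist = ["Zanclus cornutus", "Amphiprion ocellaris",
             "Chaetodon kleinii", "Naso lituratus"]

covered, fraction = checklist_coverage(sequenced, checklist)
print(sorted(covered), f"{fraction:.1%}")
```

In practice, synonyms and outdated names would need to be reconciled before the intersection, which this sketch does not attempt.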
The amplified DNA sequences from the water samples were processed
following two metabarcoding workflows. The first workflow used the
OBITools software package ((31) Boyer et al 2016) and relied on direct
taxonomic assignment of the sequences with the ecotag program (lowest
common ancestor algorithm), using the EMBL database as a reference (see details
in Supplementary materials).
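The lowest common ancestor idea can be sketched as follows: when a query sequence matches several reference sequences equally well, it is assigned to the deepest taxon shared by all of them. This is a simplified illustration of the general principle, not the ecotag implementation; the function name and the placeholder lineages are hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages.

    Each lineage is a list of taxon names ordered from root to tip.
    Returns None if the lineages share no taxon (or the list is empty).
    """
    if not lineages:
        return None
    common = None
    # Walk down the ranks in parallel until the lineages diverge.
    for taxa_at_rank in zip(*lineages):
        if all(t == taxa_at_rank[0] for t in taxa_at_rank):
            common = taxa_at_rank[0]
        else:
            break
    return common


# Placeholder lineages (hypothetical names, for illustration only).
sp1 = ["Order_A", "Family_A", "Genus_A", "Genus_A sp1"]
sp2 = ["Order_A", "Family_A", "Genus_A", "Genus_A sp2"]
sp3 = ["Order_A", "Family_B", "Genus_B", "Genus_B sp1"]

print(lowest_common_ancestor([sp1, sp2]))       # Genus_A
print(lowest_common_ancestor([sp1, sp2, sp3]))  # Order_A
```

A query matching only one reference lineage is assigned to that species, while equally good matches in different families are resolved only at the order level.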
A total of 394 fish species from the Bird’s Head region had a reference
sequence in the database (24.5%, Suppl. Table 1). The selection of
similarity thresholds for taxonomic assignment must be based on the
length of the barcode and its intra-taxonomic variability. We tested the
resolution of the marker by running an in silico PCR on all fish
mitochondrial DNA sequences present in the EMBL online database
(downloaded on 20 January 2019).
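The principle of an in silico PCR with a mismatch allowance can be sketched as below. This is a minimal illustration and not ecoPCR itself: the mismatch limit is assumed to apply to each primer separately, only uppercase A/C/G/T is handled, and the primers and template are short hypothetical sequences rather than the teleo primers.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def count_mismatches(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_sites(template, primer, max_mismatches):
    """Start positions where the primer matches within the mismatch limit."""
    k = len(primer)
    return [i for i in range(len(template) - k + 1)
            if count_mismatches(template[i:i + k], primer) <= max_mismatches]


def in_silico_pcr(template, forward, reverse,
                  max_mismatches=3, min_len=1, max_len=200):
    """Return the barcodes (primers excluded) found between a forward
    primer site and a downstream reverse primer site."""
    reverse_site = reverse_complement(reverse)
    barcodes = []
    for start in find_sites(template, forward, max_mismatches):
        barcode_start = start + len(forward)
        for end in find_sites(template, reverse_site, max_mismatches):
            if min_len <= end - barcode_start <= max_len:
                barcodes.append(template[barcode_start:end])
    return barcodes


# Hypothetical primers and template, for illustration only.
forward, reverse = "ACGTAC", "GGATCA"
template = "TT" + "ACGTAC" + "GGGCCCAAA" + "TGATCC" + "AA"
print(in_silico_pcr(template, forward, reverse, max_mismatches=0))

# The same template with one substitution in the forward primer site.
mutated = "TT" + "ACGTAT" + "GGGCCCAAA" + "TGATCC" + "AA"
print(in_silico_pcr(mutated, forward, reverse, max_mismatches=0))  # []
```

With primers as short as these toy ones, a limit of 3 mismatches would match almost anywhere; the allowance is only meaningful for primers of realistic length.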
All amplified sequences were aligned using the Clustal W algorithm ((32)
Larkin et al 2007) and their percentage identity was calculated using
Geneious R6.1.8 ((33) Kearse et al 2012). The analysis of these
alignments supports the following thresholds, which yield few false
assignments at the corresponding taxonomic levels: 98-100%, 90-98%,
85-90% and 80-85% bp similarity to assign species, genus, family and
order, respectively. All sequences with an assignment similarity lower
than 80% were discarded from the analyses.
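These thresholds can be written as a simple lookup. The sketch below is illustrative only; because the stated ranges share their boundary values, it assumes that a similarity falling exactly on a boundary (e.g. 98%) is assigned to the finer taxonomic level.

```python
# (minimum similarity in %, taxonomic level), from finest to coarsest.
THRESHOLDS = [
    (98.0, "species"),
    (90.0, "genus"),
    (85.0, "family"),
    (80.0, "order"),
]


def rank_from_similarity(similarity_pct):
    """Return the finest taxonomic level supported by the similarity,
    or None if the sequence should be discarded (below 80%)."""
    for minimum, rank in THRESHOLDS:
        if similarity_pct >= minimum:  # assumption: boundaries go to the finer level
            return rank
    return None


for value in (99.2, 93.0, 87.5, 80.0, 79.9):
    print(value, rank_from_similarity(value))
```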
The second metabarcoding workflow was based on the SWARM clustering
algorithm that groups multiple variants of sequences into OTUs
(Operational Taxonomic Units, (12) Mahé et al 2014, see details in
Supplementary materials).
The SWARM clustering workflow was used to investigate the taxa present
in the samples but not revealed by the taxonomic assignment process
because of gaps in the EMBL database. The number of taxa assigned in
each family was corrected to avoid redundant taxonomic assignments.
For instance, the combined assignments to the genus Zanclus and
the species Zanclus cornutus were considered as one taxon, as
potential PCR errors may have produced two different assignment levels
from the same sequence. These corrected numbers of taxa were then
compared to the number of OTUs from the SWARM workflow in each family to
evaluate the magnitude of the diversity missed by the direct assignment
method. In the SWARM workflow, a family-level assignment was also
performed to remove non-fish taxa resulting from nonspecific
amplifications and to investigate intra-family diversity.
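The redundancy correction and the per-family comparison with OTU counts can be sketched as follows. The rule used here is one possible reading of the Zanclus example (a genus-level assignment is dropped when a species of that genus is also assigned in the same family); the function names and the toy counts are hypothetical and do not come from the study.

```python
def corrected_taxa(assignments):
    """Collapse redundant assignments within one family.

    `assignments` is an iterable of (rank, name) pairs, where rank is
    "species" or "genus". A genus is treated as redundant when at least
    one species of that genus is also assigned (assumed rule).
    """
    assignments = list(assignments)
    species = {name for rank, name in assignments if rank == "species"}
    genera_with_species = {name.split()[0] for name in species}
    genera = {name for rank, name in assignments
              if rank == "genus" and name not in genera_with_species}
    return species | genera


def missed_diversity(assignments_per_family, otus_per_family):
    """OTU count minus corrected taxon count, for each family."""
    return {
        family: otus_per_family.get(family, 0) - len(corrected_taxa(assigned))
        for family, assigned in assignments_per_family.items()
    }


# Hypothetical toy input, for illustration only.
assignments_per_family = {
    "Zanclidae": [("genus", "Zanclus"), ("species", "Zanclus cornutus")],
}
otus_per_family = {"Zanclidae": 3}

print(corrected_taxa(assignments_per_family["Zanclidae"]))
print(missed_diversity(assignments_per_family, otus_per_family))
```

A positive difference suggests OTUs without a matching reference sequence, although it may also reflect intraspecific variants or PCR and sequencing errors.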