Sequence analyses and taxonomic assignment
To evaluate the current completeness of the online database for the teleo region of the 12S mitochondrial DNA, an in silico PCR using the teleo primer sequences, with up to 3 mismatches allowed, was performed with ecoPCR ((29) Ficetola et al 2010) on the EMBL database (European Molecular Biology Laboratory, www.ebi.ac.uk, version 138, downloaded in January 2019; (30) Baker et al 2000). The resulting list of sequenced species was compared to the checklist of fish species present in the Bird’s Head of Papua region, provided courtesy of (17) Kulbicki et al. 2013.
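The comparison between the sequenced species and the regional checklist amounts to a set intersection. The following is a minimal sketch of that step, not the script used in this study; the function name, the name normalisation and the toy species lists are assumptions made for illustration.

```python
def checklist_coverage(sequenced_species, checklist_species):
    """Return the shared species and the fraction of the checklist that is sequenced."""

    def normalise(name):
        # Lower-case and collapse whitespace so formatting differences do not hide a match.
        return " ".join(name.strip().lower().split())

    sequenced = {normalise(s) for s in sequenced_species}
    checklist = {normalise(s) for s in checklist_species}
    shared = sequenced & checklist
    fraction = len(shared) / len(checklist) if checklist else 0.0
    return shared, fraction


# Hypothetical toy lists, not the data of this study.
sequenced = ["Zanclus cornutus", "Amphiprion ocellaris", "Gadus morhua"]
checklist = ["Zanclus cornutus", "amphiprion ocellaris ", "Chaetodon kleinii", "Naso lituratus"]

shared, fraction = checklist_coverage(sequenced, checklist)
print(sorted(shared), f"{fraction:.1%}")
```

Species names are matched as plain strings here, so synonyms or outdated names in either list would need to be reconciled beforehand.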
The DNA sequences amplified from the water samples were processed following two metabarcoding workflows. The first workflow used the OBITools software package ((31) Boyer et al 2016) and relied on direct taxonomic assignment of the sequences with the ecotag program (lowest common ancestor algorithm), using the EMBL database as a reference (see details in Supplementary materials).
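The common-ancestor logic behind this kind of assignment can be illustrated as follows. This is a simplified sketch and not the ecotag implementation: it assumes that the reference sequences matching a query have already been selected, and that each one is represented by a root-to-tip list of taxon names (here truncated to family, genus and species for brevity).

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None if they share nothing.

    Each lineage is a list of taxon names ordered from the highest rank to the lowest.
    """
    if not lineages:
        return None
    shared = None
    # zip() walks the lineages rank by rank and stops at the shortest one.
    for taxa_at_rank in zip(*lineages):
        if all(taxon == taxa_at_rank[0] for taxon in taxa_at_rank):
            shared = taxa_at_rank[0]
        else:
            break
    return shared


# Illustrative lineages for the references matching one hypothetical query.
same_genus = [
    ["Pomacentridae", "Amphiprion", "Amphiprion ocellaris"],
    ["Pomacentridae", "Amphiprion", "Amphiprion percula"],
]
same_family = same_genus + [["Pomacentridae", "Chromis", "Chromis viridis"]]

print(lowest_common_ancestor(same_genus))   # Amphiprion
print(lowest_common_ancestor(same_family))  # Pomacentridae
```

When the matching references belong to several species of one genus, the query is therefore assigned to the genus rather than to any single species.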
A total of 394 species from the Bird’s Head region have been sequenced (24.5%, Suppl table 1). The selection of similarity thresholds for taxonomic assignment must be based on the length of the barcode and its intra-taxon variability. We tested the resolution of the marker by running an in silico PCR on all fish mitochondrial DNA present in the EMBL online database (downloaded on 20 January 2019). All amplified sequences were aligned using the Clustal W algorithm ((32) Larkin et al 2007) and their percentage identity was calculated using Geneious R6.1.8 ((33) Kearse et al 2012). The analysis of these alignments supports the following thresholds, which produce few false assignments at the corresponding taxonomic levels: 98-100%, 90-98%, 85-90% and 80-85% similarity to assign species, genus, family and order, respectively. All sequences with an assignment similarity lower than 80% were discarded from the analyses.
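These thresholds can be expressed as a simple decision rule. The sketch below is illustrative only: the function name is hypothetical, and because the stated ranges share their boundary values (98%, 90%, 85%), it assumes that each boundary belongs to the more precise rank.

```python
def rank_from_similarity(similarity):
    """Map a percentage similarity to the finest taxonomic rank it supports.

    Returns None when the sequence should be discarded (similarity below 80%).
    Assumption: a value on a boundary is given the more precise rank.
    """
    if not 0 <= similarity <= 100:
        raise ValueError("similarity must be a percentage between 0 and 100")
    if similarity >= 98:
        return "species"
    if similarity >= 90:
        return "genus"
    if similarity >= 85:
        return "family"
    if similarity >= 80:
        return "order"
    return None


for value in (99.2, 93.0, 86.5, 81.0, 79.9):
    print(value, rank_from_similarity(value))
```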
The second metabarcoding workflow was based on the SWARM clustering algorithm, which groups multiple sequence variants into operational taxonomic units (OTUs) ((12) Mahé et al 2014; see details in Supplementary materials).
The SWARM clustering workflow was used to investigate the taxa present in the samples but not revealed by the taxonomic assignment process because of gaps in the EMBL database. The number of taxa assigned in each family was corrected to avoid redundant taxonomic assignments. For instance, the combined assignments to the genus Zanclus and the species Zanclus cornutus were considered as one taxon, as PCR errors may have produced two different assignment levels from the same sequence. These corrected numbers of taxa were then compared to the number of OTUs from the SWARM workflow in each family to evaluate the magnitude of the diversity missed by the direct assignment method. In the SWARM workflow, a family-level assignment was also performed to remove non-fish taxa resulting from nonspecific amplifications and to investigate intra-family diversity.
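The redundancy correction and the comparison with OTU counts can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually coded for this study: assignments are represented as hypothetical (family, genus, species) tuples with None for an unresolved species, a genus-level assignment is dropped whenever a species of the same genus is present in the same family, and the OTU counts are invented toy values.

```python
from collections import defaultdict


def corrected_taxa_per_family(assignments):
    """Count taxa per family, merging a genus-level assignment with its species.

    assignments: iterable of (family, genus, species) tuples, species may be None.
    Assignments resolved only to family level (genus is None) are ignored here.
    """
    species = defaultdict(set)             # family -> species-level assignments
    genera_with_species = defaultdict(set)  # family -> genera having a species assigned
    genera_only = defaultdict(set)          # family -> genera assigned at genus level
    for family, genus, species_name in assignments:
        if species_name:
            species[family].add(species_name)
            genera_with_species[family].add(genus)
        elif genus:
            genera_only[family].add(genus)

    counts = {}
    for family in set(species) | set(genera_only):
        # Keep a genus-level taxon only if none of its species was assigned.
        unmatched_genera = genera_only[family] - genera_with_species[family]
        counts[family] = len(species[family]) + len(unmatched_genera)
    return counts


# Hypothetical toy assignments and OTU counts.
assignments = [
    ("Zanclidae", "Zanclus", None),
    ("Zanclidae", "Zanclus", "Zanclus cornutus"),
    ("Pomacentridae", "Amphiprion", "Amphiprion ocellaris"),
    ("Pomacentridae", "Chromis", None),
]
otu_counts = {"Zanclidae": 1, "Pomacentridae": 5}

taxa_counts = corrected_taxa_per_family(assignments)
# Difference between OTUs and assigned taxa, per family.
missed = {family: otu_counts[family] - taxa_counts.get(family, 0) for family in otu_counts}
print(taxa_counts, missed)
```

In this toy example the two Zanclus assignments count as a single taxon, and the difference between OTUs and assigned taxa in each family gives a rough indication of the diversity not captured by direct assignment.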