Overcoming incompleteness of genetic reference databases
Environmental DNA metabarcoding has the potential to surpass most classical survey methods for assessing biodiversity in both terrestrial and aquatic systems ((35) Deiner et al 2017). Yet genetic reference databases are often incomplete, especially for species-rich ecosystems such as the Coral Triangle, the global marine biodiversity hotspot ((14) Veron et al 2009). For instance, online 12S rDNA reference databases for the teleo primer region currently cover only 24.5% of fish species in the Bird’s Head Peninsula. Coverage reaches 77.3% for COI (mitochondrial cytochrome c oxidase subunit I), but fish COI primers still perform poorly in comparison to 12S markers ((36) Collins et al 2019).
With around 28% of families, 54% of genera and 76% of species not sequenced for the 12S rDNA teleo primer region, most of the fish diversity in the Bird’s Head Peninsula thus remains hidden from direct assignment. Additionally, sequences in the online reference databases may have been obtained from individuals collected outside the region of interest, which can induce assignment errors due to biogeographically related genetic variation (e.g. (37) Wadrop et al 2016). This lack of sequencing coverage highlights the immense gap to be filled before online databases become exhaustive, while numerous species still remain to be described ((38) Pinheiro et al 2019). This limitation prevents metabarcoding approaches from characterizing entire fish assemblages through direct species assignment. Nevertheless, the taxonomic assignment method reveals the presence of 211 fish species listed in the checklist of coastal fishes of the Bird’s Head Peninsula (Fig. 2a). Conversely, 99 assigned species were absent from this checklist. These 99 detections can be either true presences, which would extend the known distribution of some species and call for a revision of the regional checklist, or false presences due to wrong assignments or possible contamination. For instance, the Atlantic salmon (Salmo salar), probably a lab kit contaminant, was found in our study and removed from the analyses (see Methods). The high number of species found in the samples but absent from the checklist of the Bird’s Head region suggests that inventories of some families are still incomplete. On average, 2.5 detected species per family (± 2.6 SD, Fig. 2b) are missing from the checklist, ranging from 0 to 14 species (Apogonidae). This mismatch can help target future sampling efforts towards particular families and their habitats to complete the regional checklist.
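The comparison between assigned species and the regional checklist amounts to simple set operations. The Python sketch below is illustrative only: the input format (a mapping from species to family), the function names and the choice to include families with zero unlisted detections in the per-family summary are assumptions, not a description of the analysis actually performed. Known contaminants would be removed from the detections beforehand.

```python
from statistics import mean, stdev


def compare_to_checklist(detected, checklist):
    """Split detected species into checklist matches and unlisted detections.

    detected  : dict mapping species name -> family (hypothetical input format)
    checklist : set of species names in the regional checklist
    """
    confirmed = {sp for sp in detected if sp in checklist}
    unlisted = set(detected) - checklist

    # Count unlisted detections per family, keeping families with a count of 0
    # (assumption: every family with at least one detection is summarized).
    per_family = {fam: 0 for fam in detected.values()}
    for sp in unlisted:
        per_family[detected[sp]] += 1
    return confirmed, unlisted, per_family


def summarize(per_family):
    """Return mean, sample SD, minimum and maximum of the per-family counts."""
    counts = list(per_family.values())
    sd = stdev(counts) if len(counts) > 1 else 0.0
    return mean(counts), sd, min(counts), max(counts)
```

Applied to real data, `confirmed` and `unlisted` would correspond to the two groups of detections discussed above, and `summarize` to the per-family mean, SD and range of species missing from the checklist.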
As an alternative to species assignment, OTUs can be used as proxies for species. This approach is commonly applied where the species concept is debatable, as for fungi or unicellular organisms ((39) Pawlowski et al 2018, (40) Lladó Fernández et al 2019), but it has not yet been tested for vertebrates in species-rich ecosystems.
Here, using a conservative and stringent bioinformatic pipeline, we show that OTU diversity is a weak and biased estimator of species diversity, with species-rich families being strongly underrepresented. To overcome this limitation, we propose to rely on OTU accumulation curves, whose asymptotes provide a consistent estimate of regional fish diversity and of fish richness within families: the asymptotes underestimate regional fish species richness, but the bias is highly consistent among families (Fig. 5f). We thus propose to extend this method to taxonomic inventories in poorly sampled ecosystems, such as the deep sea, to estimate diversity at different taxonomic levels.
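As a minimal sketch of how such an asymptote might be obtained, the Python snippet below builds a sample-based OTU accumulation curve by random permutation of the samples and extrapolates it with a Michaelis-Menten model fitted through a double-reciprocal regression. The model choice, the fitting shortcut and the function names (`accumulation_curve`, `michaelis_menten_asymptote`) are assumptions made for illustration and are not necessarily the procedure used in this study (see Methods).

```python
import random


def accumulation_curve(samples, n_perm=100, seed=0):
    """Mean cumulative OTU richness as samples are added in random order.

    samples : list of sets, each holding the OTU identifiers of one sample
    """
    rng = random.Random(seed)
    totals = [0.0] * len(samples)
    for _ in range(n_perm):
        order = samples[:]
        rng.shuffle(order)
        seen = set()
        for i, otus in enumerate(order):
            seen |= otus
            totals[i] += len(seen)
    return [t / n_perm for t in totals]


def michaelis_menten_asymptote(curve):
    """Estimate the asymptote a of S(n) = a * n / (b + n).

    Uses ordinary least squares on the linearized form
    1/S = 1/a + (b/a) * (1/n); this is an illustrative shortcut,
    not a full nonlinear fit. Requires every value in curve to be > 0.
    """
    xs = [1.0 / n for n in range(1, len(curve) + 1)]
    ys = [1.0 / s for s in curve]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    intercept = my - (sxy / sxx) * mx
    if intercept <= 0:
        raise ValueError("curve shows no saturation; asymptote undefined")
    return 1.0 / intercept
```

Running the two functions on the OTUs of each family separately would give per-family asymptotes that could then be compared with checklist richness. A curve that is still rising steeply yields an unstable or undefined asymptote, which is why the sketch raises an error in that case.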