Overcoming incompleteness of genetic reference databases
Environmental DNA metabarcoding has the potential to surpass most
classical survey methods to assess biodiversity in both terrestrial and
aquatic systems ((35) Deiner et al 2017). Yet, genetic reference
databases are often incomplete, especially for species-rich ecosystems
such as the Coral Triangle, the global marine biodiversity hotspot ((14)
Veron et al 2009). For instance, online 12S rDNA reference databases for
the region amplified by the teleo primers currently cover only 24.5% of
fish species in the Bird’s Head Peninsula. Coverage reaches 77.3% for
COI (mitochondrial cytochrome c oxidase subunit I), but fish COI primers
still perform poorly in comparison to 12S markers ((36) Collins et al
2019).
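The coverage figures above are, in essence, the share of checklist taxa that have at least one reference sequence for a given marker. A minimal sketch of that calculation follows; the function name `reference_coverage` and the toy species sets are illustrative assumptions and do not come from the study's data or pipeline.

```python
def reference_coverage(checklist, reference_db):
    """Percentage of checklist taxa with at least one reference sequence.

    Both arguments are iterables of taxon names (hypothetical inputs);
    the same function works at species, genus or family level.
    """
    checklist = set(checklist)
    if not checklist:
        raise ValueError("checklist is empty")
    covered = checklist & set(reference_db)
    return 100.0 * len(covered) / len(checklist)


# Toy data for illustration only, not the regional checklist.
checklist = {
    "Amphiprion ocellaris",
    "Chromis viridis",
    "Pterois volitans",
    "Cheilodipterus quinquelineatus",
}
refs_12s = {"Amphiprion ocellaris", "Salmo salar"}
refs_coi = {"Amphiprion ocellaris", "Chromis viridis", "Pterois volitans"}

print(reference_coverage(checklist, refs_12s))  # 25.0
print(reference_coverage(checklist, refs_coi))  # 75.0
```

Species present in the reference database but absent from the checklist (here Salmo salar) do not affect the result, since coverage is computed relative to the checklist only.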
With around 28% of families, 54% of genera and 76% of species not
sequenced for the 12S rDNA region targeted by the teleo primers, the
largest part of fish diversity in the Bird’s Head Peninsula thus remains
hidden from direct assignment. Additionally, sequences in online
reference databases may come from individuals collected outside the
region of interest. This can induce assignment errors due to
biogeographically structured genetic variation (e.g. (37) Wadrop et al 2016).
The lack of sequencing coverage highlights the immense gap to be filled
for online databases to be exhaustive, while numerous species still
remain to be described ((38) Pinheiro et al 2019). This limitation
prevents metabarcoding approaches from characterizing entire fish
assemblages through direct species assignment. Yet, the taxa-assignment
method reveals the presence of 211 fish species referenced in the
checklist of coastal fishes in the Bird’s Head Peninsula (Fig. 2a).
Conversely, 99 assigned species were absent from this checklist. These
99 detections may be either true presences, which would extend the known
distribution of some species and call for a revision of the regional
checklist, or false presences due to misassignment or contamination. For
instance, the Atlantic salmon (Salmo salar), probably a lab kit
contaminant, was found in our study and removed from the analyses (see
Methods). The high number of species found in the samples but absent
from the checklist of the Bird’s Head region suggests that inventories
of some families are still incomplete. On average, 2.5 detected species
per family (± 2.6 SD, Fig. 2b) are missing from the checklist, ranging
from 0 to 14 species (Apogonidae). This mismatch makes it possible to
target future sampling efforts towards specific families and their
habitats to complete the regional checklist.
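The comparison between assigned species and the regional checklist amounts to a set difference followed by a per-family tally. The sketch below illustrates the idea on toy data; the function name, the input layout (a species-to-family mapping) and the use of the sample standard deviation are assumptions, since the text does not specify how the SD was computed.

```python
from collections import Counter
from statistics import mean, stdev


def compare_with_checklist(detected, checklist):
    """Split detections into checklist-confirmed and unlisted species.

    detected:  dict mapping species name -> family (hypothetical layout)
    checklist: set of species names in the regional checklist
    Returns (confirmed, unlisted, unlisted_per_family); families with no
    unlisted species are reported with a count of 0.
    """
    confirmed = {sp for sp in detected if sp in checklist}
    unlisted = set(detected) - confirmed
    tally = Counter(detected[sp] for sp in unlisted)
    per_family = {fam: tally.get(fam, 0) for fam in sorted(set(detected.values()))}
    return confirmed, unlisted, per_family


# Toy data for illustration only.
detected = {
    "Amphiprion ocellaris": "Pomacentridae",
    "Chromis viridis": "Pomacentridae",
    "Ostorhinchus aureus": "Apogonidae",
    "Ostorhinchus cyanosoma": "Apogonidae",
    "Pterois volitans": "Scorpaenidae",
}
checklist = {"Amphiprion ocellaris", "Chromis viridis", "Pterois volitans"}

confirmed, unlisted, per_family = compare_with_checklist(detected, checklist)
print(len(confirmed), len(unlisted))  # 3 2
print(per_family)  # {'Apogonidae': 2, 'Pomacentridae': 0, 'Scorpaenidae': 0}

# Mean and sample SD of unlisted species per family.
counts = list(per_family.values())
print(mean(counts), stdev(counts))
```

Families with the highest counts are the natural candidates for targeted sampling; contaminants would need to be removed from `detected` beforehand.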
As an alternative to species assignment, OTUs can be used as proxy units
for species. This approach is commonly used when the species concept is
debatable, as for fungi or unicellular organisms ((39) Pawlowski et al
2018, (40) Lladó Fernández et al 2019), but has not yet been tested for
vertebrates in species-rich ecosystems.
Here, using a conservative and stringent bioinformatic pipeline, we show
that the diversity of OTUs is a weak and biased estimator of species
diversity, with species-rich families being strongly underrepresented.
To overcome this limitation, we propose to rely on OTU accumulation
curves, which provide an unbiased estimate of regional fish diversity
and fish richness within families. The asymptotes underestimate the
regional fish species richness, but the bias is highly consistent among
families (Figure 5f). We thus propose to extend this method to taxonomic
inventories in poorly sampled ecosystems, such as the deep sea, to
estimate diversity at different taxonomic levels.
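To make the accumulation-curve idea concrete, the sketch below averages cumulative OTU richness over random sample orderings and then extrapolates an asymptote. The asymptotic model is an assumption: the text does not name one, so a Michaelis-Menten curve S(n) = Smax * n / (K + n) fitted by double-reciprocal linearization is used here purely for illustration. Function names and data are hypothetical.

```python
import random


def accumulation_curve(samples, n_perm=100, seed=0):
    """Mean cumulative OTU richness over random orderings of the samples.

    samples: list of sets of OTU identifiers, one set per sample.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(samples)
    for _ in range(n_perm):
        order = samples[:]
        rng.shuffle(order)
        seen = set()
        for i, otus in enumerate(order):
            seen |= otus
            totals[i] += len(seen)
    return [t / n_perm for t in totals]


def fit_asymptote(curve):
    """Fit S(n) = Smax * n / (K + n) by regressing 1/S on 1/n.

    Assumed model (not taken from the study). Returns (Smax, K).
    Requires at least two points and strictly positive richness values.
    """
    xs = [1.0 / (i + 1) for i in range(len(curve))]
    ys = [1.0 / s for s in curve]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx  # equals 1 / Smax
    if intercept <= 0:
        raise ValueError("curve shows no sign of saturation")
    return 1.0 / intercept, slope / intercept


# Toy samples: the curve starts at 2 OTUs and ends at 4 OTUs.
toy = [{"otu1", "otu2"}, {"otu2", "otu3"}, {"otu3", "otu4"}]
print(accumulation_curve(toy))

# Synthetic curve generated from Smax = 100, K = 5; the fit recovers
# values very close to these.
synthetic = [100 * n / (5 + n) for n in range(1, 11)]
smax, k = fit_asymptote(synthetic)
print(round(smax, 3), round(k, 3))
```

The double-reciprocal fit gives heavy weight to the first few samples, so a nonlinear least-squares fit or a non-parametric richness estimator may be preferable on real data. Running the same procedure on the OTUs of a single family would give a family-level asymptote.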