Introduction
We are in the middle of a biodiversity crisis in which anthropogenic change is driving many species to extinction, often faster than they can be characterized (see e.g. Ceballos et al., 2020). Identifying the species in our environments is therefore paramount to informing conservation policy and practice. The development of DNA barcoding (Hebert et al., 2003) was a major step towards large-scale characterization of biodiversity. This technique amplifies standardized genetic regions to identify the species present within biological samples. Besides documenting biodiversity, DNA barcoding and other amplicon-sequencing technologies have been widely used for monitoring invasive species, detecting pathogens in environmental samples, and many other applications in taxonomy, medicine and evolutionary biology (reviewed e.g. in Kress et al., 2015).
Third-generation sequencing platforms can sequence millions of single molecules, with read lengths of up to several Mb (Jain et al., 2018). Currently, two platforms are readily available for DNA barcoding efforts: PacBio’s Sequel II and ONT’s MinION. These platforms offer the advantage of longer reads, at the cost of more sequencing errors. While ONT’s MinION still shows higher error rates of >5% (Wick et al., 2018), the new PacBio HiFi mode allows for the generation of reads with <1% error (Wenger et al., 2019), which will greatly improve the generation of accurate DNA barcodes. Early on, researchers identified the potential of third-generation sequencing platforms for sequencing much longer DNA barcodes than previously possible (see e.g. Krehenwinkel et al., 2019a; Tedersoo et al., 2018; Wurzbacher et al., 2019). Besides the longer amplicon length, ONT’s MinION also offers the advantage that sequencing can be carried out almost anywhere in the world, owing to its small size and affordability (reviewed in Krehenwinkel et al., 2019b). While there has been a considerable software development effort to assemble high-quality amplicon consensus sequences from error-prone ONT MinION reads (see e.g. Maestri et al., 2019; Seah et al., 2020; Srivathsan et al., 2019; reviewed in Krehenwinkel et al., 2019b), only a few software solutions are available for PacBio-based DNA barcodes (see e.g. Wurzbacher et al., 2019). To our knowledge, of these, only the pipeline presented in Wurzbacher et al. (2019) is able to handle both PacBio and ONT sequencing reads.
Here, we present NGSpeciesID, a one-software solution for reconstructing high-quality amplicon consensus sequences from both PacBio and ONT sequencing reads. We also investigate the performance of ONT’s Medaka polishing software compared to Racon (Vaser et al., 2017) for MinION-based DNA barcoding. Compared to other programs, NGSpeciesID can be easily installed with conda, does not require any specific file name structure, can handle data from both third-generation sequencing platforms, includes different consensus polishing options, and only needs fastq files as input. We show that our tool produces consensus sequences of a quality similar to those of other software solutions, while reducing the burden on users by requiring little to no additional tools or data reformatting.