Mixed samples
While NGSpeciesID was not designed specifically for metabarcoding data,
the flexibility of the algorithmic steps in the pipeline enables the
tool to handle mixed samples. We recovered seven consensus sequences
corresponding to the seven DNA barcodes pooled in the mixed sample
analysis. NGSpeciesID generated highly accurate consensus sequences for
all barcodes, with accuracies ranging from 99.2% to 100%. For the mixed
sample test we lowered the read abundance ratio required per cluster to
5%, since seven barcodes at equal abundance each account for only about
14% of the reads in the sample. With the default abundance cutoff of 10%,
a cluster would need to contain 210 of the 300 reads available for its
barcode (10% of the 2,100 pooled reads), which might not be the case.
Three out of seven barcodes showed a slightly
lower consensus accuracy than in the respective single-species analysis.
This is likely due to reads from other barcodes being present in the
clusters, which might have affected the polishing accuracy, and to the
random selection of the 300 reads for each barcode (as individual read
error rates can differ). We expect some cross-contamination (reads
assigned to the wrong cluster), especially for closely related species.
However, such cross-contamination should decrease as the read accuracy
of third-generation sequencing continues to improve. This experiment shows that
NGSpeciesID, even though it was not developed for mixed samples, can
recover highly accurate consensus sequences from metabarcoding data.
However, its performance on metabarcoding data will need to be
investigated separately using mock datasets with varying abundance
ratios and varying degrees of taxonomic divergence among the pooled
species.
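
The abundance cutoff arithmetic discussed above can be checked with a short sketch. It assumes that the seven barcodes contribute 300 reads each (2,100 pooled reads) and that the cutoff is applied as a percentage of all pooled reads, rounded up. The function name min_cluster_size and the rounding rule are illustrative assumptions, not part of NGSpeciesID.

```python
# Illustrative sketch only: min_cluster_size is a hypothetical helper,
# and rounding up is an assumption about how a cutoff would be applied.

BARCODES = 7             # barcodes pooled in the mixed sample
READS_PER_BARCODE = 300  # reads randomly selected per barcode


def min_cluster_size(total_reads: int, cutoff_percent: int) -> int:
    """Smallest cluster size that reaches cutoff_percent of total_reads.

    Uses integer ceiling division to avoid floating-point rounding issues.
    """
    return (total_reads * cutoff_percent + 99) // 100


total_reads = BARCODES * READS_PER_BARCODE

for cutoff in (10, 5):
    needed = min_cluster_size(total_reads, cutoff)
    # Fraction of a single barcode's reads that must land in one cluster
    share = needed / READS_PER_BARCODE
    print(f"{cutoff}% cutoff: {needed} of {READS_PER_BARCODE} reads "
          f"per barcode must cluster together ({share:.0%})")
```

Under these assumptions, the 5% cutoff halves the number of reads that must end up in a single cluster, which leaves more room for reads lost to clustering errors or assigned to other clusters.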