Step 4: Bioinformatic analysis to compare sequencing reads to reference databases
Bioinformatic procedures are required to assign taxonomic identifications to the sequence reads. There are numerous bioinformatic pipelines for analysing amplicon sequencing data, including but not limited to QIIME2 (Bolyen et al., 2019), dada2 (Callahan et al., 2016), VSEARCH (Rognes, Flouri, Nichols, Quince, & Mahe, 2016), USEARCH (Edgar, 2010), and OBItools (Boyer et al., 2016), as well as custom pipelines developed specifically for pollen metabarcoding (e.g., Ford & Jones, 2020). All these pipelines share a few main steps, which are not necessarily performed in the order described here. 1) Illumina sequencing runs usually produce paired-end data. In many pipelines, the forward and reverse reads are merged, low-quality reads are removed, and low-quality bases are trimmed from reads. Some paired-end read mergers such as PEAR (Zhang, Kobert, Flouri, & Stamatakis, 2014) also remove adapter sequences (technical sequences added during HTS library preparation). 2) Reads are demultiplexed based on tags included in the metabarcoding primers, or tags that are added to each sample during HTS library preparation. 3) Reads are dereplicated and denoised so that sequences that likely differ only by PCR errors are grouped together. Some approaches, such as inference of Amplicon Sequence Variants (ASVs) (Callahan et al., 2016), have been applied to pollen (Casanelles‐Abella et al., 2021; Elliott et al., 2020; Wilson et al., 2021). 4) Taxonomy is assigned to the sequences using reference databases. 5) Finally, a taxon table or mOTU table is created, containing the number of reads assigned to each taxon per sample. The taxon table can be further filtered to remove noise and contaminants based on extraction and PCR controls.
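As a rough illustration of steps 3 to 5, the following Python sketch dereplicates reads, builds a taxon table, and filters it against a negative control. It is a toy example under stated assumptions, not any of the pipelines cited above: taxonomy is assigned by exact sequence lookup rather than by a real classifier, the control filter (subtracting the highest read count seen for a taxon in any control) is only one simple heuristic, and all sample names, sequences, and function names are hypothetical.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical sequences and count how often each occurs."""
    return Counter(reads)


def build_taxon_table(samples, reference):
    """Return {sample: {taxon: read_count}}.

    Assumption: `reference` maps an exact sequence to a taxon name.
    Real pipelines use alignment- or k-mer-based classifiers instead.
    """
    table = {}
    for sample, reads in samples.items():
        counts = Counter()
        for seq, n in dereplicate(reads).items():
            counts[reference.get(seq, "unassigned")] += n
        table[sample] = dict(counts)
    return table


def subtract_controls(table, control_names):
    """Subtract, per taxon, the highest read count seen in any control.

    Taxa whose count drops to zero or below are removed from the sample.
    Control samples themselves are left out of the filtered table.
    """
    ceiling = Counter()
    for control in control_names:
        for taxon, n in table.get(control, {}).items():
            ceiling[taxon] = max(ceiling[taxon], n)
    filtered = {}
    for sample, counts in table.items():
        if sample in control_names:
            continue
        filtered[sample] = {
            taxon: n - ceiling[taxon]
            for taxon, n in counts.items()
            if n - ceiling[taxon] > 0
        }
    return filtered


# Hypothetical toy data: one pollen sample and one PCR blank.
samples = {
    "bee_1": ["ACGT", "ACGT", "ACGT", "TTGA", "GGCC"],
    "pcr_blank": ["TTGA"],
}
reference = {"ACGT": "Trifolium repens", "TTGA": "Taraxacum officinale"}

table = build_taxon_table(samples, reference)
filtered = subtract_controls(table, {"pcr_blank"})
print(filtered)
```

In this toy case the single Taraxacum read in the sample is removed because the same taxon appears at the same level in the PCR blank. In practice, thresholds for control-based filtering need to be chosen and justified for each study.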
Limitations, technical issues, and progress
One of the ongoing challenges associated with DNA-based identification of species mixtures is quantifying the relative abundance of species in the mixture. The issue of quantification is likely to be relevant for many applications in assessing global ecological change. Many ecosystem changes initially present as changes in species abundances rather than changes in species composition. Solutions to the technical problem of quantification could differ for pollen relative to other sample types, although the issues are similar in many ways. DNA metabarcoding is considered semi-quantitative. Species in high proportions are usually represented by many sequencing reads, although the relationship often deviates from the expected 1:1 ratio (Bell et al., 2019; Marcel Polling et al., 2022). There are several reasons for this deviation. Different pollen types can have different DNA extraction efficiencies, which can be improved with method optimisation. Variation occurs among species in the copy numbers of the DNA barcodes, and this has been well-studied for microbes, with the possibility to correct for these biases (Kembel, Wu, Eisen, & Green, 2012; Lamb et al., 2019; Pawluczyk et al., 2015) and a recent study on mitogenomics of insects has applied corrections for copy number (L. Garrido-Sanz et al., 2021). An additional source of bias comes from differences among species in primer binding efficiency (Pompanon et al., 2012) and biases in DNA polymerase binding efficiencies towards different nucleotide compositions (Nichols et al., 2018). These biases can be reduced with careful primer design and PCR optimisation and can be corrected with a good understanding of the biases.
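The idea behind a copy-number correction can be sketched in a few lines of Python. This is a simplified illustration, not the method of any study cited above: it assumes that barcode copy number per taxon is known and that it is the only source of bias, which is rarely true in practice. The taxon names and copy numbers are hypothetical.

```python
def copy_number_corrected_proportions(read_counts, copy_numbers):
    """Divide each taxon's reads by its barcode copy number, then renormalise.

    Assumption: `copy_numbers` holds a known, positive copy number for
    every taxon in `read_counts`.
    """
    corrected = {
        taxon: n / copy_numbers[taxon] for taxon, n in read_counts.items()
    }
    total = sum(corrected.values())
    return {taxon: value / total for taxon, value in corrected.items()}


# Hypothetical example: species_A carries three times as many barcode copies.
reads = {"species_A": 600, "species_B": 400}
copies = {"species_A": 3, "species_B": 1}

proportions = copy_number_corrected_proportions(reads, copies)
print(proportions)
```

Here the raw reads suggest a 60:40 mixture, while the corrected proportions are one third and two thirds, showing how strongly copy-number variation alone can distort apparent abundances.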
Amplification-free methods eliminate the PCR biases and have been shown in a handful of studies to be more quantitative than DNA metabarcoding. Whole-genome shotgun (WGS) sequencing has been shown to have improved quantification for pollen (Bell, Petit, et al., 2021) and other mixtures of eukaryote species (Lidia Garrido-Sanz, Senar, & Piñol, 2020). Genome-skimming of organellar DNA from WGS has been shown to be quantitative for pollen (Lang et al., 2019) and other eukaryote mixtures (Bista et al., 2018), and quantification can be improved by correcting for organelle copy number (L. Garrido-Sanz et al., 2021). Reduced-representation sequencing using endonucleases (genotyping-by-sequencing) of plant roots has shown within- and across-species abundances strongly correlate with biomass-based species abundance (Wagemaker et al., 2021). Reverse Metagenomics, the sequencing of samples using MinION long reads while reference sequences come from short read skims, has also been found to be semi-quantitative (Peel et al., 2019). While most DNA-based detection and identification methods are semi-quantitative, there is considerable value in the relative read abundances, which are lost by treating data as presence-absence (Deagle et al., 2019). Improved quantification is expected to become possible in the future with an improved understanding of biases.
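The information lost when read counts are reduced to presence-absence can be seen in a minimal Python sketch. The taxon names, read counts, and the `min_reads` threshold are hypothetical and chosen only for illustration.

```python
def relative_read_abundance(counts):
    """Convert read counts for one sample into proportions of total reads."""
    total = sum(counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}


def presence_absence(counts, min_reads=1):
    """Score a taxon as 1 if it has at least `min_reads` reads, else 0."""
    return {taxon: int(n >= min_reads) for taxon, n in counts.items()}


# Hypothetical sample dominated by one taxon.
sample = {"taxon_A": 900, "taxon_B": 90, "taxon_C": 10}

rra = relative_read_abundance(sample)
pa = presence_absence(sample)
print(rra)
print(pa)
```

All three taxa score 1 under presence-absence, whereas the relative read abundances retain the 90-fold difference between the most and least abundant taxon, even if that difference is only a semi-quantitative signal.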
A related problem is understanding the sensitivity of DNA metabarcoding and the expected detection limits for species of interest, and the rates of false positives and false negatives. This issue may be particularly relevant to biosurveillance and ecosystem monitoring applications, where researchers and managers may be interested in changes in the presence or absence of low abundance species, such as a rare species becoming extinct or early detections of new incursions of non-native invasive species. Acceptable levels of false positives and false negatives will differ among applications. For example, for a risk-averse strategy detecting invasive species, it is important to avoid false negatives. In contrast, for detecting threatened species, a more risk-averse approach would avoid false positives. In both cases, a level of confidence is needed for the detection of a target species. These issues have been addressed with methods for the eDNA of water samples. For example, eDNA has been combined with site occupancy models to determine the confidence of presence/absence results (Dorazio & Erickson, 2018; Schmidt et al., 2013), and similar methods would be applicable to pollen.
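A much-simplified version of the reasoning behind replicate-based confidence can be written down directly. The Python sketch below is not a site occupancy model: it assumes that the species is present, that replicates are independent, that each replicate detects the species with the same probability p, and that there are no false positives. Under those assumptions the probability of at least one detection in n replicates is 1 - (1 - p)^n. The function names are hypothetical.

```python
import math


def prob_at_least_one_detection(p, n):
    """Probability of detecting a present species in at least one of n replicates."""
    return 1.0 - (1.0 - p) ** n


def replicates_needed(p, confidence):
    """Smallest n for which the detection probability reaches `confidence`."""
    if not (0.0 < p < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("p and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


# Hypothetical per-replicate detection probability of 0.5.
n_required = replicates_needed(0.5, 0.95)
achieved = prob_at_least_one_detection(0.5, n_required)
print(n_required, achieved)
```

Occupancy models extend this idea by estimating p and the probability of presence jointly from the data, and some formulations also account for false positives.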
Improved confidence in the presence of a species in a sample can be obtained by understanding the overall rate of false positives and false negatives for the study system and method. Researchers can increase confidence by using field-based and laboratory-based negative controls, positive controls or mock communities, and no-library negative controls to quantify sequencing mistag rates (Esling, Lejzerowicz, & Pawlowski, 2015). However, confidence estimates are also lacking for the classification steps in pollen DNA metabarcoding. There is an additional need for developing classification programs with more accurate probabilistic confidence estimates. While this has been attempted several times, the results of available methods vary with the gene regions and databases used (Edgar, 2018).
Another challenge for pollen DNA metabarcoding and related methods is the development of reference databases. There are an estimated 450,000 angiosperm species (Pimm & Joppa, 2015), and currently, around 25% of these have publicly available sequences for standard DNA barcodes (Bell, Petit, et al., 2021). Reference libraries have been compiled for standard DNA barcodes for all flowering plants in the UK (Jones, Twyford, et al., 2021) and Canada (Kuzmina et al., 2017). There is ongoing work in other countries to develop national databases. There are fewer references available for non-standard DNA barcodes, plastomes, genome skims, and assembled genomes (Bell, Petit, et al., 2021; Lang et al., 2019). Several large-scale projects are in progress to sequence DNA barcodes, organellar genomes and whole genomes for a large proportion of global biodiversity (Lewin et al., 2018), and therefore, the availability of reference sequences is continually improving. Additional problems occur with the quality of publicly available sequences. While quality control standards for databases such as BOLD are high, databases such as GenBank rely on the researchers who deposit data to conduct their own quality control. Only minimal checks are performed, and many erroneous sequences have been found (Breitwieser, Pertea, Zimin, & Salzberg, 2019). For most studies, it will be necessary to develop a custom, curated database, including sequencing any species in the study system that do not already have reliable sequences on public databases. Filtering and subsetting public reference databases to species of interest (e.g., the regional species pool) can be a helpful step in classification to help avoid misclassifications to closely related species (A. Keller et al., 2020). This step is likely to remain useful, even as reliable public databases become more complete.
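Subsetting a reference database to a regional species pool can be sketched as follows. This Python example is illustrative only: it assumes FASTA headers of the form "accession Genus species ...", which is a hypothetical layout (real databases differ and usually need their own header parsing), and the accessions, sequences, and species list are invented for the example.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subset_to_species_pool(records, species_pool):
    """Keep only records whose species name is in the regional pool."""
    pool = {name.lower() for name in species_pool}
    kept = []
    for header, seq in records:
        fields = header.split()
        # Assumed header layout: "<accession> <Genus> <species> ..."
        if len(fields) >= 3 and f"{fields[1]} {fields[2]}".lower() in pool:
            kept.append((header, seq))
    return kept


# Hypothetical reference sequences and regional species list.
fasta_text = """>REF001 Trifolium repens rbcL
ACGTACGT
ACGT
>REF002 Acer rubrum rbcL
TTGACCAA
>REF003 Trifolium pratense rbcL
GGCCTTAA
"""
regional_pool = ["Trifolium repens", "Trifolium pratense"]

subset = subset_to_species_pool(parse_fasta(fasta_text), regional_pool)
for header, seq in subset:
    print(header, seq)
```

A species list that is too narrow risks forcing reads from unexpected taxa onto the wrong reference, so the regional pool itself needs to be compiled with care.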
The ability to assess global ecological change often relies on the comparison of contemporary data to historical data. Pollen DNA metabarcoding methods have only been developed recently, so there are no equivalent historical datasets. Baseline data from earlier studies based on other methods may not be directly comparable to pollen DNA metabarcoding. For example, studies have shown that the networks assembled through pollen DNA metabarcoding are more connected than those assembled through observations (Arstingstall et al., 2021; Pornon et al., 2017), although networks assembled through traditional identification methods of pollinator-collected pollen may be more like those assembled through DNA metabarcoding, i.e., more connected (Bosch et al., 2009). There is potential to obtain equivalent historical data by analysing historical specimens using DNA sequencing technologies (Gous et al., 2019; Simanonok et al., 2021). Still, assessments are needed to see if there are any biases due to degradation over time. Likewise, pollen in sediments could provide a source of material for comparison with modern pollen DNA metabarcoding or other DNA-based methods (Niemeyer et al., 2017). Finally, as we move into the future, it will be essential to retain and archive specimens for optimum preservation to be reanalysed and compared to future samples.
To date, there has been little standardisation in the methods used for sequence-based characterisation of pollen communities in terms of databases, classifiers, gene regions, or other options. There is also inconsistency in how the methods are described and results reported in the literature, making comparisons of techniques difficult. While this allows for greater flexibility and more scope for further method development, standardisation would facilitate consistent benchmarking of procedures to improve confidence. Once again, the fields of aquatic eDNA and microbiomics are more advanced in this regard, with several recommendations and standardised methods available (Loeza‐Quintana, Abbott, Heath, Bernatchez, & Hanner, 2020; Yilmaz et al., 2011). There are many choices on methodology at all stages of a research project, including gene region(s), DNA extraction, PCR, sequencing technologies, and data analysis. Standardisation becomes possible once available methods have been compared, assessed, and optimised. From this, minimum sets of standards can be determined for replication, negative and positive controls, and optimal choices across sampling, sequencing, and data analysis steps. These standards can then be applied when designing a project and when assessing papers during peer review. Some methods for the various steps involved in pollen DNA metabarcoding have been compared (Swenson & Gemeinholzer, 2021; Tommasi, Ferrari, Labra, Galimberti, & Biella, 2021). However, a lot of work remains to be done before any general recommendations can be made.
Future research directions
In addition to the work currently in progress to solve the technical issues discussed in section 4, we have identified several areas where method development on pollen DNA metabarcoding and related methods could open new avenues of research. These include method developments for intraspecific identification, analysis of DNA from ancient pollen, increased use of specimens available in museums and herbaria, and increased use of newer sequencing technologies.
Intraspecific identifications (e.g., populations, individuals) using DNA sequencing of pollen could open new research areas on the role of pollinators in gene flow, and the effects of habitat fragmentation on plant and pollinator health and adaptive potential. Recent developments with eDNA suggest that this is plausible for pollen. For example, eDNA has been used to investigate intraspecific variation in the mitochondrial control region of whale sharks in sea water (Sigsgaard et al., 2016), cytb variation in harbour porpoises (Parsons, Everett, Dahlheim, & Park, 2018), and microsatellite allele frequencies in artificial mesocosms of round gobies (Neogobius melanostomus) (Andres, Sethi, Lodge, & Andres, 2021). Genomic DNA from Late Pleistocene bears was retrieved from shotgun data, showing that it might be possible to use ancient eDNA for intraspecific analysis (Pedersen et al., 2021). Applying target capture of mitochondrial and nuclear DNA in eDNA is also a promising avenue of research (Jensen et al., 2021). All the previous investigations into intraspecific variation from eDNA have been conducted in aquatic study systems and cave sediments and, as far as we know, none have been attempted on plant communities or pollen. Pollen mixtures might be the ideal candidate system to investigate intraspecific variation. Because each pollen grain in a mixture represents an individual gametophyte, any method focusing on the genetic content of a single pollen grain can be used to investigate the mixture on a population level. While population-level inferences might be possible without sequencing single pollen grains, many population genomics tools require sequence data of individuals.
To overcome problems with single pollen grain sequencing, such as low template abundance and uncertainties about whether DNA is coming from inside the pollen or from fragments sticking to the outside, it might be beneficial to germinate the pollen to make accessing the DNA easier (Jayaprakash, 2018). Current advances in the read output and length of HTS platforms, coupled with the potential to sequence individual pollen grains, might advance molecular pollen analysis beyond species identification. As soon as the DNA inside individual pollen grains can be accessed, amplified, and sequenced, the road to a whole new set of research questions can be opened. For example, population genomics could be applied not only to count the number of species, but also to quantify how many individual plants a pollinator has visited.
The high level of preservation of pollen morphology in ancient sedimentary records means that pollen is often used to understand past ecosystems. Coupling these methods with ancient DNA technology could provide narrower taxonomic identification and a finer-scale understanding of past ecosystems. Although it has been demonstrated that DNA is present in ancient pollen and can be sequenced (Bennett & Parducci, 2006; L. Parducci et al., 2005), ancient pollen samples are difficult to process and there is a high risk of contamination with exogenous DNA. There are many different approaches for isolating and cleaning single pollen grains from the abundant pollen usually present in sediments. These include hand pipetting under a microscope, serial dilution, flow-assisted cell sorting (flow cytometry), microfluidic manipulation (Wang & Navin, 2015), and flow sorting or micro-manipulation (Kron & Husband, 2012). Potential method development in this area could focus on improved efficiency and contamination control. Once these issues are addressed, there is great promise for ancient pollen analysis to reveal ecosystem change over large time scales.
Likewise, there is currently an unrealised potential to use herbarium and museum specimens to answer questions in global change ecology. Currently, a handful of studies have shown that preserved specimens can be used to study plant-pollinator interactions. Methods have been developed for identifying historical pollen samples from bees using metabarcoding (Gous et al., 2019). Museum specimens can be used to assess resource use over huge temporal scales. For example, Simanonok et al. (2021) identified resource use by Bombus affinis over 100 years, and found that pollen richness and diversity had not changed over time and neither had the abundance of native or introduced species. This work helps understand drivers behind the decline of native pollinator species. Historical specimens can be used not only for temporal analysis of forage use but also to increase sample size for spatial analysis. For example, Gous, Eardley, Johnson, Swanevelder, and Willows-Munro (2021) sampled pollen from museum specimens of multiple species from varying regions in South Africa across a large temporal scale to assess the relationship between resource use and geographic range of species. There is similar potential to use herbarium specimens to look at past distributions and morphology of plants to check for plant-pollinator mismatches.
Finally, future advances in sequencing technology could influence how pollen DNA metabarcoding and related methods are used for studying ecological change. New fast, portable sequencing technologies, such as Oxford Nanopore Technologies’ MinION, could allow for analysis while in the field. Increased throughput of short-read sequencing technologies (e.g., Illumina NovaSeq) could make high-throughput analysis of pollen samples faster and more cost-effective. Improvements in the accuracy of PacBio HiFi long-read sequencing could make it suitable for DNA metabarcoding and related methods. However, many of these sequencing platforms currently require a very high quality and quantity of DNA. As sequencing technologies improve, they may require lower quantities of DNA, making them increasingly useful for pollen DNA analysis, especially of pollen from small insects, single pollen grains, or ancient pollen.
Metagenomics methods depend on reference databases of whole genomes or whole plastid genomes. However, new initiatives, such as the Earth BioGenome Project, which aims to sequence the genomes of most eukaryote species within a decade (Lewin et al., 2018), have the potential to alleviate this limitation in the future. In addition, there are metagenomics methods based on long-read sequencing that could improve the resolution and quantification of pollen identification, without the need for whole genome reference databases. For example, the RevMet (reverse metagenomics) method of Peel et al. (2019) uses long-read nanopore sequences of samples, while the reference database contains low coverage short-read genome skims. This strategy could potentially increase resolution and quantification relative to DNA metabarcoding and could have cost advantages over other metagenomics methods as it does not require assembled genomes for the reference database. An alternative approach is provided by reduced-representation sequencing for metagenomics. For example, the sequencing of restriction fragments (ddRAD) has been used to identify plant community composition from roots in soil samples (Wagemaker et al., 2021), and the same methods could easily be applied to pollen mixtures. Given the rapid advancements in sequencing technology and the increasing availability of reference sequences and genomes, method development for DNA-based identification of pollen should be considered a work in progress; each new study will need to carefully consider the strengths and weaknesses of the methods available or likely to become available during the study.
Conclusions
In recent years, there have been increasing numbers of studies applying pollen DNA metabarcoding and related methods to research questions on global change ecology (Fig. 1). Pollen DNA metabarcoding has provided advantages over other methods in throughput and requiring less taxonomic expertise. The potential applications of pollen DNA metabarcoding and related methods are likely to increase as reference databases improve, methods are assessed against traditional approaches and standardised, and as multi-year datasets are accumulated.
Currently, pollen DNA metabarcoding is complementary to traditional methods, such as microscopic identification of pollen and direct observation of plant-pollinator interactions. This is particularly the case where baseline data exist and have been collected through traditional methods. However, there are cases where samples have been captured and preserved, e.g., museum specimens of pollinators (Gous et al., 2021; Simanonok et al., 2021) and slides of pollen collections (Marcel Polling et al., 2022), and these have been analysed through DNA metabarcoding, providing scope to analyse contemporary and historical samples, and assess changes from baseline ecological conditions. As more long-term studies are completed using DNA metabarcoding, these methods could be used to understand recent change. Under these scenarios, pollen DNA metabarcoding and related methods could eventually become more predominant, especially given the benefits in terms of high throughput for large sample sizes, and in combination with other newer technologies, such as flow cytometry (Kron et al., 2021) and machine-learning classifications (Gonçalves et al., 2016).
Global ecological change is happening rapidly, and high-throughput methods are essential for getting timely data on changes so that management practices can be assessed and changed as required. Pollen DNA metabarcoding and related methods are important tools for rapid, high-throughput assessment of ecosystem change, providing real-time management recommendations to preserve biodiversity and the evolutionary and ecological processes that support it before it is too late.