Step 4: Bioinformatic analysis to compare sequencing reads to reference databases
Bioinformatic procedures are required to assign taxonomic
identifications to the sequence reads. There are numerous bioinformatic
pipelines for analysing amplicon sequencing data including but not
limited to QIIME2 (Bolyen et al., 2019), DADA2 (Callahan et al., 2016),
VSEARCH (Rognes, Flouri, Nichols, Quince, & Mahé, 2016), USEARCH
(Edgar, 2010), and OBITools (Boyer et al., 2016), as well as custom
pipelines developed specifically for pollen metabarcoding (e.g., Ford
and Jones, 2020). All these pipelines share a few main steps, which are
not necessarily performed in the same order as described here. 1) In
most cases, paired-end data are obtained from Illumina sequencing runs.
In many pipelines, the forward and reverse reads are merged, low-quality
reads are removed, and low-quality bases are trimmed from reads. Some
paired-end read mergers, such as PEAR (Zhang, Kobert, Flouri, &
Stamatakis, 2014), also remove adapter sequences (technical sequences
added during HTS library preparation). 2) Reads are demultiplexed based
on tags included in the metabarcoding primers, or tags that are added to
each sample during HTS library preparation. 3) Reads are dereplicated
and denoised to group together sequences that likely differ only by PCR
errors. Denoising approaches that infer amplicon sequence variants
(ASVs) (Callahan et al., 2016) have been applied to pollen
(Casanelles‐Abella et al., 2021; Elliott et al., 2020; Wilson et al.,
2021). 4) Reads are assigned to taxa by comparison with reference
databases. Finally, 5) a taxon table or mOTU table is created,
containing the number of reads assigned to each taxon per sample. The
taxon table can be further filtered to remove noise and contaminants
based on extraction and PCR controls.
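As an illustration of steps 3 to 5, the following Python sketch
dereplicates reads, builds a taxon table, and applies one simple
control-based filter. It is a minimal sketch under stated assumptions,
not the procedure of any pipeline cited above: the function names, the
sequences, the sample names, and the filtering rule (discard a taxon from
a sample unless its read count exceeds the maximum seen in any negative
control) are all hypothetical, and denoising and real taxonomic
classification are omitted.

```python
from collections import Counter, defaultdict


def dereplicate(reads):
    """Collapse identical sequences and count how often each one occurs."""
    return Counter(reads)


def build_taxon_table(sample_reads, assignments):
    """Sum read counts per taxon for each sample.

    sample_reads: dict mapping sample name -> list of read sequences
    assignments:  dict mapping sequence -> taxon name (stand-in for the
                  output of a real classifier; hypothetical)
    """
    table = defaultdict(Counter)
    for sample, reads in sample_reads.items():
        for seq, count in dereplicate(reads).items():
            taxon = assignments.get(seq, "unassigned")
            table[sample][taxon] += count
    return {sample: dict(counts) for sample, counts in table.items()}


def filter_by_controls(table, control_samples):
    """Drop taxon counts that do not exceed the maximum count observed
    for that taxon in any negative control (an assumed, simple rule)."""
    max_in_controls = Counter()
    for control in control_samples:
        for taxon, count in table.get(control, {}).items():
            max_in_controls[taxon] = max(max_in_controls[taxon], count)
    filtered = {}
    for sample, counts in table.items():
        if sample in control_samples:
            continue  # controls are not reported in the final table
        filtered[sample] = {
            taxon: count
            for taxon, count in counts.items()
            if count > max_in_controls[taxon]
        }
    return filtered


# Toy data: the sequences and sample names are invented for illustration.
sample_reads = {
    "bee_01": ["ACGT", "ACGT", "ACGT", "TTGA"],
    "pcr_blank": ["TTGA"],
}
assignments = {"ACGT": "Trifolium repens", "TTGA": "Taraxacum officinale"}

table = build_taxon_table(sample_reads, assignments)
filtered = filter_by_controls(table, {"pcr_blank"})
print(filtered)  # {'bee_01': {'Trifolium repens': 3}}
```

In this toy example the single Taraxacum read in the bee sample is
removed because the same taxon appears at an equal count in the PCR
blank. Real studies often use more nuanced thresholds.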
Limitations, technical issues, and progress
One of the ongoing challenges associated with DNA-based identification
of species mixtures is quantifying the relative abundance of species in
the mixture. The issue of quantification is likely to be relevant for
many applications in assessing global ecological change. Many ecosystem
changes initially present as changes in species abundances rather than
changes in species composition. Solutions to the technical problem of
quantification could differ for pollen relative to other sample types,
although the issues are similar in many ways. DNA metabarcoding is
considered semi-quantitative. Species in high proportions are usually
represented by many sequencing reads, although the relationship often
deviates from the expected 1:1 ratio (Bell et al., 2019; Marcel Polling
et al., 2022). There are several reasons for this deviation. Different
pollen types can have different DNA extraction efficiencies, which can
be improved with method optimisation. The copy number of DNA barcodes
also varies among species. This variation has been well studied in
microbes, and corrections for the resulting biases are possible (Kembel,
Wu, Eisen, & Green, 2012; Lamb et al., 2019; Pawluczyk et al., 2015). A
recent study on mitogenomics of insects has also applied corrections for
copy number (L. Garrido-Sanz et al., 2021). An additional source of bias
comes from differences among species in primer binding efficiency
(Pompanon et al., 2012) and biases in DNA polymerase binding
efficiencies towards different nucleotide compositions (Nichols et al.,
2018). These biases can be reduced with careful primer design and PCR
optimisation and can be corrected with a good understanding of the
biases.
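The idea behind a copy-number correction can be sketched in a few lines
of Python. This is a simplified illustration rather than the method of
any study cited above: it assumes that the barcode copy number per
genome is known for every taxon, and the taxon names and values are
invented. Read counts are divided by copy number and then rescaled to
proportions.

```python
def correct_for_copy_number(read_counts, copy_numbers):
    """Divide read counts by barcode copy number and rescale to proportions.

    read_counts:  dict mapping taxon -> number of reads
    copy_numbers: dict mapping taxon -> barcode copies per genome
                  (assumed known; values here are hypothetical)
    """
    corrected = {}
    for taxon, reads in read_counts.items():
        copies = copy_numbers[taxon]  # KeyError if a taxon has no estimate
        if copies <= 0:
            raise ValueError(f"copy number for {taxon!r} must be positive")
        corrected[taxon] = reads / copies
    total = sum(corrected.values())
    if total == 0:
        raise ValueError("no reads to normalise")
    return {taxon: value / total for taxon, value in corrected.items()}


# Taxon A has three barcode copies per genome, taxon B has one.
reads = {"taxon_A": 600, "taxon_B": 400}
copies = {"taxon_A": 3, "taxon_B": 1}

proportions = correct_for_copy_number(reads, copies)
print({t: round(p, 3) for t, p in proportions.items()})
# {'taxon_A': 0.333, 'taxon_B': 0.667}
```

Before correction, taxon A accounts for 60% of reads. After correction
it accounts for about a third, because each of its genomes contributes
three times as many barcode templates.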
Amplification-free methods eliminate the PCR biases and have been shown
in a handful of studies to be more quantitative than DNA metabarcoding.
Whole-genome shotgun (WGS) sequencing has been shown to have improved
quantification for pollen (Bell, Petit, et al., 2021) and other mixtures
of eukaryote species (Lidia Garrido-Sanz, Senar, & Piñol, 2020).
Genome-skimming of organellar DNA from WGS has been shown to be
quantitative for pollen (Lang et al., 2019) and other eukaryote mixtures
(Bista et al., 2018), and quantification can be improved by correcting
for organelle copy number (L. Garrido-Sanz et al., 2021).
Reduced-representation sequencing using endonucleases
(genotyping-by-sequencing) of plant roots has shown that within- and
across-species abundances correlate strongly with biomass-based species
abundance (Wagemaker et al., 2021). Reverse metagenomics, in which
samples are sequenced with MinION long reads and compared against
reference genome skims built from short reads, has also been found to be
semi-quantitative (Peel et al., 2019). While most DNA-based detection
and identification methods are semi-quantitative, relative read
abundances carry considerable information that is lost when data are
treated as presence-absence (Deagle et al., 2019). Improved
quantification is expected to become possible in the future with an
improved understanding of biases.
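The information lost by converting to presence-absence can be seen in a
small Python sketch. The table below is invented, and the function names
and the one-read detection threshold are assumptions made for
illustration only.

```python
def relative_read_abundance(table):
    """Convert per-sample read counts to within-sample proportions."""
    rra = {}
    for sample, counts in table.items():
        total = sum(counts.values())
        rra[sample] = {
            taxon: (count / total if total else 0.0)
            for taxon, count in counts.items()
        }
    return rra


def presence_absence(table, min_reads=1):
    """Score a taxon as present (1) if it has at least min_reads reads."""
    return {
        sample: {taxon: int(count >= min_reads) for taxon, count in counts.items()}
        for sample, counts in table.items()
    }


# Two hypothetical samples with opposite dominance of the same two taxa.
table = {
    "sample_1": {"taxon_A": 90, "taxon_B": 10},
    "sample_2": {"taxon_A": 10, "taxon_B": 90},
}

print(relative_read_abundance(table)["sample_1"])  # {'taxon_A': 0.9, 'taxon_B': 0.1}
print(presence_absence(table)["sample_1"])         # {'taxon_A': 1, 'taxon_B': 1}
```

The two samples differ strongly in relative read abundance, yet their
presence-absence profiles are identical.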
A related problem is understanding the sensitivity of DNA metabarcoding
and the expected detection limits for species of interest, and the rates
of false positives and false negatives. This issue may be particularly
relevant to biosurveillance and ecosystem monitoring applications, where
researchers and managers may be interested in changes in the presence or
absence of low abundance species, such as a rare species becoming
extinct or early detections of new incursions of non-native invasive
species. Acceptable levels of false positives and false negatives will
differ among applications. For example, for a risk-averse strategy
detecting invasive species, it is important to avoid false negatives. In
contrast, for detecting threatened species, a more risk-averse approach
would avoid false positives. In both cases, a level of confidence is
needed for the detection of a target species. These issues have been
addressed in studies of eDNA from water samples. For example, eDNA has
been combined with site occupancy models to determine the confidence of
presence/absence results (Dorazio & Erickson, 2018; Schmidt et al.,
2013), and similar methods would be applicable to pollen.
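One ingredient of such confidence estimates can be illustrated with a
short Python sketch. It is not a full site occupancy model and is not
taken from the studies cited above. It assumes that a species is truly
present, that each replicate (for example, a PCR replicate) detects it
independently with the same probability p, and it asks how many
replicates are needed to reach a target probability of at least one
detection. The function names and the example values are hypothetical.

```python
import math


def cumulative_detection_probability(p, n_replicates):
    """Probability of at least one detection in n independent replicates,
    given a per-replicate detection probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    return 1.0 - (1.0 - p) ** n_replicates


def replicates_needed(p, target=0.95):
    """Smallest number of replicates giving at least the target
    probability of one or more detections."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


# With a 50% chance of detection per replicate (an invented value):
n = replicates_needed(0.5, target=0.95)
print(n)                                         # 5
print(cumulative_detection_probability(0.5, n))  # 0.96875
```

Occupancy models go further by estimating p and the probability of
presence from the data, rather than assuming them.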
Improved confidence in the presence of a species in a sample can be
obtained by understanding the overall rate of false positives and false
negatives for the study system and method. Researchers can increase
confidence by using field-based and laboratory-based negative controls,
positive controls or mock communities, and no-library negative controls
to quantify sequencing mistag rates (Esling, Lejzerowicz, & Pawlowski,
2015). Confidence estimates are also lacking for the classification
steps in pollen DNA metabarcoding, and there is a need for
classification programs with more accurate probabilistic confidence
estimates. While this has been attempted several times, the results of
available methods vary with the gene regions and databases used
(Edgar, 2018).
Another challenge for pollen DNA metabarcoding and related methods is
the development of reference databases. There are an estimated 450,000
angiosperm species (Pimm & Joppa, 2015), and currently, around 25% of
these have publicly available sequences for standard DNA barcodes (Bell,
Petit, et al., 2021). Reference libraries have been compiled for
standard DNA barcodes for all flowering plants in the UK (Jones,
Twyford, et al., 2021) and Canada (Kuzmina et al., 2017). There is
ongoing work in other countries to develop national databases. There are
fewer references available for non-standard DNA barcodes, plastomes,
genome skims, and assembled genomes (Bell, Petit, et al., 2021; Lang et
al., 2019). Several large-scale projects are in progress to sequence DNA
barcodes, organellar genomes and whole genomes for a large proportion of
global biodiversity (Lewin et al., 2018), and therefore, the
availability of reference sequences is continually improving. Additional
problems occur with the quality of publicly available sequences. While
quality control standards for databases such as BOLD are high, databases
such as GenBank rely on the researchers depositing data to conduct their
own quality control and perform only minimal checks, and many erroneous
sequences have been found (Breitwieser, Pertea, Zimin, & Salzberg,
2019). For most studies, it will be necessary to develop a
custom, curated database, including sequencing any species in the study
system that do not already have reliable sequences on public databases.
Filtering and subsetting public reference databases to species of
interest (e.g., the regional species pool) can be a helpful step in
classification to help avoid misclassifications to closely related
species (A. Keller et al., 2020). This step is likely to remain useful,
even as reliable public databases become more complete.
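Subsetting a reference database to a regional species pool can be
sketched as follows. This is a minimal illustration under stated
assumptions: the FASTA header layout ("accession|Genus species"), the
accessions, the sequences, and the function names are all hypothetical,
and real databases use a variety of header formats that would need their
own parsing.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def subset_to_species_pool(records, species_pool):
    """Keep only records whose species name is in the regional pool.

    Assumes headers of the form 'accession|Genus species' (hypothetical).
    """
    pool = {name.lower() for name in species_pool}
    kept = []
    for header, sequence in records:
        species = header.split("|", 1)[-1].strip().lower()
        if species in pool:
            kept.append((header, sequence))
    return kept


# Invented reference entries for illustration.
reference = """\
>ref001|Trifolium repens
ACGTACGT
ACGT
>ref002|Eucalyptus globulus
TTGATTGA
>ref003|Taraxacum officinale
GGCCGGCC
"""
regional_pool = ["Trifolium repens", "Taraxacum officinale"]

subset = subset_to_species_pool(parse_fasta(reference), regional_pool)
print([header for header, _ in subset])
# ['ref001|Trifolium repens', 'ref003|Taraxacum officinale']
```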
The ability to assess global ecological change often relies on the
comparison of contemporary data to historical data. Pollen DNA
metabarcoding methods have only been developed recently, so there are no
equivalent historical datasets. Baseline data from earlier studies based
on other methods may not be directly comparable to pollen DNA
metabarcoding. For example, studies have shown that the networks
assembled through pollen DNA metabarcoding are more connected than those
assembled through observations (Arstingstall et al., 2021; Pornon et
al., 2017), although networks assembled through traditional
identification methods of pollinator-collected pollen may be more like
those assembled through DNA metabarcoding, i.e., more connected (Bosch
et al., 2009). There is potential to obtain equivalent historical data
by analysing historical specimens using DNA sequencing technologies
(Gous et al., 2019; Simanonok et al., 2021). Still, assessments are
needed to see if there are any biases due to degradation over time.
Likewise, pollen in sediments could provide a source of material for
comparison with modern pollen DNA metabarcoding or other DNA-based
methods (Niemeyer et al., 2017). Finally, as we move into the future, it
will be essential to archive specimens under conditions that optimise
preservation so that they can be reanalysed and compared with future
samples.
To date, there has been little standardisation in the methods used for
sequence-based characterisation of pollen communities in terms of
databases, classifiers, gene regions, or other options. There is also
inconsistency in how the methods are described and results reported in
the literature, making comparisons of techniques difficult. While this
allows for greater flexibility and more scope for further method
development, standardisation would facilitate consistent benchmarking of
procedures to improve confidence. Once again, the fields of aquatic eDNA
and microbiomics are more advanced in this regard, with several
recommendations and standardised methods available (Loeza‐Quintana,
Abbott, Heath, Bernatchez, & Hanner, 2020; Yilmaz et al., 2011). There
are many choices on methodology at all stages of a research project,
including: gene region(s), DNA extraction, PCR, sequencing technologies,
and data analysis. Standardisation becomes possible once available
methods have been compared, assessed, and optimised. From this, minimum
sets of standards can be determined for replication, negative and
positive controls, and optimal choices across sampling, sequencing, and
data analysis steps. These standards can then be applied when designing
a project and when assessing papers during peer review. Some methods for
the various steps involved in pollen DNA metabarcoding have been
compared (Swenson & Gemeinholzer, 2021; Tommasi, Ferrari, Labra,
Galimberti, & Biella, 2021); however, much work remains to be done
before any general recommendations can be made.
Future research directions
In addition to the work currently in progress to solve the technical
issues discussed in section 4, we have identified several areas where
method development on pollen DNA metabarcoding and related methods could
open new avenues of research. These include method developments for
intraspecific identification, analysis of DNA from ancient pollen,
increased use of specimens available in museums and herbaria, and
increased use of newer sequencing technologies.
Intraspecific identifications (e.g., populations, individuals) using DNA
sequencing of pollen could open new research areas on the role of
pollinators in gene flow, and the effects of habitat fragmentation on
plant and pollinator health and adaptive potential. Recent developments
with eDNA suggest that this is plausible for pollen. For example, eDNA
has been used to investigate intraspecific variation in the
mitochondrial control region of whale sharks in sea water (Sigsgaard et
al., 2016), cytb variation in harbour porpoises (Parsons, Everett,
Dahlheim, & Park, 2018), and microsatellite allele frequencies in
artificial mesocosms of round gobies (Neogobius melanostomus)
(Andres, Sethi, Lodge, & Andres, 2021). Genomic DNA from Late
Pleistocene bears was retrieved from shotgun data, showing that it might
be possible to use ancient eDNA for intraspecific analysis (Pedersen et
al., 2021). Applying target capture of mitochondrial and nuclear DNA in
eDNA is also a promising avenue of research (Jensen et al., 2021). All
the previous investigations into intraspecific variation from eDNA have
been conducted in aquatic study systems and cave sediments and, as far
as we know, none have been attempted on plant communities or pollen.
Pollen mixtures might be the ideal candidate system to investigate
intraspecific variation. Because each pollen grain in a mixture
represents an individual gametophyte, any method focusing on the genetic
content of a single pollen grain can be used to investigate the mixture
at the population level. While population-level inferences might
be possible without sequencing single pollen grains, many population
genomics tools require sequence data of individuals. To overcome
problems with single pollen grain sequencing, such as low template
abundance and uncertainties about whether DNA is coming from inside the
pollen or fragments sticking to the outside, it might be beneficial to
germinate the pollen to make accessing the DNA easier (Jayaprakash,
2018). Current advances in the read output and length in HTS platforms
coupled with the potential to sequence individual pollen grains might
advance molecular pollen analysis beyond species identification. As soon
as the DNA inside individual pollen grains can be accessed, amplified,
and sequenced, a whole new set of research questions can be addressed.
For example, population genomics could be applied not only to count the
number of species, but also to quantify how many individual plants a
pollinator has visited.
The high level of preservation of pollen morphology in ancient
sedimentary records means that pollen is often used to understand past
ecosystems. Coupling these methods with ancient DNA technology could
provide narrower taxonomic identification and a more fine-scale
understanding of past ecosystems. Although it has been demonstrated that
DNA is present in ancient pollen and can be sequenced (Bennett &
Parducci, 2006; L. Parducci et al., 2005), ancient pollen samples are
difficult to process and there is a high risk of contamination with
exogenous DNA. There are many different approaches for isolating and
cleaning single pollen grains from the abundant pollen usually present
in sediments. These include hand pipetting under a microscope, serial
dilution, flow-assisted cell sorting (flow cytometry), microfluidic
manipulation (Wang & Navin, 2015), and flow sorting or
micro-manipulation (Kron & Husband, 2012). Potential method development
in this area could
focus on improved efficiency and contamination control. Once these
issues are addressed, there is great promise for ancient pollen analysis
to reveal ecosystem change over large time scales.
Likewise, there is currently an unrealised potential to use herbarium
and museum specimens to answer questions in global change ecology.
Currently, a handful of studies have shown that preserved specimens can
be used to study plant-pollinator interactions. Methods have been
developed for identifying historical pollen samples from bees using
metabarcoding (Gous et al., 2019). Museum specimens can be used to
assess resource use over huge temporal scales. For example, Simanonok et
al. (2021) identified resource use by Bombus affinis over 100 years, and
found that neither pollen richness and diversity nor the abundance of
native and introduced species had changed over time. This work helps to
clarify the drivers behind the decline of native pollinator species.
Historical specimens can be used not only for temporal analysis of
forage use but also to increase sample size for spatial analysis. For
example, Gous, Eardley, Johnson, Swanevelder, and Willows-Munro (2021)
sampled pollen from museum specimens of multiple species from varying
regions in South Africa across a large temporal scale to assess the
relationship between resource use and geographic range of species. There
is similar potential to use herbarium specimens to look
at past distributions and morphology of plants to check for
plant-pollinator mismatches.
Finally, future advances in sequencing technology could influence how
pollen DNA metabarcoding and related methods are used for studying
ecological change. New fast, portable sequencing technologies, such as
Oxford Nanopore Technologies’ MinION, could allow for analysis while in
the field. Increased throughput of short-read sequencing technologies
(e.g., Illumina NovaSeq) could make high-throughput analysis of pollen
samples faster and more cost-effective. Improvements in accuracy of
PacBio HiFi long-read sequencing could make it suitable for DNA
metabarcoding and related methods. However, many of these sequencing
platforms currently require a very high quality and quantity of DNA. As
sequencing technologies improve, they may require lower quantities of
DNA, making them increasingly useful for pollen DNA analysis, especially
of pollen from small insects, single pollen grains, or ancient pollen.
Metagenomics methods depend on reference databases of whole genomes or
whole plastid genomes. However, new initiatives, such as the Earth
BioGenome Project, which aims to sequence the genomes of most eukaryote
species within a decade (Lewin et al., 2018), have the potential to
alleviate this limitation in the future. In addition, there are
metagenomics methods based on long-read sequencing that could improve
the resolution and quantification of pollen identification, without the
need for whole genome reference databases. For example, the RevMet
(reverse metagenomics) method of Peel et al. (2019) uses long-read
nanopore sequences of samples, while the reference database contains low
coverage short-read genome skims. This strategy could potentially
increase resolution and quantification relative to DNA metabarcoding and
could have cost advantages over other metagenomics methods as it does
not require assembled genomes for the reference database. An alternative
approach is provided by reduced-representation sequencing for
metagenomics. For example, the sequencing of restriction fragments
(ddRAD) has been used to identify plant community composition from roots
in soil samples (Wagemaker et al., 2021), and the same methods could
easily be applied to pollen mixtures. Given the rapid advancements in
sequencing technology and the increasing availability of reference
sequences and genomes, method development for DNA-based identification
of pollen should be considered a work in progress; each new study will
need to carefully consider the strengths and weaknesses of the methods
available or likely to become available during the study.
Conclusions
In recent years, there have been increasing numbers of studies applying
pollen DNA metabarcoding and related methods to research questions on
global change ecology (Fig. 1). Pollen DNA metabarcoding has provided
advantages over other methods in throughput and in requiring less
taxonomic expertise. The potential applications of pollen DNA
metabarcoding and
related methods are likely to increase as reference databases improve,
methods are assessed against traditional approaches and standardised,
and as multi-year datasets are accumulated.
Currently, pollen DNA metabarcoding is complementary to traditional
methods, such as microscopic identification of pollen and direct
observation of plant-pollinator interactions. This is particularly the
case where baseline data exists and has been collected through
traditional methods. However, in some cases, preserved samples such as
museum specimens of pollinators (Gous et al., 2021; Simanonok et al.,
2021) and slides of pollen collections (Marcel Polling et al., 2022)
have been analysed through DNA metabarcoding, providing scope to analyse
contemporary and historical samples together and to assess changes from
baseline ecological conditions. As more long-term studies are completed
using DNA metabarcoding, these methods could be used to understand
recent change. Under these scenarios, pollen DNA metabarcoding and
related methods could eventually become predominant, especially given
their high throughput for large sample sizes and their potential for
combination with other newer technologies, such as flow cytometry (Kron
et al., 2021) and machine-learning classifications (Gonçalves et al.,
2016).
Global ecological change is happening rapidly, and high-throughput
methods are essential for getting timely data on changes so that
management practices can be assessed and changed as required. Pollen DNA
metabarcoding and related methods are important tools for rapid,
high-throughput assessment of ecosystem change, and can inform timely
management recommendations to preserve biodiversity and the evolutionary
and ecological processes that support it before it is too late.