Title: Uniting genetic and geographic databases to understand the
relationship between latitude and population demography
Running Title: Aggregating geography and genetic databases
Frank T. Burbrink
Department of Herpetology
American Museum of Natural History
Central Park West at 79th Street
New York, NY 10024-5192
fburbrink@amnh.org
Keywords: phylogatR, phylogeography, Tajima’s D, latitude
Abstract:
Conducting large-scale phylogeographic studies to understand processes
affecting population structure and genetic diversity across multiple
species is difficult because the key genetic (NCBI) and spatial (GBIF)
repositories are disconnected. In this issue of Molecular Ecology
Resources, Pelletier et al. (2022) demonstrate the power of connecting
these in the program phylogatR. This program assembled 87,852
species and 102,268 sequence alignments in a taxonomic hierarchy,
yielding multiple sequence alignments per species, mainly for animals
(88%), composed mostly of mtDNA data. The authors discuss several
caveats with these alignments and provide flags identifying particular
problems associating locality and genetic data with certain taxa (e.g.,
multiple localities per individual). They test the prediction that
nucleotide diversity should increase with area, but find a significant
relationship in only 32% of taxa, with no clear taxonomic or ecological
factors accounting for this. To examine the potential of this program, I
tested the idea that the degree of population expansion should increase
with latitude given potential environmental stability in the tropics and
instability in temperate regions. In under two hours, I downloaded data
for all squamates (lizards and snakes) and regressed Tajima’s D on
latitude and found a weak but significant negative relationship,
indicating a potential association between latitude and population
expansion. The phylogatR database is a powerful resource for
researchers wanting to test the relationship between genetic diversity
and some aspect of space or environment.
It was not known that phylogenetic analysis of population genetic data
would show geographic structure when
Avise et al. (1979)
introduced the field of phylogeography, so named eight years later
(Avise et al., 1987). Since
then, the field has grown from using single gene fragments to whole
genomes, resulting in more than 22,000 publications. This seemingly
simple relationship between geography and genetic variation has provided
the foundation for studying speciation, species delimitation, hybrid
zone dynamics, adaptation, conservation genetics, community assembly,
historical demography, and climate change response, to name a few
(Burbrink et al., 2016; Burbrink & Ruane, 2021; Carnaval et
al., 2009; Dapporto et al., 2009; Dufresnes et al., 2020; Hewitt, 2001;
Overcast et al., 2019; Rissler & Smith, 2010; Satler & Carstens, 2017;
Shaffer et al., 2004; Smith et al., 2011; Soltis et al., 2006).
As with many burgeoning fields, there is often little consideration of
how to make datasets accessible for future researchers addressing more
comprehensive questions under a common framework. For example, it is
common to examine how shared environments or barriers affect population
structure across communities of species, or test if range size or
latitude are correlated with genetic diversity across taxa
(Hickerson et al.,
2006; Myers et al., 2019; Smith et al., 2017). However, addressing
these types of questions using existing data requires researchers to
assemble large databases manually. The genetic and geographic databases
that store this information, such as NCBI GenBank (National Center for
Biotechnology Information) and GBIF (Global Biodiversity Information
Facility), are disconnected and often of limited general use for
conducting multitaxon studies. In this issue of Molecular Ecology
Resources, Pelletier et al.
(2022) have automated the process of connecting geography to DNA
sequences via the phylogatR (phylogeographic data aggregation and
repurposing) database.
The phylogatR database has assembled 102,268 sequence alignments and
associated spatial data for 87,852 species. The database represents
mostly animals (88%), distantly followed by plants (9%), with records
drawn from NCBI GenBank, BOLD (Barcode of Life Database), and GBIF. The
database is automatically updated monthly with new entries. The alignments produced
by phylogatR are generated by MAFFT v7
(Katoh & Standley, 2013),
checked for alignment and gap issues, and are ready to use for analyses.
To note, the authors have developed a system for flagging sequences with
potential problems, such as multiple unique geographic coordinates
referenced to a single sequenced sample or changes in taxonomy.
Pelletier et al. (2022) have provided several tutorials that explain how
to use the database, which should be useful for teachers conducting
workshops.
The database now has 2.6 million records representing 1988 genes, with
species averaging only 1.2 genes and 25.8 sequences per alignment. As
expected, most of the sequence alignments here are represented by
mitochondrial and chloroplast DNA. To provide a test of the data
collected in phylogatR, Pelletier et al. (2022) ask if range size
predicts nucleotide diversity (π), an old but important question in
population genetics (Wright,
1943), but now using 80,000 species and over 2 million sequences.
Nucleotide diversity was estimated in pegas (Paradis, 2010) and was
regressed against geographic area calculated from the associated
georeferenced data. The authors discovered only 58 geographic outliers
for taxa with large ranges and 23 π outliers, mostly due to mixed-gene
alignments or individuals missing overlapping sequences. Interestingly,
a majority of groups (68%) showed no significant relationship between
area and nucleotide diversity. There seemed to be no taxonomic or
general ecological trend among those groups showing a significant
relationship. Of course, many other factors might contribute to genetic
diversity, and this bears further exploration, as the authors suggest.
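To make the quantity being regressed concrete, the following is a minimal Python sketch of nucleotide diversity (π) computed as the mean proportion of differing sites over all pairs of aligned sequences. It is not the pegas implementation used by Pelletier et al. (2022). The function names, the choice to skip sites where either sequence lacks an unambiguous base, and the toy alignment are all assumptions made for illustration.

```python
from itertools import combinations

VALID = set("ACGT")  # assumption: any other character counts as missing


def pairwise_diff(a, b):
    """Proportion of differing sites between two aligned sequences,
    using only sites where both carry an unambiguous base."""
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in VALID and y in VALID:
            compared += 1
            if x != y:
                diffs += 1
    return diffs / compared if compared else None


def nucleotide_diversity(seqs):
    """Mean pairwise proportion of differences across all sequence pairs.
    Pairs with no comparable sites are ignored."""
    vals = []
    for a, b in combinations(seqs, 2):
        d = pairwise_diff(a, b)
        if d is not None:
            vals.append(d)
    return sum(vals) / len(vals) if vals else float("nan")


# Hypothetical toy alignment (not real data)
toy = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC", "ACGTACG-AC"]
pi = nucleotide_diversity(toy)
print(round(pi, 4))
```

In a real analysis, one π value per species alignment would then be paired with the area spanned by that species' georeferenced records before fitting the regression.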
To test drive phylogatR, I conducted a study on squamates
(lizards and snakes) to address a key question in evolutionary ecology:
are tropical regions more stable than temperate regions and does this
affect biodiversity
(Dobzhansky, 1950)?
If tropical taxa have been more stable than temperate taxa, which have
experienced greater environmental fluctuation over time, then species
should show evidence of greater demographic expansion at higher latitudes
(Lessa et al., 2003;
Whorley et al., 2004). I estimated Tajima’s D for each species
and regressed these values against latitude using R (R Core Team, 2020).
Negative values of D suggest population expansion from a
bottleneck or a selective sweep, whereas values close to zero indicate
neutrality (Stajich &
Hahn, 2005; Tajima, 1989). I also examined the relationship between
area and π. Because a majority of species only have mtDNA, I kept the
longest fragment of mtDNA with the most individuals per species. This
yielded a dataset of 418 species with an average of 13.59
individuals/species. I found a significant relationship between Tajima’s
D and latitude (P = 0.01), though the effect size was
small (r² = 0.012; Fig. 1). Because low sampling
can affect correct estimation of Tajima’s D, I filtered the
dataset to only include taxa with > 10 individuals (n =
143); this also generated a significant relationship with a weak effect
(P = 0.037; r² = 0.023). The prediction that
populations may be more stable in the tropics and that the magnitude of
population expansion increases with latitude holds. However, the effect
is weak; negative D might also be associated with selective
sweeps, and relying on a single locus may not provide the strongest
inference of population-level processes
(Burbrink & Ruane, 2021).
Interestingly, the mean value of D across all squamates was close
to zero (-0.67), suggesting a strong role for neutrality. Similar to
Pelletier et al. (2022), I found no relationship between area and π
(P = 0.19 – 0.36), though the geographic extent here for some
taxa may be inaccurate. This study was completed in ~2
hours on a MacBook Pro.
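The analysis above was run in R. As an illustration only, the two steps (estimating Tajima's D per alignment, then regressing D on latitude) can be sketched in plain Python. The formula follows Tajima (1989). The function names, the decision to drop alignment columns containing gaps or ambiguous bases, the cutoff of four sequences, and the toy data are assumptions of this sketch, and the source does not state how latitude was summarized per species.

```python
import math


def tajimas_d(seqs):
    """Tajima's D (Tajima, 1989) for equal-length aligned sequences.

    Assumption: columns containing anything other than A/C/G/T are dropped.
    Returns None when there are fewer than 4 sequences (a conservative
    cutoff chosen for this sketch) or no segregating sites.
    """
    n = len(seqs)
    cols = [c for c in zip(*(s.upper() for s in seqs)) if set(c) <= set("ACGT")]
    seg = [c for c in cols if len(set(c)) > 1]  # segregating columns
    S = len(seg)
    if n < 4 or S == 0:
        return None
    n_pairs = n * (n - 1) / 2
    # Mean number of pairwise differences, accumulated column by column:
    # differing pairs = all pairs minus pairs sharing the same base.
    k = sum(
        n_pairs - sum(c.count(b) * (c.count(b) - 1) / 2 for b in set(c))
        for c in seg
    ) / n_pairs
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (k - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))


def linear_fit(x, y):
    """Ordinary least squares of y on x; returns (slope, intercept, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy**2 / (sxx * syy)


# Hypothetical toy alignment with three singleton mutations (not real data)
toy = ["AAAAA", "TAAAA", "ATAAA", "AATAA"]
d = tajimas_d(toy)
print(d)  # negative: an excess of rare variants
```

With one D value and one latitude per species, `linear_fit(latitudes, d_values)` would return the slope and r² of the kind reported above; a significance test for the slope is omitted from this sketch.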
Because phylogatR is a big-data aggregator, it is likely that more
fine-scale problems with individual species alignments and georeferenced
data are present. In the squamate study, I used an outlier detector
(Grubbs' test) for π and D and found two species (0.4%) in which
individuals were either missing 95% of their data or lacked overlap among
most gene fragments. These kinds of problems could be identified prior to
analyses with simple scripts that detect missing data beyond some
user-specified threshold or sequence mismatch. I found that geographic
outliers were caused by taxa that had been introduced well beyond their
natural range (e.g., Hemidactylus geckos). This requires the end user to know
about the natural history of their target study organisms and assess if
this is a problem for their particular study design.
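A screening script of the kind suggested above might look like the following Python sketch. It flags individuals whose proportion of missing sites exceeds a user-chosen threshold and reports pairs of individuals that share too few comparable sites. This is not part of phylogatR; the function names, default thresholds, and toy alignment are hypothetical.

```python
from itertools import combinations

VALID = set("ACGT")  # assumption: gaps, N, and ambiguity codes count as missing


def missing_fraction(seq):
    """Proportion of sites in an aligned sequence that are not A/C/G/T."""
    if not seq:
        return 1.0
    return sum(1 for ch in seq.upper() if ch not in VALID) / len(seq)


def flag_sparse(alignment, threshold=0.5):
    """Names of individuals whose missing fraction exceeds the threshold.
    `alignment` maps individual name -> aligned sequence."""
    return [name for name, seq in alignment.items()
            if missing_fraction(seq) > threshold]


def low_overlap_pairs(alignment, min_overlap=100):
    """Pairs of individuals sharing fewer than `min_overlap` sites at which
    both have an unambiguous base (e.g., non-overlapping gene fragments)."""
    flagged = []
    for (n1, s1), (n2, s2) in combinations(alignment.items(), 2):
        shared = sum(1 for x, y in zip(s1.upper(), s2.upper())
                     if x in VALID and y in VALID)
        if shared < min_overlap:
            flagged.append((n1, n2))
    return flagged


# Hypothetical toy alignment (not real data)
toy = {
    "ind1": "ACGTACGTAC",
    "ind2": "ACGT------",
    "ind3": "NNNNNNNNAC",
    "ind4": "------GTAC",
}
sparse = flag_sparse(toy, threshold=0.7)
pairs = low_overlap_pairs(toy, min_overlap=3)
print(sparse)
print(pairs)
```

Flagged individuals or alignments could then be inspected or dropped before estimating π and D, rather than being discovered afterwards as statistical outliers.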
The program phylogatR represents a major leap forward for
aggregating the phylogeographic datasets that have accumulated since
Avise et al. (1979). The database is easy to use, is a major time saver,
and the caveats are clear. I envision some future version of this that
scrapes genome-scale data now also accumulating at a massive rate. In
the meantime, the current version can facilitate the next generation of
comparative ecological and community level analyses of phylogeographic
patterns and processes.