Materials and methods
Global sample collection
Due in part to its extensive distribution and phenotypic diversity, Plantago major has a complex taxonomic history, with over 50 lower taxonomic divisions and synonyms having been proposed (Pilger, 1937; Mølgaard, 1976; Peruzzi & Passalacqua, 2003; POWO, 2019). At least three subspecies are recognized in Europe, two of which, P. major subsp. major and P. major subsp. intermedia (Gilib.) Lange, have been widely studied and even recognized as separate species (P. major and P. intermedia DC.; Morgan-Richards & Wolff, 1999; POWO, 2019). The full extent of the geographic ranges
occupied by each subspecies, and overlap between them, is poorly known,
particularly outside of Northern Europe. However, we focused our sampling on P. major subsp. major, which is thought to be more widely distributed and tolerant of a broader range of environmental variation and anthropogenic disturbance (Mølgaard, 1976). Plantago major subsp. major can be distinguished from P. major subsp. intermedia based on the number of seeds per capsule
(Mølgaard, 1976; van Dijk, 1984; Wolff, 1991). Only specimens fitting
these characters were selected for our study.
Fifty populations of naturally occurring Plantago major subsp. major plants were sampled worldwide, including 22
populations from across the native range (Europe, Asia) and 28
populations from the introduced range (North and South America, Iceland,
Greenland, Hawaiʻi, Australia, New Zealand, and Africa) (Figure 2; Table 1). Six to ten individuals were sampled from each
population, amounting to 385 individuals in total. A herbarium voucher
was sampled from each population to confirm species identity and
deposited in Herbarium C at the Natural History Museum of Denmark,
University of Copenhagen, in Copenhagen, Denmark. Herbarium vouchers
were not obtainable for Peru; however, digital photographs were taken instead.
DNA extractions
Leaf tissue was pulverized with ceramic beads in a Qiagen TissueLyser®
(Qiagen, Germany), and DNA extractions were prepared using a modified Qiagen DNeasy® Plant Mini Kit (Qiagen, Germany) or a CTAB protocol (conducted by ADNid, Montpellier, France). DNeasy Plant Mini Kit
extractions were modified by adding 50 μL of proteinase K to the cell
lysis solution after 10 min incubation at 65 °C, followed by 2 hours of
incubation at 45 °C. Two extractions were made for each sample and then pooled to ensure that the quantity and quality of genomic DNA were sufficient for genotyping-by-sequencing (GBS).
Sequencing
In total, DNA extracts from 385 individuals were sequenced
(Table 1). GBS library preparation was performed at the Genomic
Diversity Facility at Cornell University, following Elshire et al.
(2011). DNA from each sample was digested using the restriction enzyme
PstI (CTGCA^G), and both a sample-specific barcoded adapter and a
common adapter were ligated to the sticky ends of fragments. Samples
were pooled, and fragment libraries were cleaned using a QIAquick PCR purification kit (Qiagen, USA). Libraries were amplified using an
18-cycle PCR with long primers complementary to the barcoded and common
adapters, purified again using QIAquick, and quantified using a
PicoGreen assay (Molecular Probes, Carlsbad, CA, USA). Samples were distributed across seven 96-well plates, and one blank was included per plate. Plates were run on lanes of an Illumina HiSeq 2000 (100-bp single-end reads) at the Cornell Core Laboratories Center (Cornell University, New York, USA).
Assembly of pseudo-reference genome (catalogue)
stacks, a software pipeline designed for restriction enzyme-based data from organisms without a reference genome, was used to
generate a catalogue, or pseudo-reference genome. The sequencing reads for 392 samples (385 samples plus one blank from each of the seven plates) were demultiplexed using the process_radtags
function in stacks v.1.45 (Catchen et al., 2013). The
demultiplexing process was run in an iterative manner, with the
discarded reads at each step being fed as input to the next
process_radtags step with a smaller barcode size, to retain the maximum
number of reads per sample. During demultiplexing, we rescued barcodes (-r option) and wrote reads that did not match any barcode of the given length to a separate file. The --adapter_1 parameter was used to specify the common adapter sequence, the maximum adapter mismatch (--adapter_mm) was set to 2, and we required a perfect match on the barcodes.
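The iterative demultiplexing strategy can be sketched as follows. This is a simplified illustration, not the process_radtags implementation: reads are matched against the longest barcodes first, and anything unassigned is fed into the next round with the next-shorter barcode length. The reads, sample names, and barcodes here are hypothetical.

```python
def demultiplex_iteratively(reads, barcodes_by_sample):
    """Assign each read to a sample by exact barcode prefix match,
    trying barcode lengths from longest to shortest."""
    assigned = {sample: [] for sample in barcodes_by_sample}
    unmatched = list(reads)
    # Process barcode lengths in decreasing order, as in the pipeline.
    lengths = sorted({len(b) for b in barcodes_by_sample.values()}, reverse=True)
    for length in lengths:
        round_barcodes = {b: s for s, b in barcodes_by_sample.items() if len(b) == length}
        still_unmatched = []
        for read in unmatched:
            sample = round_barcodes.get(read[:length])
            if sample is not None:
                assigned[sample].append(read[length:])  # strip the barcode
            else:
                still_unmatched.append(read)            # input to the next round
        unmatched = still_unmatched
    return assigned, unmatched

# Toy data: two samples with barcodes of different lengths.
reads = ["ACGTAGGTTT", "ACGTTTTTTT", "TTACGGGGGG", "GGGGGGGGGG"]
barcodes = {"sampleA": "ACGTA", "sampleB": "TTAC"}
groups, leftover = demultiplex_iteratively(reads, barcodes)
```

Reads left in `leftover` after the final round correspond to the separate file of unmatched reads described above.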
We selected the 20 samples with the highest read counts, while maximizing the number of sampling locations represented, to build a reference catalogue using cstacks in stacks v.1.45 (see Table 1). These samples were processed using ustacks, with the deleveraging algorithm enabled (-d option), while retaining unused reads (-r option) and requiring a minimum of three reads to build a stack (-m 3). The resulting output was used to run cstacks, allowing up to four mismatches between loci (-n 4) when building the catalogue. We also allowed gapped alignments between loci, with a maximum gap length of four (--gapped, --max_gaps 4). This results in a minimum Hamming
distance of 4 between any pair of loci in the reference catalogue. From
the loci identified in the reference catalogue, we created a
pseudo-reference genome by collapsing these loci into 10 chromosomes,
with a string of 150 Ns inserted between consecutive loci.
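The collapsing of catalogue loci into a pseudo-reference genome can be sketched as follows. The 150-N spacer and 10-chromosome layout follow the text; the locus sequences and their even split across chromosomes are illustrative assumptions.

```python
SPACER = "N" * 150  # spacer between consecutive loci, per the text
N_CHROMS = 10       # number of pseudo-chromosomes, per the text

def build_pseudo_reference(loci, n_chroms=N_CHROMS, spacer=SPACER):
    """Return a dict of chromosome name -> concatenated locus sequences."""
    chroms = {}
    per_chrom = -(-len(loci) // n_chroms)  # ceiling division
    for i in range(n_chroms):
        chunk = loci[i * per_chrom:(i + 1) * per_chrom]
        if chunk:
            chroms[f"chrom{i + 1}"] = spacer.join(chunk)
    return chroms

# Toy catalogue: 95 identical 100-bp loci.
loci = ["ACGT" * 25 for _ in range(95)]
ref = build_pseudo_reference(loci)
```

The N spacers prevent reads from aligning across the artificial junctions between loci.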
Mapping to the pseudo-reference genome (catalogue)
Raw reads were demultiplexed using the Demultiplexer module in gbsx v.1.3 (Herten et al., 2015), allowing one mismatch in the enzyme recognition site and one mismatch in the barcode (-me 1 and -mb 1).
Demultiplexing with gbsx retained a larger fraction of reads
than stacks, which in turn allowed us to obtain a larger set of
variants for downstream analyses. Therefore, the demultiplexed reads
obtained from gbsx were used to map reads to the catalogue
assembled in cstacks. The mapping was performed using the
paleomix pipeline, which was run with default parameters
(Schubert et al., 2014). As part of the paleomix pipeline, the
reads were first filtered for adapter sequences using
adapterremoval2 v. 2.2.0 (Schubert et al., 2016), and these
trimmed sequences were mapped against the pseudo-reference genome using
the backtrack algorithm in bwa v. 0.7.15 (Li & Durbin, 2009).
The blanks from each of the seven plates were also included, to confirm that they did not map to the pseudo-reference genome. Blanks were then excluded from our downstream analyses.
SNP calling and genotype likelihoods
The alignments generated by bwa were used to identify SNPs in
our samples. Because GBS data inherently have very high variance in locus coverage and in depth of coverage among and within loci, there is high uncertainty associated with calling genotypes at variant sites. To account for this uncertainty in downstream analyses, genotype likelihood-based methods were used (Nielsen et al., 2013). The SNP
locations and genotype likelihoods for the samples at these locations
were computed using angsd v. 0.921 (Korneliussen et al., 2014).
A total of 385 samples were used to identify variant positions (we
excluded blanks in all further analyses). As a pre-processing step, we
examined the distribution of depth of coverage for a randomly chosen subset of 100 samples to determine the cut-off for the maximum number of reads that may cover a single position. We discarded reads that mapped to multiple positions, as well as low-quality bases and reads (quality score less than 20). Using the distribution of depths obtained from angsd, we chose a maximum average depth cut-off of 70×
coverage per sample. Variant discovery and computation of genotype
likelihoods was performed using angsd with the same parameters
as above, but additionally with a depth cut-off of 70 per sample (-maxDepth NumberOfSamples*70), a requirement that at least 50% of samples have at least one read covering the site (-minInd NumberOfSamples*0.5), a minimum SNP p-value of 10⁻⁶ (-SNP_pval 1e-6), and a penalty for poorly aligning reads with many mismatches against the reference locus (-C 50). The genotype likelihoods were calculated using the model
described in samtools (-GL 1; v. 1.4; Li, 2011).
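The intuition behind genotype likelihoods can be sketched with the basic independent-errors model for a diploid site. This is a simplification for illustration, not the full samtools (-GL 1) model, which additionally models error dependencies among reads; the bases and quality scores are made up.

```python
import math

def genotype_likelihood(bases, quals, allele1, allele2):
    """log10-likelihood of observing `bases` given genotype (allele1, allele2)."""
    loglik = 0.0
    for base, q in zip(bases, quals):
        err = 10 ** (-q / 10)  # Phred quality score -> error probability
        # P(base | allele): correct call with probability 1 - err, otherwise
        # one of the three alternative bases with probability err / 3.
        def p_base(allele):
            return 1 - err if base == allele else err / 3
        # Each chromosome copy of a diploid is sampled with probability 1/2.
        loglik += math.log10(0.5 * p_base(allele1) + 0.5 * p_base(allele2))
    return loglik

# Four reads covering one site: three A's and one G, all at Q30.
bases = ["A", "A", "G", "A"]
quals = [30, 30, 30, 30]
ll_AA = genotype_likelihood(bases, quals, "A", "A")
ll_AG = genotype_likelihood(bases, quals, "A", "G")
ll_GG = genotype_likelihood(bases, quals, "G", "G")
```

At this shallow depth no single genotype dominates decisively, which is exactly the uncertainty that likelihood-based methods carry into downstream analyses instead of committing to a hard call.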
Phylogenomic inference
Pairwise genetic distances between individuals were examined using
ngsdist (Vieira et al., 2015), a distance-based method that accounts for genotype uncertainty by using the genotype posterior probabilities estimated in angsd. Pairwise genetic distances
among the global samples were computed using SNPs that had a minor
allele frequency greater than 10% in the sample, resulting in 2807
total sites. In addition, for each pair of samples, all sites that did
not have any reads covering them in either of the samples were
discarded. Pairwise distances were visualized as multidimensional scaling (MDS) plots using the cmdscale function in base R (v. 3.6.2) to compute the first six coordinates.
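R's cmdscale implements classical (Torgerson) MDS: the squared distance matrix is double-centred and the top eigenvectors, scaled by the square roots of their eigenvalues, give the coordinates. A minimal sketch, with a toy distance matrix in place of the real pairwise genetic distances:

```python
import numpy as np

def cmdscale(D, k=2):
    """Classical MDS of a symmetric distance matrix D into k coordinates."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]        # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Clamp tiny negative eigenvalues (numerical noise) to zero.
    return eigvecs[:, :k] * np.sqrt(np.maximum(eigvals[:k], 0))

# Toy example: three points on a line at positions 0, 3 and 5.
D = np.array([[0.0, 3.0, 5.0],
              [3.0, 0.0, 2.0],
              [5.0, 2.0, 0.0]])
coords = cmdscale(D, k=1)
```

For the real data, the first six coordinates (k = 6) would be retained, as in the text.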
Admixture analysis
To characterize population structure and identify patterns of admixture among our samples, an individual-based assignment test was performed using ngsadmix (Skotte et al., 2013; Fumagalli et al., 2014)
using 7594 SNPs. ngsadmix is a maximum likelihood method that
uses the genotype likelihood data obtained from angsd. Our
analyses were run with the number of ancestral populations, K, set from 2 to 12. For K = 2 to K = 8, analyses were replicated 200 times, and for K = 9 to K = 12, analyses were replicated 500 times to ensure convergence to the global maximum. The replicate with the highest log-likelihood was chosen for each K. A test statistic for determining the optimal K value is not available in ngsadmix; therefore, results for each value of K were reviewed and compared with the MDS plots to select the K value with the most biological relevance given the genomics of the species being studied.
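Selecting the best run for each K reduces to taking, per K, the replicate with the highest log-likelihood. A minimal sketch, with made-up log-likelihoods standing in for values parsed from the ngsadmix log files:

```python
def best_replicates(loglik_by_k):
    """Map each K to the index of its highest-log-likelihood replicate."""
    return {k: max(range(len(lls)), key=lambda i: lls[i])
            for k, lls in loglik_by_k.items()}

# Hypothetical replicate log-likelihoods for two values of K.
logliks = {
    2: [-10523.1, -10498.7, -10502.3],
    3: [-10311.9, -10315.2, -10309.4],
}
best = best_replicates(logliks)
```

Repeating runs and keeping the maximum guards against replicates that converge to a local optimum.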
Inference of population splits and migration events
treemix v. 1.13 (Pickrell & Pritchard, 2012) was used to assess how well the relationships among populations can be represented by a bifurcating tree built from population allele frequencies. An in-house python script was written to convert the Beagle files generated by angsd into treemix input files (see Supplementary Materials). In addition to admixture, treemix also provides some information about the directionality of gene flow (Patterson et al., 2012). Trees were rooted (-root) using populations from Yukon, Japan, and South Korea (based on output from the admixture analyses suggesting that these populations are outgroups to the others), and we estimated the likelihood of graphs with 0–4 migration events added in order to visualize the proportion and direction of gene flow among the 50 sampled populations.
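The in-house conversion script is not reproduced here; the following is a hypothetical sketch of the general idea only. A Beagle file holds three genotype likelihoods per individual per SNP, while treemix expects, per SNP, a "refCount,altCount" pair per population. In this simplified version each individual contributes its most-likely genotype to its population's allele counts; the likelihoods and population assignments are made up.

```python
def beagle_row_to_treemix(genotype_likelihoods, pop_of_individual, n_pops):
    """genotype_likelihoods: per individual, (L_AA, L_Aa, L_aa) for one SNP.
    Returns one treemix data line: per-population 'ref,alt' allele counts."""
    counts = [[0, 0] for _ in range(n_pops)]
    for ind, (l_aa, l_ab, l_bb) in enumerate(genotype_likelihoods):
        # Most-likely genotype -> number of alternate alleles (0, 1, or 2).
        n_alt = max(range(3), key=lambda g: (l_aa, l_ab, l_bb)[g])
        pop = pop_of_individual[ind]
        counts[pop][0] += 2 - n_alt   # reference alleles
        counts[pop][1] += n_alt       # alternate alleles
    return " ".join(f"{r},{a}" for r, a in counts)

# Three individuals at one SNP: two in population 0, one in population 1.
gls = [(0.9, 0.1, 0.0), (0.1, 0.8, 0.1), (0.0, 0.1, 0.9)]
pops = [0, 0, 1]
line = beagle_row_to_treemix(gls, pops, n_pops=2)
```

A likelihood-weighted version (summing expected allele counts rather than hard calls) would better match the genotype-likelihood philosophy used elsewhere in the pipeline.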