Material and methods
Study populations and sampling
procedure
A total of 234 pike from 11 populations (N = 8 – 27) were
sampled for this study. The locations of the populations ranged from
63.6 – 54.9 °N, and represented all three spawning ecotypes
(freshwater, anadromous, and resident; for details see Figure 1and Table 1 ). Our initial plan was to include both anadromous
and resident populations throughout the latitudinal range. However, due
to the aforementioned decreases of pike in the Baltic Sea, many of the
resident populations are no longer found in their former spawning
locations, and we were therefore not able to include more resident
populations. Individuals were captured using fyke nets, rod-and-reel
fishing, or electrofishing, and a non-lethal DNA sample (fin clip) was
collected for each individual before they were released back into the
water. The fin clips were placed in separate 1.5 mL Eppendorf tubes with
70 % ethanol, which were stored in a freezer (-20 °C) until the
molecular work was conducted.
Molecular workflow
DNA was extracted from fin tissue with DNeasy blood and tissue kit
(Qiagen, USA), and digested with HF EcoRI (New England Biolabs, USA).
Size-selection, library preparation, sequencing (using either Illumina
NovaSeq 6000 2 x 150 bp or Illumina HiSeq 2500 2 x 125 bp, Table
1 ), demultiplexing and quality control with MultiQC
(Ewels, Magnusson, Lundin, & Käller,
2016) were performed by Science for Life Laboratory (Stockholm,
Sweden). Sequence data for Harfjärden, Okne and Lervik was previously
published (Sunde et al., 2020a) and
retrieved from the NCBI Sequence Read Archive
(Sunde, Yildirim, Tibblin, & Forsman,
2020b). In total this yielded a dataset of ~ 3,200 M
raw reads (mean 13.6 M per sample). Quality filtering of the raw reads
was conducted using Trimmomatic (Bolger,
Lohse, & Usadel, 2014) and process_radtags in the Stacks pipeline
(Catchen, Amores, Hohenlohe, Cresko, &
Postlethwait, 2011; Catchen, Hohenlohe,
Bassham, Amores, & Cresko, 2013). The ~ 3,000 M reads
(95 %) that passed the quality filtering, were further processed with
the integrated approach (Paris, Stevens,
& Catchen, 2017) of the Stacks de novo pipeline, using
parameter settings determined by an initial parameter optimization
(Figure S1-S2 ). This yielded a final dataset of 5993 biallelic
SNPs, for which missing data was imputed using Beagle
(Browning, Zhou, & Browning, 2018)
before it was used in the downstream analyses. For details on the
molecular workflow see section ‘Supplementary methods’ in the
Supplementary materials.
Investigations of neutral genetic
variation
Analysis of neutral genetic diversity and population
structure
The Populations software in the Stacks pipeline was used to obtain
estimates of number of private alleles, observed heterozygosity
(HO ), expected heterozygosity
(HE ), and fixation index (Fis ) for each
population. The same software was used to obtain estimates of the
fixation index (FST ) by
Weir and Cockerham (1984), that was used
to assess pairwise population differentiation.
To assess genetic structuring, the full dataset was analysed using
multiple approaches. To determine the most likely number of genetic
clusters (K ), fastSTRUCTURE (Raj,
Stephens, & Pritchard, 2014) was used. Because fine-scaled
differentiation might be concealed by differentiation on higher levels
(Evanno, Regnaut, & Goudet, 2005), we
also ran the analysis with a subset of only the Baltic Sea populations
(anadromous and resident) to test if further differentiation became
evident. A Principal Component Analysis (PCA) based on pairwise genetic
similarity between individuals (proportion of alleles shared) was
visualized with the prcomp function in RStudio 2 (RStudio Team, 2015)
with R (R Core Team, 2012) . For details about settings see
‘Supplementary methods’ in the supplementary materials.
To determine whether the full dataset was representative of neutrally
evolving diversity, the results were compared to those generated by a
subset ’neutral’ dataset. This ’neutral’ dataset was created by
excluding the 5 % tails of the FST distribution
from the full dataset, which should exclude loci under strong selection
(both divergent and balancing). Patterns of genetic structure indicated
by such a neutral dataset should therefore be reflective of only
stochastic processes. The comparison revealed that the results obtained
for the full and the ’neutral’ datasets were qualitatively similar (not
shown), indicating that the full dataset was representative of neutral
diversity. We therefore chose to proceed with the full dataset for the
comparisons with adaptive genetic variation and structure (see below).
Isolation by distance
The full dataset for only the Baltic Sea samples (anadromous and
resident populations) was used to test for the presence of a correlation
between geographical and genetic distances (FST )
between the populations. The freshwater populations were excluded
because we did not expect them to exhibit an IBD pattern because of
geographical isolation among the lakes. We performed a Mantel test
(Mantel, 1967) in Arlequin
(Fdist; Excoffier & Lischer, 2010), the
significance was tested with 1,000 randomization, and the correlation
was visualized in ggplot2 (Wickham,
2016) in R. An additional Mantel test was performed for only the
anadromous populations to test for potential correlation between
geographical and genetic distances within this ecotype.
Testing for effects of environmental variables
To investigate the roles of salinity (ecotype) and/or latitude for
neutral genetic structure, we used the similarity matrix (for the full
dataset) created for the PCA in a distance-based redundancy analysis
(db-RDA), which is a constrained version of a PCA
(db-RDA; Legendre & Anderson, 1999).
Because potential adaptations to salinity either could have resulted
from differences in the spawning habitats, or reflect differences in
salinity in the foraging areas, we wanted to account also for the
salinity gradient in the Baltic Sea in the analysis. To this end, we
chose to use midrange salinity (mean value of minimum and maximum
salinity experienced during the life-cycle) as a proxy for ecotype
(salinity for each population was obtained from
Bendtsen et al. (2007); see Table
1 ). This resulted in that all freshwater populations were assigned a
value of 0, the resident population the highest value of 12, and the
anadromous values in the intermediate range of 1.5 to 6.0.
To make the variables comparable for the db-RDA, both environmental
variables (latitude and midrange salinity) were normalized (to a
standard deviation of 1; by first subtracting the normal value from the
observed value, and then dividing by the standard deviation of all
observed values) before running the db-RDA. For the normalization we
used 0 (psu) as the normal value for salinity (because pike is of
freshwater origin; Craig, 1996;
Raat, 1988), and the mean latitude as
normal value for the latitude. The db-RDA was run in the vegan package
(Oksanen et al., 2019) in R, and the
statistical significance were assessed with 999 permutations.
Phylogenetic analysis
The evolutionary relationship among the samples was investigated with a
maximum likelihood (ML) based phylogenetic inference in RAxML v8.2.10
(Stamatakis, 2014). This was done by
using the full dataset following the steps described in ’The Hard &
Slow Way’ in the RAxML manual (Stamatakis,
2016), which includes optimizing the parameter settings for number of
rate categories and initial rearrangement. After the optimization, the
final analysis with 200 inferences and 1,000 bootstraps was run using 10
as initial rearrangement, 25 rate categories, the GTRCAT model of
nucleotide substitution, and using the E. lucius genome published
by Rondeau et al. (2014) as outgroup. The
phylogenetic tree was then visualized using Interactive Tree Of Life
(iTOL)(Letunic & Bork, 2019).
Investigations of adaptive genetic
variation
Identification of loci putatively under selection
The full dataset was used in the outlier analyses to search for loci
putatively under selection. To test for locus-specific effects,
populations were introduced as separate groups, and the data was
analysed using multiple approaches, in three different software
(BayeScan, Fdist, and LOSITAN; for details about settings see
Supplementary methods in the supplementary materials). Only loci
identified as outliers by all three software were retained. This was
done because differences in the algorithms and assumptions of the
approaches might lead to different SNPs being identified as outliers;
and by conservative selection of only overlapping ones, it is more
likely that the identified outliers are true positives
(de Villemereuil, Frichot, Bazin,
François, & Gaggiotti, 2014).
To test for correlations between selection and ecotype, we utilized
BayeScEnv (de Villemereuil & Gaggiotti,
2015), which aims at differentiating signals of selection from those of
demographic processes by searching for associations between co-variates
(environmental variables) and allele frequencies. A benefit with
BayeScEnv when searching for outlier loci associated with co-variates is
that the software can simultaneously evaluate the roles of two
environmental variables in a single analysis. Thus, using this software
enabled us to search for outlier loci associated with midrange salinity
after statistically accounting for variation due to latitudinal
differences. For this, we used the same normalized values for the two
environmental variables (midrange salinity and latitude) as in the
db-RDA (for details see subsection ’Testing for effects of environmental
variables’ above).
Analysis of adaptive genetic diversity and structure, and
testing for effects of environmental variables
To compare the effects of neutral and adaptive processes on patterns of
genetic diversity, a subset dataset that should represent adaptive
variation (henceforth referred to as the ‘adaptive dataset’) was also
analyzed. This adaptive dataset was created by selecting the 17 loci
that were identified as outliers by all three software used for the
outlier analyses with population grouping, and it should therefore be
reflective of deterministic processes such as selection. To obtain
estimates of HO , and HE ,
the populations software in Stacks was used. The adaptive dataset was
also analyzed using the same fastSTRUCTURE and db-RDA procedures as used
for the full dataset. For details on procedures see ‘Supplementary
methods´ in the supplementary materials.