Introduction
Rapid evolution—or evolution on the scale of tens of generations—is now a well-established cause of genetic change in populations (Colautti & Lau, 2015; Ellner et al., 2011; Hendry & Kinnison, 1999). Numerous studies have described the results of rapid evolution across diverse organisms (Campbell-Staton et al., 2017, 2021; Lee, 2002; Yin et al., 2021), as well as how rapid genetic change has been driven by different evolutionary processes (Barrett et al., 2019; Hof et al., 2016; Kardos et al., 2018; Lamichhaney et al., 2018). While rapid genetic adaptation is often the focus of these studies, both theoretical and empirical methods have demonstrated the strong effect of genetic drift on allele frequencies, even over a handful of generations, and is particularly apparent in bottlenecked populations with significant founder effects (Allendorf, 1986; Kardos et al., 2018). With this study, we show the genome-wide outcomes of a documented population bottleneck originating from a single introduction event. We also show a subsequent response to selection imposed by the introduction into a radically different environment.
Introduced and invasive species make for excellent model systems to study rapid evolution given the nature of most introductions—e.g., small founding populations, novel environments, or multiple progenitor sources (Kolbe et al., 2004; North et al., 2021; Stern & Lee, 2020). Because of these features, species introductions provide useful templates to answer questions about the evolutionary mechanisms that determine if organisms are successful in responding to new environmental conditions. Species introductions can similarly serve as model systems for studying the potential for rapid evolution in threatened and endangered species, or less tractable populations, such as those that are long-lived (Lee, 2002; North et al., 2021).
Populations introduced into novel environments may experience rapid evolution because of different agents of genetic change. On one hand, the often-novel nature of recipient ecosystems—e.g., new pathogens, prey, and predators or differences in the availability of fundamental resources—is expected to result in rapid genetic adaptation (North et al., 2021). However, introduced populations are most often founded from far fewer individuals than the progenitor population and these introductions constitute a simultaneous demographic and genetic bottleneck even with multiple founder events (Allendorf et al., 2022). While a reduction in founding individuals relative to the source population will always result in demographic bottleneck, the effects of a genetic bottleneck are determined by the effective population size of the progenitor population (Waples, 2022). That is, other factors being the same, introduced populations that are founded from large, genetically diverse progenitor populations are expected to have a less extreme genetic bottleneck relative to populations founded from progenitors that are less genetically diverse and have smaller effective population sizes. Moreover, the genetic consequences (e.g., the magnitude of the reduction of genetic diversity and the likelihood of subsequent inbreeding) of the bottleneck will influence the likelihood of persistence as a function of rapid genetic adaptation.
In introduced populations, both genetic drift and natural selection can result in reduced genetic diversity relative to their progenitors. While it remains challenging, separating the not always mutually exclusive effects of genetic drift and genetic adaptation has become more feasible with whole-genome sequencing. The ability to contrast local with genome-wide patterns of allelic change provides a useful template with which to disentangle genetic adaption and genetic drift since the effects of drift are expected to occur more evenly across the genome while the effects of selection are expected to manifest within local genomic regions.
Salmon, trout, and their allies (Family Salmonidae) are among the most commonly introduced species globally because of their desirability as a commercial commodity and recreational species (Lowe et al., 2000). One of the iconic introductions of salmonids globally is the purposeful introduction of Pacific salmon and trout (Oncorhynchus spp.) to the Laurentian Great Lakes (hereafter Great Lakes) where they now make up the most popular targets of a $7 billion per year fishery (Melstrom & Lupi, 2013; Mills et al., 1994; Parsons, 1973). While the majority of salmon species introduced to the Great Lakes were planted with intent of new economic and recreational enterprises—with numerous stocking efforts made to drive those introductions—pink salmon (O. gorbuscha) were introduced in a single event in 1956 as the result of a single accidental introduction (Kwain & Lawrie, 1981; Schumacher & Eddy, 1960). The progeny of an odd-year spawning stock (1955, discussed below), collected from the Lakelse River in British Columbia, Canada, were intended for introduction into the Canadian Arctic (Baffin Bay, CA). However, thousands of juveniles were accidentally released in Lake Superior as hatched fry (N ≈ 21,000, a subset of more than 750,000 progeny from the Artic introduction program made from ~500 dame and sire pairs) in the Current River in Thunder Bay, Ontario (Gharrett & Thomason, 1987). Pink salmon spread rapidly throughout the Great Lakes with individuals being observed in every Great Lake, and populations growing to the 10s of thousands and potentially 100s of thousands, within 12 generations of the introduction (Kwain & Lawrie, 1981). This spread was despite what is presumed to be a strong population bottleneck at introduction—a founding population estimated on the order of hundreds of individuals—due to the very low survival of fry to adults (~2%) and the likelihood of a significant portion of the population straying to new habitats (~10%) (Gharrett & Thomason, 1987; Kwain & Lawrie, 1981; Quinn, 2018).
The introduction of pink salmon to the Great Lakes is an excellent system to study the combined effects of genetic drift and natural selection on rapid evolution because of their small founding population size, the lack of additional propagule pressure (no subsequent gene flow), the radically different environment they were introduced to in the Great Lakes, and the rise of novel age-at-maturity life histories. Pink salmon are obligate anadromous fish in their native range—meaning they must spend most of their pre-spawning life, about one-and-half of their two years, in a marine environment. This life history strategy is largely unique in the genus Oncorhynchus, where pink salmon and their sister species, chum salmon (O. keta), migrate almost immediately to marine environments while the other anadromous species all maintain non-migratory, freshwater ecotypes or can have long periods of freshwater residency. Pink salmon are also unique within the genus in that they have a fixed two-year life cycle in their native range (i.e., time from hatching to death is 2 years for all individuals and other age-at-maturity phenotypes are exceedingly rare (Anas, 1959; Quinn, 2018; Turner & Bilton, 1968)). As a result of this feature, lineages of odd- and even- year fish have diverged, such that a river may have distinct populations spawning in the same habitat in both even and odd numbered years. Because of this lineage divergence, genetic analyses indicate pink salmon are more related to one another based on allochrony, rather than geography—e.g., all odd-year fish are more related to one another than they are to an even year fish in the exact same habitat—whereas geography is the dominant feature structuring populations in most other salmonids (Christensen et al., 2021; Fraser et al., 2011; Seeb et al., 2014; Tarpey et al., 2017). However, this fixed life history in pink salmon has eroded in the Great Lakes, with fish maturing at 1-, 2, and 3-years old, leading to the establishment of odd- and even-year spawning groups in the Great Lakes (despite only an odd-year progenitor) that appear to maintain gene flow across years (Bagdovitz et al., 1986; Gharrett & Thomason, 1987; Kwain & Chappel, 1978; Wagner & Stauffer, 1980). Given the presumed bottleneck at introduction, the novel recipient environment, and the rise of novel age-at-maturity life histories, pink salmon in the Great Lakes are expected to have rapidly evolved via multiple evolutionary forces.
Our study aimed to characterize and disentangle genetic drift and genetic adaptation of pink salmon introduced to the Great Lakes. We sequenced the genomes of pink salmon from the Great Lakes and their progenitor population to answer the following questions: i) what were the magnitude of the population and genetic bottlenecks at introduction and their corresponding reduction of genetic diversity?, and ii) what loci are associated with putative rapid genetic adaptation to the Great Lakes? We employed medium coverage (~14x) whole genome-resequencing and found evidence for a severe genetic bottleneck at introduction along with associated reductions in genetic diversity, as well as multiple genomic regions that may have responded to selection in the novel ecosystem.
Methods
Sample sites and sample collection
Pink salmon were collected from the Lakelse River, British Columbia, Canada (Figure 1a) in the fall of 2006 and 2007 as part of an earlier population genetics study (Beacham et al., 2012). The odd-year spawning population from the Lakelse River was the sole progenitor population for pink salmon introduced to the Great Lakes (Gharrett & Thomason, 1987; Schumacher & Eddy, 1960). Lake Superior tributaries were sampled in September and October of 2018 and 2019 for even- and odd-year spawning fish, respectively. Fish collected in Lake Superior were from the Steel River on the north shore of Lake Superior (Figure 1b,c), one of the largest pink salmon spawning groups in the Great Lakes. The Steel River is 170 km east of the original site of introduction, the Current River (Kwain, 1982; Kwain & Lawrie, 1981). For Great Lakes fish, biometric data (length, weight, phenotypic sex) and a portion of the dorsal fin were collected for subsequent aging. Caudal fin tissue was collected for DNA extraction. In their native range, pink salmon from the Lakelse River form two distinct odd- and even-year populations without gene flow, however in the Great Lakes, fish spawning in odd- and even-years are maintaining gene flow across year classes (Figure S1). As such, we refer to all the groups in our study as sample groups, despite the strong population divergence between the even- and odd-year samples from the same location in the native range. Hereafter, the native range sample groups will be referred to by their year of spawning: British Columbia odd (progenitor) and British Columbia even, which we shorten to BC odd and BC even. We refer to the introduced two-year old sample groups as Great Lakes odd and Great Lakes even, which we shorten to GL odd and GL even. We also sampled 3-yr old pink salmon (a life history strategy unique to the Great Lakes), which we shorten to GL odd 3 (Table 1). It is important to note that a 3-yr old pink salmon that spawns in an odd-year would have parents that spawned in an even-year. These 3-yr old fish are the primary way that gene flow occurs between odd and even years in the Great Lakes and 3-yr old fish are present in both odd and even years though we did not sample any in an even year.
Pink salmon aging
Dorsal fins from all sampled individuals were dried, separated, set in epoxy, and read according to the protocols outlined in Koch & Quist (Koch & Quist, 2007) and Little et al. (Little et al., 2012). To age fish, cross sections of dorsal fin rays set in epoxy and were read at 4.5x power on an Olympus SZ61 digital microscope. Samples were aged by counting annuli in fin rays. Photos were taken of each cross section using the digital microscope and associated IFINITY CAPTURE (v. 6.3.2) software. These photos were also sent to an outside reviewer to confirm accuracy of age determination. Samples from the native range did not include aging structures but were all safely assumed to be 2-years old due to their fixed age-at-maturity, which is the convention for all pink salmon populations in the native range (Gharrett & Thomason, 1987; Quinn, 2018).
DNA extraction, library preparation, and sequencing
Tissue samples collected from British Columbia were extracted using Promega Wizard extraction kits and protocols. Sample aliquots were dried down and stored frozen as pellets in population-specific 96-well plates. These aliquots were sent to Purdue University in 2019, where they were brought into solution with 10 mg of ultrapure water. Tissue samples collected from Great Lakes fish were extracted using Qiagen DNeasy blood and tissue kits using the standard protocol for tissue. Sample concentrations for all samples were determined using a Qubit 4 and the high-sensitivity dsDNA quantification assay.
For genomic sequencing, thirty samples each from BC odd and BC even were randomly chosen without respect to sex as it was unknown for those samples. For the verified 2-year old fish from the Great Lakes odd and even sample groups (2018 and 2019), thirty fish from each year were randomly chosen with an even sex ratio of 15 males to 15 females. Additionally, 15 fish from the GL odd 3yr sample group were included for sequencing (Table 1). Libraries were prepared at the Purdue Genomics core using Illumina DNA Prep kit and protocols with a 1/5 reaction dilution and a set of Illumina DNA Prep compatible amplification primers to add standard Illumina 384-well-design unique dual indexes. One sample failed after library preparation such that only 29 BC odd samples were sequenced. All 134 samples were transported to the Indiana University School of Medicine Genomics Core where they were sequenced using 2x150 base pair paired-end chemistry on an Illumina NovaSeq 6000 S4 for 300 cycles in two separate runs; all samples were included in each of the two runs.
Bioinformatics, variant calling, and filtering
Bioinformatic analyses were conducted using the Purdue Rosen Center for Advanced Computing (McCartney et al., 2014). After sequencing, there was an average of 285.43 million reads per sample (~14.12× coverage of the 2.7 Gb genome, Table S1). Raw reads were trimmed of adapters, and low-quality reads removed using Trimmomatic v. 0.39 with the clipping command LEADING:20 TRAILING:20 MINLEN:30 (Bolger et al., 2014). For SNP calling, we used the Genome Analysis Toolkit v. 4.2.2 (GATK) germline short variant discovery pipeline with best practices unless otherwise noted (Van der Auwera & O’Connor, 2020). First, cleaned reads were aligned to the OgorEven_v1.0 even year pink salmon assembly (GCF_021184085.1) using bwa mem v. 07.17 with the -M flag to mark shorter splits as secondary to avoid aligning to multiple sites as paralogs are common in salmonid genomes (Christensen et al., 2021; Li, 2013). While chromosomal-level assemblies exist for both odd and even-year lineages, the even-year assembly for pink salmon is the reference assembly for the species; it is a much more complete assembly, was built using higher quality sequencing resources (i.e., Hi-C), and is the only assembly that contains annotation information (Christensen et al., 2021). The BAM files were merged into single files using Picard v. 2.20.8 AddOrReplaceReadGroups and MergeSamFiles programs (Piccard Tools, 2020) because reads were generated from two different sequencing runs (though all samples were included in each run). We then marked duplicate reads using GATK’s MarkDuplicatesSpark to generate appropriate BAM files for variant calling.
Variants were called using HaplotypeCaller (in GVCF mode), consolidated with GenomicsDBImport, and then jointly genotyped using GenotypeGVCFs in GATK. All steps were preformed jointly for all sample groups listed above, and the resulting pre-filtering file contained 24,071,073 SNPs. To filter SNPs, we first used a GATK recommended hard filter thresholds (see SI Methods) and all filtering was conducted using VCFtools v. 0.1.16 (Danecek et al., 2011). After the hard filtered loci were removed, samples were filtered in the following order: remove non-biallelic sites, remove individuals with >20% missingness (one individual from the GL even group was removed), remove sites with >20% missingness, remove sites with a minor allele frequency <0.05. Filtering for minor allele frequency was executed with both a joint vcf file with for all individuals (across all sample groups) (5,289,687 total SNPs remaining), as well as for sample group-specific vcf files. Hereafter, these vcfs are referred to as the joint vcf or sample group-specific vcfs, respectively (Table S2 includes the full array of differently filtered vcf files used in this study and the corresponding analyses for which they were used). For downstream analyses, our default approach was to use the joint vcf unless otherwise noted (e.g certain estimators can be biased if low frequency sites are removed, so we used the sample group-specific vcfs for those analyses). All filtered sample groups included even sample sizes (29 or 30 individuals after filtering and assessing for relatedness, Table 1)—one sample was removed because of relatedness (Figure S1, SI Methods) and one sample, described above, for high amounts of missing data. The Great Lakes odd 3-yr old group contained 15 individuals and this group was removed from analyses unless explicitly noted due to the imbalance in sample sizes.
Population genetic analysis
Variant Call Format (vcf) files were loaded into R v. 4.2.1. (R Core Team, 2022) using the gaston v1.5.7 package (Perdry et al., 2020). Using sample group-specific vcfs that were not filtered for minor allele frequency (to maintain rare alleles; Table S2), we calculated the number of SNPs both genome-wide and in 2.5 Mbp windows. Genome-wide estimates of observed heterozygosity were calculated using custom scripts in 2.5 Mbp windows. Tajima’s D was calculated in VCFtools using the same window sizes but using the joint vcf before it was filtered on minor allele frequency (Table S2); Tajima’s D is a sequence-based estimator and can be upwardly biased if invariant or low-frequency sites are not included. We used the R package SNPRelate v.1.32.2 (Zheng et al., 2012) to calculate mean pairwise FST (Weir & Cockerham, 1984), which used a linkage disequilibrium pruned (R2 £ 0.2) subset of 123,924 SNPs. As an additional measure of population structure, we used the program fastSTRUCTURE v. 1.0-py27 (Raj et al., 2014) with our joint vcf to generate population assignments using a subset of 10,000 randomly selected SNPs using K values ranging from 1-5.
Effective population size estimates
We used two methods to understand the likely bottleneck and relative population sizes at introduction and in our contemporary samples. These two methods were employed based on their respective strengths and weaknesses: 1) the site frequency spectrum-based method (Stairway Plot 2) performs well at detecting historical changes in population size (Liu & Fu, 2020), and 2) the linkage-disequilibrium based method (GONE) (Santiago et al., 2020) performs well at estimating recent demographic shifts and contemporary (<100 generations) estimates of effective population size (Reid & Pinsky, 2022; Santiago et al., 2020). We used default input parameters in GONE which samples 50,000 random SNPs per chromosome and ran the program 100 times. We next calculated the average of those runs and computed the 95% quantiles. To compare with estimates of effective population sizes from farther back in time, we used Stairway Plot 2 (see SI Methods, Results).
SNP variant analysis and annotation
For outlier SNPs under putative selection, we first calculated (Weir & Cockerham, 1984) FST using VCFtools between the BC odd and GL odd samples. To identify outlier regions, we calculated sliding window outliers using a 100 kbp window with a 50 kbp step, for which we calculated ZFST relative to the chromosome a window was on and considered any window ³5 ZFST as a putative outlier region. We chose the 5 ZFST threshold as an especially conservative first metric for regions putatively under selection and then used a marginally less conservative SNP cutoff for individual locus annotation using snpEff (discussed below). For SNPs within putative outlier regions, we similarly transformed per SNP FST with a Z-transformation normalized across the chromosome; any SNP with ³4 ZFST was retained for variant annotation. We visually compared our outlier windows with Tajima’s D values calculated using VCFtools in 10 kbp steps. To validate these results, we conducted the same analyses using the same BC odd and the closely related GL even sample groups to ensure the same outlier windows were maintained. We also used eigenGWAS, an eigenvector decomposition-based method using a genome-wide association study approach (GWAS), but where genotypes are treated as the phenotypes in the GWAS (Chen et al., 2016). This method was similarly used to discover highly differentiated windows that would be consistent with putative selection at a locus. Variants from eigenGWAS were plotted against ZFST windows to further verify our best candidate outliers. We considered any window with a mean log transformed p-value ³ 6.64 (Bonferroni corrected p-value = 0.01) as significant.
We next used the programs snpEff and snpSift (v. 5.1) to annotate and describe the effect of outlier SNPs in our data set (Cingolani, Patel, et al., 2012; Cingolani, Platts, et al., 2012). We first filtered the joint vcf for individual SNPs as discussed above (³4 ZFST) so that we only used SNPs highly differentiated between sample groups. The program snpEff uses the reference genome assembly and associated annotation report to determine the functional effect (e.g., missense, synonymous) of a given SNP relative to the assembly. Individual SNPs could have multiple effects, for example they may occur in the intronic region of one gene, in addition to occurring upstream of a separate gene.
Simulation of genetic drift for candidate loci
We leveraged estimated pink salmon effective population sizes through time (GONE results) to determine whether the empirically observed changes in allele frequencies were due to genetic drift alone at our best candidate locus (per2, which had a total of 38 SNPs; see Results). To do this, we first created an agent-based model that simulated the unique life-history characteristics of pink salmon (see SI Methods for full details). The carrying capacity of the adults was set equal to the mean effective population size calculated from GONE in each generation (averaged across the 100 GONE replicates) and was varied each year in the model. This approach was conservative as the ratio of Ne to Nc often equal 10-20% (Frankham, 1995; Waples, 2002); the smaller