2.7 Genotype-environment associations: Seascape Genomics
To assess how environmental variables shaped the genetic structure of North Atlantic harbour porpoises, we developed a seascape genomics approach. Specifically, we studied genotype-environment associations (GEA) to identify putative SNPs underlying local adaptation by carrying out a redundancy analysis (RDA) with the R package vegan(Oksanen, 2011). RDA is a multivariate method that finds linear dependencies between response (genotypes) and explanatory variables (environmental predictors). Previous authors (Forester et al., 2018, Capblancq et al., 2018) have found that multivariate GEA analyses, especially constrained ordination approaches such as RDA, detect more effectively multilocus selection than univariate methods and have a superior combination of low false-positive and high true-positive rates (Grummer et al., 2019; Capblancq & Forester 2021). RDA was carried out at the individual level using two datasets, with and without BLS. To control for spatial autocorrelation and other demographic processes, pairwise oceanic distances were calculated from the coordinates of the samples with the R packagemarmap (Pante & Simon-Bouhet, 2013). Subsequently, pairwise oceanic distances were transformed to distance-based Moran’s eigenvector maps (dbMEMs ) with the R package adespatial (Dray et al., 2012) and used as the space variable in the RDA.
Five environmental variables previously identified as significantly associated with genomic variation of Common (Delphinus delphis ) (Barceló et al., 2022) and Bottlenose dolphins (Tursiops truncatus ) (Pratt et al., 2022) were selected as potential predictors of harbour porpoise genomic variation: sea surface temperature (SST), sea surface salinity (SSS), sea current velocity (SCV), sea chlorophyll-A concentration (SCA), and sea primary productivity (SPP). For each of the variables, the annual maximum, mean, minimum and range values were downloaded from the database BioOracle (v2.2)(Tyberghein et al., 2012, Assis et al., 2018). SCVmax and SPPmax were not available in this database, yielding altogether 18 variables. As genomic input we used a set of unlinked called genotypes, identified byPlink (v1.9) (Purcell et al., 2007) with a window size of 20kb, a step size of 10kb and a r2 threshold of 0.1.
To maximize the genetic variance explained by our set of environmental predictors we performed a forward selection with the functionforward.sel of the R package adespatial . The forward selection was carried out for the environmental variables anddbMEMs separately, and only variables explaining a significant proportion (p<0.05) of the genomic variation were retained. To control for multicollinearity, only predictors with a highly conservative variance inflation factor (VIF)<3 were retained. An RDA-based variance partitioning (Capblancq & Forester, 2021) was performed to estimate the independent contribution of the environmental and spatial variables. First, we ran an unconstrained RDA with the retained environmental and spatial variables as predictors to find the variance explained by the full model. Then, we calculated the variance explained by the environmental variables once the influence of spatial variables had been removed by applying a constrained pRDA. Finally, we checked the significance of the RDAs and pRDAs as well as the significance of the environmental variables by an analysis of variance (ANOVA) with 1,000 permutations.
Candidate loci under selection were identified by applying a 3 standard deviation cut-off (p=0.0027) on the SNPs loading scores (Forester et al., 2018) from the two first redundancy axis and we determined to which environmental variable each candidate SNPs is most associated. To further control for false positives, we carried out an additional selection genome scan based on individual genotypes with the R packagePcadapt (Privé et al., 2020). We used four principal components (K), identified by computing a scree plot and choosing the K that minimized the genomic inflation factor (GIF). Then we transformed the p-values in q-values with the R package qvalue and applied a false discovery rate of 0.1 to control for false positives.
To identify candidate genes under local adaptation, we used only the candidate SNPs detected by all three methods RDA-pRDA-Pcadapt on the dataset without BLS and that were associated to salinity (271). We extracted 300 bp of flanking sequences for each candidate SNP resulting in 601 bp long sequences, each containing a single SNP. We then performed a basic local alignment with BLASTN using the nucleotide database of NCBI with an e-value of 1x10-3. After extracting hits per query, we identified gene function and associated biological function with the PubMed and GeneCards (Stelzer et al., 2016) databases.