2.4 Statistical Analysis
We tested for genetic differences among and within populations in circadian period and in the range of circadian period. Because the data violated the assumptions of normality and homoscedasticity, we also applied Welch’s heteroscedastic F test (Welch 1951), which does not assume equal variances among groups, in R 3.6.3 with the package “onewaytests” (Dag et al. 2018). This test was used both for differences among populations and for differences among families within a population. The effect of “trial” was also tested, because 18 separate trials were needed to screen all individual replicates. General linear mixed models were used to test the statistical significance of the different factors, with “population” as a fixed effect and “family within population” as a random effect. The linear mixed models were fitted using the lme4 package (Bates et al. 2015) with the restricted maximum likelihood (REML) approach.
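For readers unfamiliar with Welch’s heteroscedastic F test, the statistic and its degrees of freedom can be sketched as follows. This is an illustrative Python sketch of the standard Welch (1951) formulas, not the code used in this study (the analyses were run in R); the function name welch_f and the toy data are hypothetical. The p-value, which would come from an F distribution with the returned degrees of freedom, is not computed here.

```python
from statistics import mean, variance


def welch_f(groups):
    """Return (F, df1, df2) for Welch's heteroscedastic one-way test.

    `groups` is a list of lists of observations, one list per group.
    Each group needs at least two observations and a non-zero variance.
    """
    k = len(groups)
    n = [len(g) for g in groups]
    m = [mean(g) for g in groups]
    # Each group is weighted by n_i / s_i^2, so noisy groups count less.
    w = [ni / variance(g) for ni, g in zip(n, groups)]
    w_total = sum(w)
    weighted_mean = sum(wi * mi for wi, mi in zip(w, m)) / w_total

    numerator = sum(wi * (mi - weighted_mean) ** 2 for wi, mi in zip(w, m)) / (k - 1)
    lam = sum((1 - wi / w_total) ** 2 / (ni - 1) for wi, ni in zip(w, n))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return numerator / denominator, df1, df2


# Toy example (made-up numbers): circadian periods for three populations.
f_stat, df1, df2 = welch_f([[23.1, 23.4, 22.9], [24.0, 24.6, 23.5], [22.0, 22.2, 22.1]])
```

With only two groups, the statistic reduces to the square of Welch’s t statistic, which is a convenient check on any implementation.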
To determine whether the B. stricta populations were structured by simple proximity and to evaluate global spatial autocorrelation for circadian period, we estimated two spatial statistics, Moran’s I and Geary’s C (Geary 1954, Moran 1950). Moran’s I is a standard statistic for spatial data and is widely used to summarize large-scale spatial patterns. Geary’s C is based on differences between pairs of observations and can be more sensitive to smaller neighborhoods. Within the data, some pairs of populations were closer together than others, indicating that Geary’s C was more appropriate. To initially evaluate global spatial autocorrelation, we used Moran’s I, which considers the directionality of spatial association among populations. Values of Moran’s I center around 0: a negative statistic indicates clustering of dissimilar values, a positive statistic indicates clustering of similar values, and a value of 0 indicates randomness and no autocorrelation. For these analyses, we used the population mean and the population range (the span from the mean of the family with the shortest period to the mean of the family with the longest period). Spatial regression was used to determine whether spatial autocorrelation affected the overall distribution of the populations across the environmental variables. To test for spatial dependence in the regression, we fitted spatial error models (which allow spatial correlation among error terms) and spatial lag models (which include a lagged variable to account for autocorrelation), with elevation, mean annual temperature, annual precipitation, and soil texture as predictor variables. We tested the significance of the spatial autocorrelation and used the Akaike Information Criterion (AIC) to identify the best-fit linear models, specifically whether models with or without the spatial terms fit better. Analyses were conducted in R 3.6.3 with packages “sp”, “spdep”, “rgdal”, “spgwr”, and “spatstat” (Baddeley and Turner 2005, Bivand et al. 2013, Bivand et al. 2021, Bivand and Yu 2020, Pebesma et al. 2005, R Core Team 2020).
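The two global statistics described above can be illustrated with a short Python sketch. It is not the code used in this study (the analyses were run in R), and it assumes a binary distance-band neighbor definition, which is only one of several possible weighting schemes and is not necessarily the one applied here. The function names, the threshold parameter, and the example coordinates and values are hypothetical.

```python
from math import hypot


def distance_band_weights(coords, threshold):
    """Binary weights: w[i][j] = 1 if sites i and j lie within `threshold`."""
    n = len(coords)
    return [
        [
            1.0
            if i != j
            and hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1]) <= threshold
            else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def moran_i(values, w):
    """Global Moran's I: weighted cross-products of deviations from the mean."""
    n = len(values)
    xbar = sum(values) / n
    dev = [x - xbar for x in values]
    s0 = sum(sum(row) for row in w)  # total weight
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)


def geary_c(values, w):
    """Global Geary's C: weighted squared differences between paired sites."""
    n = len(values)
    xbar = sum(values) / n
    s0 = sum(sum(row) for row in w)
    sq_diff = sum(w[i][j] * (values[i] - values[j]) ** 2 for i in range(n) for j in range(n))
    return ((n - 1) / (2 * s0)) * sq_diff / sum((x - xbar) ** 2 for x in values)


# Toy example (made-up data): four sites along a transect, neighbors within 1 unit.
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
period_means = [22.8, 23.1, 23.5, 23.9]
weights = distance_band_weights(coords, threshold=1.0)
i_stat = moran_i(period_means, weights)
c_stat = geary_c(period_means, weights)
```

The two statistics are read on different scales: for Geary’s C, values below 1 suggest positive autocorrelation and values above 1 suggest negative autocorrelation, whereas Moran’s I is read around 0 (strictly, its expectation under no autocorrelation is −1/(n − 1), which approaches 0 as the number of sites grows).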
To test for associations between circadian values and environmental variables, we first used multivariate linear regression. The response variables were the population mean and the within-population range of circadian period. All 27 environmental predictors were included in the models. We first fit the complete model with all variables and then used AIC-based model selection to determine the best-fit model. The environmental variables, however, exhibited significant multicollinearity, and we therefore used principal component analysis to reduce the dimensionality of the data. Analyses were conducted in R 3.6.3 with packages “mctest”, “GGally”, and “corpcor” (Imdadullah et al. 2016, Schafer et al. 2017, Schloerke et al. 2021) for testing multicollinearity and “ggbiplot” (Vu 2011) for principal component analysis. Having reduced data dimensionality, we used partial least squares regression (PLS) and principal component regression (PCR), in which the predictor variables included all climate and soil factors and the response variables were mean circadian period and the within-population range of circadian period. PCR first computes the principal components of the predictors and then uses these components as predictors in a regression against the response variable (Jolliffe 1982). PLS regression is similar to PCR but is supervised: the response variable informs how the predictors are combined into components (Wold and Eriksson 2001). Variables were standardized by dividing each by its standard deviation. The best model was selected by cross-validation, both among candidate models within each method and between the PLS and PCR models. After the optimal model was determined, we calculated the contribution of each coefficient. For this analysis, we used R 3.6.3 and packages “pls” and “caret” (Kuhn 2020, Mevik et al. 2020).
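The PCR procedure and the cross-validated choice of the number of components can be sketched as below. This is an illustrative Python sketch, assuming NumPy is available; it is not the code used in this study, which relied on the R packages named above. The function names (pcr_fit, pcr_predict, loo_rmsep) are hypothetical, the data are simulated, and leave-one-out cross-validation is an assumption, since the cross-validation scheme is not specified here.

```python
import numpy as np


def pcr_fit(X, y, n_components):
    """Principal component regression on centered, unit-variance predictors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mu, sd = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mu) / sd
    # Rows of Vt are the principal axes of the standardized predictors.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    V = Vt[:n_components].T  # loadings, shape (p, k)
    scores = Z @ V           # component scores, shape (n, k)
    gamma, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
    beta = V @ gamma         # coefficients on the standardized predictors
    return beta, y.mean(), mu, sd


def pcr_predict(X, beta, intercept, mu, sd):
    """Predict the response for new rows of X from a pcr_fit result."""
    return intercept + ((np.asarray(X, dtype=float) - mu) / sd) @ beta


def loo_rmsep(X, y, n_components):
    """Leave-one-out root mean squared error of prediction."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        fit = pcr_fit(X[keep], y[keep], n_components)
        errors.append(y[i] - pcr_predict(X[i:i + 1], *fit)[0])
    return float(np.sqrt(np.mean(np.square(errors))))


# Simulated example: choose the number of components with the lowest RMSEP.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = 24.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.1, size=30)
best_k = min(range(1, 6), key=lambda k: loo_rmsep(X, y, k))
```

When all components are retained, PCR reproduces ordinary least squares on the standardized predictors, so the benefit under multicollinearity comes from keeping only the leading components. PLS differs in that its components are built using the response as well as the predictors, and it is not implemented in this sketch.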