2.4 Statistical Analysis
We tested for genetic differences among and within populations in
circadian period and the range of circadian period. Because data
distributions violated the assumptions of normality and
homoscedasticity, we also applied Welch’s heteroscedastic F test
(Welch 1951), which is robust to unequal variances, in R 3.6.3 with
package “onewaytests” (Dag et al. 2018), to test for significant
differences both among populations and among families within a
population. The effect of “trial”
was also tested, as 18 separate trials were conducted to screen the
complete number of individual replicates. General linear mixed models
were used to test for the statistical significance of different factors,
with “population” as a fixed effect and “family within population”
as a random effect. The linear mixed models were fitted using the lme4
package (Bates et al. 2015) with the restricted maximum likelihood
(REML) approach.
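The Welch statistic described above can be illustrated with a minimal sketch. This is not the authors' code (the analysis was run with the R package cited above); it is a standard-library Python rendering of the Welch (1951) formula, and the function name and the example period values are hypothetical.

```python
from statistics import mean, variance


def welch_f(groups):
    """Welch's heteroscedastic F statistic for k independent groups.

    Returns (F, df1, df2). Each group needs at least two observations
    and a non-zero sample variance.
    """
    k = len(groups)
    n = [len(g) for g in groups]
    m = [mean(g) for g in groups]
    # Each group is weighted by the inverse of its squared standard error.
    w = [ni / variance(g) for ni, g in zip(n, groups)]
    w_total = sum(w)
    weighted_mean = sum(wi * mi for wi, mi in zip(w, m)) / w_total
    between = sum(wi * (mi - weighted_mean) ** 2 for wi, mi in zip(w, m)) / (k - 1)
    lam = sum((1 - wi / w_total) ** 2 / (ni - 1) for wi, ni in zip(w, n))
    f_stat = between / (1 + 2 * (k - 2) / (k ** 2 - 1) * lam)
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, k - 1, df2


# Hypothetical circadian periods (hours) for three populations.
populations = [
    [23.1, 23.8, 24.4, 23.5],
    [24.9, 25.6, 24.2, 26.1, 25.0],
    [22.7, 23.0, 22.9],
]
f_stat, df1, df2 = welch_f(populations)
```

A p-value would follow from the upper tail of the F distribution with (df1, df2) degrees of freedom, for example with scipy.stats.f.sf(f_stat, df1, df2) if SciPy is available.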
To determine if the B. stricta populations were structured by
simple proximity and to evaluate global spatial autocorrelation for
circadian period, we estimated two spatial statistics, Moran’s I and
Geary’s C (Geary 1954, Moran 1950). Moran’s I is a standard for spatial
data and is widely utilized to provide an overall statistic for
large-scale analysis of spatial patterns. Geary’s C is better used to
determine differences between pairs of observations and can be more
sensitive to smaller neighborhoods. Within the data, some pairs of
populations were closer than others, indicating that Geary’s C was more
appropriate. To initially evaluate global spatial autocorrelation, we
used Moran’s I, which considers the directionality of spatial
association among populations. Values of Moran’s I center around 0: a
negative statistic indicates that dissimilar values neighbor one another
(dispersion), a positive statistic indicates clustering of similar
values, and a value near 0 indicates randomness and no autocorrelation.
For these analyses, we used the population mean and the population range
(the difference between the means of the families with the shortest and
longest circadian periods). Spatial regression was used to
determine whether spatial autocorrelation affected the relationship
between circadian values and the environmental variables. To
test for spatial dependence in the regression, spatial error (spatial
correlation between error terms) and spatial lag (using a variable to
account for autocorrelation) models with elevation, mean annual
temperature, annual precipitation, and soil texture as predictor
variables were used. We tested for the significance of the spatial
autocorrelation and used the Akaike information criterion (AIC) to
identify the best-fit linear models, specifically whether models with or
without the spatial terms fit better. Analyses were conducted in R 3.6.3 with
packages “sp”, “spdep”, “rgdal”, “spgwr”, and “spatstat”
(Baddeley and Turner 2005, Bivand et al. 2013, Bivand et al. 2021,
Bivand and Yu 2020, Pebesma et al. 2005, R Core Team 2020).
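The two global statistics used above can be written out in a few lines. The sketch below is a standard-library Python illustration, not the authors' R code. The weighting scheme is an assumption: the text does not state how spatial weights were defined, so inverse-distance weights on planar coordinates are shown as one common choice, and all function names and example values are hypothetical.

```python
from math import dist
from statistics import mean


def inverse_distance_weights(coords):
    """Assumed weighting scheme: w_ij = 1 / d_ij, with zeros on the diagonal.

    Expects planar (projected) coordinates; latitude and longitude would
    need a great-circle distance instead.
    """
    n = len(coords)
    return [[0.0 if i == j else 1.0 / dist(coords[i], coords[j])
             for j in range(n)] for i in range(n)]


def morans_i(x, w):
    """Global Moran's I: cross-products of deviations from the mean."""
    n, x_bar = len(x), mean(x)
    s0 = sum(sum(row) for row in w)
    num = sum(w[i][j] * (x[i] - x_bar) * (x[j] - x_bar)
              for i in range(n) for j in range(n))
    den = sum((v - x_bar) ** 2 for v in x)
    return (n / s0) * num / den


def gearys_c(x, w):
    """Global Geary's C: squared differences between pairs of observations."""
    n, x_bar = len(x), mean(x)
    s0 = sum(sum(row) for row in w)
    num = sum(w[i][j] * (x[i] - x[j]) ** 2
              for i in range(n) for j in range(n))
    den = sum((v - x_bar) ** 2 for v in x)
    return ((n - 1) / (2 * s0)) * num / den


# Hypothetical population coordinates (km) and mean circadian periods (h).
coords = [(0.0, 0.0), (1.0, 0.5), (4.0, 3.0), (4.5, 3.5), (9.0, 1.0)]
period = [23.4, 23.6, 24.8, 25.0, 24.1]
w = inverse_distance_weights(coords)
i_stat, c_stat = morans_i(period, w), gearys_c(period, w)
```

Under no spatial autocorrelation, the expected value of Moran's I is -1/(n - 1) and that of Geary's C is 1. Values of C below 1 point to positive autocorrelation and values above 1 to negative autocorrelation.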
To test for associations between circadian values and environmental
variables, we first used multiple linear regression. The response
variables were the population mean circadian period and the
within-population range of circadian period. All 27 environmental
predictors were included in the models. We first fit the complete model
with all variables and then used AIC-based model selection to identify
the best-fit model. Environmental
variables, however, exhibited significant multicollinearity, and we
therefore used principal component analysis to reduce the dimensionality
of the data. Analyses were conducted in R 3.6.3 with packages
“mctest”, “GGally”, and “corpcor” (Imdadullah et al. 2016, Schafer
et al. 2017, Schloerke et al. 2021) for testing multicollinearity and
“ggbiplot” (Vu 2011) for principal component analysis. Having reduced
data dimensionality, we used partial least squares regression (PLS) and
principal component regression (PCR), where the predictor variables
included all climate and soil factors and the response variables were
mean circadian period and the within-population range of circadian period.
PCR first computes the principal components of the predictors, and then
uses these components as predictors in a regression against the response
variable (Jolliffe 1982). PLS regression is similar to PCR but is
supervised: the components are constructed using the response as well as
the predictors (Wold and Eriksson 2001). Variables were standardized by
dividing each by its standard deviation. The best model was selected by
cross-validation, applied both within each model and as a comparison of
the PLS and PCR models. After the optimal model was determined, we
calculated the contribution of each coefficient. For this
analysis, we used R 3.6.3 and packages “pls” and “caret” (Kuhn 2020,
Mevik et al. 2020).
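The supervised construction of a PLS component and the use of cross-validation can be illustrated with a minimal sketch. It is a standard-library Python rendering of a single-component PLS fit with leave-one-out cross-validation, not the authors' analysis. The number of components, the cross-validation scheme, the function names, and the example data are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def fit_pls1(X, y):
    """Fit a one-component PLS regression on standardized predictors."""
    p = len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    mu = [mean(c) for c in cols]
    sd = [stdev(c) for c in cols]  # each predictor is divided by its SD
    y_bar = mean(y)
    Z = [[(row[j] - mu[j]) / sd[j] for j in range(p)] for row in X]
    yc = [v - y_bar for v in y]
    # Supervised step: the weights are the predictor-response covariances.
    w = [sum(z[j] * yi for z, yi in zip(Z, yc)) for j in range(p)]
    norm = sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    scores = [sum(zj * wj for zj, wj in zip(z, w)) for z in Z]
    q = sum(t * yi for t, yi in zip(scores, yc)) / sum(t * t for t in scores)
    coef = [wj * q for wj in w]  # coefficients on the standardized scale
    return {"mu": mu, "sd": sd, "y_bar": y_bar, "coef": coef}


def predict(model, x):
    """Predict the response for one row of raw predictor values."""
    z = [(v - m) / s for v, m, s in zip(x, model["mu"], model["sd"])]
    return model["y_bar"] + sum(b * zj for b, zj in zip(model["coef"], z))


def loo_rmsep(X, y):
    """Leave-one-out root mean squared error of prediction."""
    errors = []
    for i in range(len(X)):
        model = fit_pls1(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        errors.append((y[i] - predict(model, X[i])) ** 2)
    return sqrt(mean(errors))


# Hypothetical data: two correlated predictors, one response per population.
X = [[1200.0, 4.1], [1500.0, 3.2], [1800.0, 2.6], [2100.0, 1.5], [2400.0, 0.9]]
y = [23.2, 23.9, 24.3, 24.8, 25.5]
model = fit_pls1(X, y)
rmsep = loo_rmsep(X, y)
```

In a PCR counterpart, the weight vector would instead be the leading principal direction of the standardized predictors, computed without reference to the response. Comparing the cross-validated error of the two fits mirrors the model comparison described above.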