Model Training Guidelines
Environmental variation was also an important determinant of model
performance in external validation, with predictions being more accurate
when the training set contained data from an environment ‘similar’ to
the testing set. This environmental ‘similarity’ depended on both
seasonal similarity and the extent of GxE present within each
environment. For example, DTB in VF was well-predicted because the
training set contained data from HF and NF, two mild fall plantings that
exhibited high variation in DTB. In contrast, DTB in OF was consistently
underpredicted. Although OF was also a fall planting, Scandinavian and
non-Scandinavian falls were functionally distinct environments. This is
likely because the harshness of Scandinavian falls and winters causes
A. thaliana to be obligate winter cyclers that flower only in
spring (Exposito-Alonso, 2020). The simultaneous spring flowering
resulted in minimal variation in DTB among OF plants and reflected the
minimal GxE present under extreme conditions. This contrasts with the higher
variation and greater GxE seen further south where the plants are
facultative winter cyclers (Li et al., 2014). A model fitted to data
from the early-bolting, highly variable fall cohorts (VF, HF, NF) could
not improve prediction accuracy in the late-bolting, minimally-variable
OF cohort. Taken together, these results indicate that gathering
environmentally diverse data is crucial to maximising confidence in
out-of-sample predictions.
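The external-validation logic described above can be sketched as a leave-one-environment-out loop. The block below is only an illustration under stated assumptions: the function names, the environment labels "A", "B" and "C", and the toy values are hypothetical, and ordinary least squares on temperature stands in for the models actually fitted in this study. A negative bias for a held-out environment corresponds to underprediction of the kind seen in OF.

```python
import numpy as np


def leave_one_environment_out(env_labels):
    """Yield (held_out_env, train_idx, test_idx) for each unique environment."""
    env_labels = np.asarray(env_labels)
    for env in np.unique(env_labels):
        test_mask = env_labels == env
        yield env, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


def external_validation(X, y, env_labels):
    """Fit OLS on all but one environment and score the held-out one.

    OLS is a stand-in model for illustration, not the study's method.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])  # intercept + predictors
    results = {}
    for env, train, test in leave_one_environment_out(env_labels):
        coef, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
        resid = design[test] @ coef - y[test]  # predicted minus observed
        results[str(env)] = {
            "bias": float(resid.mean()),  # negative means underprediction
            "rmse": float(np.sqrt((resid ** 2).mean())),
        }
    return results


# Hypothetical toy data: A and B are mild, variable environments that share
# one temperature response; C is cold, late and nearly invariant.
temperature = np.array([10, 12, 14, 11, 13, 15, 5, 6, 7], dtype=float)
days_to_bolting = np.array([80, 76, 72, 78, 74, 70, 150, 150, 150], dtype=float)
environment = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C"])

results = external_validation(temperature.reshape(-1, 1), days_to_bolting, environment)
for env, scores in results.items():
    print(env, scores)
```

In this toy setting the model trained on A and B extrapolates their response to C and returns a negative bias, because no training environment resembles C. Reporting bias and error per held-out environment, rather than pooled, is one way to reveal such a mismatch.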
Although our study was performed in the well-characterised model species
A. thaliana, we only used generic, easy-to-obtain data and
avoided A. thaliana-specific biological assumptions to ensure our
framework is transferable to non-model species. We computed GSMs using
genomic SNPs selected without prior knowledge of their association with
DTB or SP and defined environmental conditions using only temperature.
In theory, additional predictors known to be biologically relevant, such
as herbivory, soil nutrient level, and soil microbial composition
(C. R. Fitzpatrick et al., 2019; Krannitz et al., 1991; Sills &
Nienhuis, 1995; Weinig et al., 2003), could have been included. In practice, doing
so would decrease model transferability. The information required to
generate predictions in novel conditions becomes harder to obtain as
predictors become more specific and greater ecophysiological knowledge
of the study species is required. The most resource-intensive component
of our framework is obtaining data from multiple environments, an issue
exacerbated by the inclusion of additional predictors. Still, such data
may already exist for species relevant to revegetation. Provenance
testing in trees, for example, has been carried out for centuries
(Mátyás, 1996).