Model Training Guidelines

Environmental variation was also an important determinant of model performance in external validation: predictions were more accurate when the training set contained data from an environment ‘similar’ to the testing set. This environmental ‘similarity’ depended on both seasonal similarity and the extent of GxE present within each environment. For example, DTB in VF was well-predicted because the training set contained data from HF and NF, two mild fall plantings that exhibited high variation in DTB. In contrast, DTB in OF was consistently underpredicted. Although OF was also a fall planting, Scandinavian and non-Scandinavian falls were functionally distinct environments. This is likely because the harshness of Scandinavian falls and winters causes A. thaliana to be obligate winter cyclers that flower only in spring (Exposito-Alonso, 2020). The simultaneous spring flowering resulted in minimal variation in DTB among OF plants and reflected the minimal GxE present under extreme conditions. This contrasts with the higher variation and greater GxE seen further south, where the plants are facultative winter cyclers (Li et al., 2014). A model fitted to data from the early-bolting, highly variable fall cohorts (VF, HF, NF) could not improve prediction accuracy in the late-bolting, minimally variable OF cohort. Taken together, these results indicate that gathering environmentally diverse data is crucial to maximising confidence in out-of-sample predictions.
Although our study was performed in the well-characterised model species A. thaliana, we used only generic, easy-to-obtain data and avoided A. thaliana-specific biological assumptions to ensure our framework is transferable to non-model species. We computed GSMs using genomic SNPs selected without prior knowledge of their association with DTB or SP and defined environmental conditions using only temperature. In theory, additional predictors known to be biologically relevant, such as herbivory, soil nutrient level, and soil microbial composition, could have been included (C. R. Fitzpatrick et al., 2019; Krannitz et al., 1991; Sills & Nienhuis, 1995; Weinig et al., 2003). In practice, doing so would decrease model transferability: as predictors become more specific, the information required to generate predictions in novel conditions becomes harder to obtain and greater ecophysiological knowledge of the study species is required. The most resource-intensive component of our framework is obtaining data from multiple environments, an issue exacerbated by the inclusion of additional predictors. Still, such data may already exist for species relevant to revegetation. Provenance testing in trees, for example, has been carried out for centuries (Mátyás, 1996).
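
To make concrete what computing a GSM from SNPs "selected without prior knowledge" of trait association can involve, the following is a minimal sketch only. It assumes a VanRaden-style estimator (centred allele counts, scaled by the summed allelic variance); the estimator used in this study is not stated here, so this choice, the function name genomic_similarity_matrix, and the toy genotypes are all illustrative assumptions rather than the study's method or data.

```python
from typing import List


def genomic_similarity_matrix(genotypes: List[List[int]]) -> List[List[float]]:
    """Return an n x n genomic similarity matrix (hypothetical helper).

    genotypes[i][j] is the alternate-allele count (0, 1 or 2) of
    individual i at SNP j. Assumes a VanRaden-style estimator; this is
    an illustrative choice, not necessarily the one used in the study.
    """
    n = len(genotypes)
    m = len(genotypes[0])

    # Alternate-allele frequency of each SNP across all individuals.
    freqs = [sum(row[j] for row in genotypes) / (2 * n) for j in range(m)]

    # Drop monomorphic SNPs: they carry no information on similarity.
    # No trait data are consulted, so selection is blind to DTB or SP.
    keep = [j for j in range(m) if 0.0 < freqs[j] < 1.0]
    if not keep:
        raise ValueError("no polymorphic SNPs available")

    # Centre each allele count by its expected value, 2p.
    centred = [[row[j] - 2.0 * freqs[j] for j in keep] for row in genotypes]

    # Scale by the summed allelic variance, 2 * sum(p * (1 - p)).
    denom = 2.0 * sum(freqs[j] * (1.0 - freqs[j]) for j in keep)

    return [
        [sum(a * b for a, b in zip(centred[i], centred[k])) / denom
         for k in range(n)]
        for i in range(n)
    ]


# Toy genotypes (made up for illustration): 4 inbred lines, 3 SNPs.
toy_genotypes = [
    [0, 2, 2],
    [0, 2, 0],
    [2, 0, 0],
    [2, 0, 2],
]
gsm = genomic_similarity_matrix(toy_genotypes)
for row in gsm:
    print([round(value, 3) for value in row])
```

Because the only input is a matrix of allele counts, the same sketch applies to any species with SNP data, which is the sense in which such a predictor remains transferable to non-model species.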