Genetic Variation

Alleles for 10,709,466 biallelic single nucleotide polymorphisms (SNPs) scored across 2029 Arabidopsis genotypes were retrieved from publicly available data (Arouisse et al., 2020). The genotypes are inbred lines made homozygous through selfing and single-seed descent, so allelic states can be coded as 0 (homozygous for the reference allele) or 1 (homozygous for the alternative allele), with no heterozygotes. We filtered the SNP data to remove SNPs with a missing call rate above 0.05 and rare variants with a minor allele frequency below 0.01. SNPs were then pruned for linkage disequilibrium using a window size of 500 kb, a variant step count of 100 and a pairwise linkage threshold of r² = 0.1, retaining 86,760 SNPs. All filtering and pruning were conducted in PLINK v1.90b6.10 (Purcell et al., 2007).
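The two quality filters described above can be illustrated with a minimal Python sketch. It is not the PLINK implementation used in the study: the function name filter_snps, the use of NaN for missing calls and the genotypes-by-SNPs array layout are assumptions made for illustration, and the linkage-based pruning step is not shown.

```python
import numpy as np

def filter_snps(G, max_missing=0.05, min_maf=0.01):
    """Filter SNP columns of a 0/1-coded genotype matrix.

    G is assumed to have shape (n_genotypes, n_snps), with 0 for the
    reference allele, 1 for the alternative allele and NaN for a missing call.
    Returns the filtered matrix and the boolean mask of retained SNPs.
    """
    G = np.asarray(G, dtype=float)
    called = ~np.isnan(G)

    # Fraction of genotypes with no call at each SNP.
    missing_rate = 1.0 - called.mean(axis=0)

    # With homozygous 0/1 coding, the mean of the called values is the
    # alternative allele frequency; guard against SNPs with no calls at all.
    n_called = called.sum(axis=0)
    alt_count = np.nansum(G, axis=0)
    alt_freq = np.divide(alt_count, n_called,
                         out=np.zeros_like(alt_count), where=n_called > 0)
    maf = np.minimum(alt_freq, 1.0 - alt_freq)

    # Keep SNPs that pass both thresholds.
    keep = (missing_rate <= max_missing) & (maf >= min_maf)
    return G[:, keep], keep
```

The defaults mirror the thresholds stated in the text (0.05 and 0.01); how boundary values are treated here is a choice of this sketch.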
Pruned SNPs were used to compute a genetic similarity matrix (GSM; Speed & Balding, 2015). The GSM is a square matrix whose entries measure pairwise similarity between individual genotypes. We compared several methods of constructing GSMs but found that the choice of method did not affect model performance and that a GSM rendered individual markers redundant as predictors (Appendix S2). Because using a precomputed GSM is computationally more practical than including numerous SNPs in each model run, we quantified genetic variation solely through an identity-by-state GSM. Identity-by-state was preferred because it can be computed for any pair of individuals, including novel ones.
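As an illustration of what an identity-by-state GSM measures under the 0/1 homozygous coding, the following Python sketch computes, for each pair of genotypes, the proportion of SNPs at which they carry the same allele. The exact identity-by-state definition and software used in the study are not stated here, so this proportion-of-matches definition, the function name ibs_similarity and the assumption of no missing calls are all illustrative.

```python
import numpy as np

def ibs_similarity(A, B=None):
    """Proportion of SNPs at which two genotypes carry the same allele.

    A and B are assumed to be 0/1-coded matrices of shape
    (n_genotypes, n_snps) over the same SNPs, with no missing calls.
    If B is omitted, the square matrix among the rows of A is returned.
    """
    A = np.asarray(A, dtype=float)
    B = A if B is None else np.asarray(B, dtype=float)
    n_snps = A.shape[1]

    # Count SNPs where both carry the alternative allele (1 and 1) ...
    shared_alt = A @ B.T
    # ... and SNPs where both carry the reference allele (0 and 0).
    shared_ref = (1.0 - A) @ (1.0 - B).T

    return (shared_alt + shared_ref) / n_snps
```

Calling ibs_similarity(X) gives a square, symmetric matrix with ones on the diagonal, whereas ibs_similarity(X_new, X) gives the similarity of novel genotypes to existing ones without recomputing anything for the existing set, which is the property noted above.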