SNP calling and filtering
No reference genome is available for ponderosa pine (Pinus
ponderosa), but one does exist for loblolly pine (Pinus taeda) (Neale et al. 2014; Zimin et al. 2014). Of the conifers
that have been sequenced to date, P. taeda is the most closely
related to P. ponderosa (Gernandt et al. 2009; Willyard et al. 2009). Furthermore, the P. taeda reference genome
was successfully used to design probes for sequence capture in P.
contorta (Suren et al. 2016; Yeaman et al. 2016), a
distant relative. Based on preliminary analyses, we selected the Stacks
v.2.2 pipeline (Rochette & Catchen 2017) with this reference genome
(https://treegenesdb.org/FTP/Genomes/Pita/) for SNP calling (Shu
2020). Every step of the reference-based Stacks pipeline is performed
by Stacks itself, except read alignment, for which we used BWA v.0.7.17 (Li
& Durbin 2009), and the Samtools v.1.9 (Li 2011) step used to obtain read
positions. Default settings were used in Stacks, BWA and Samtools.
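The order of the steps above can be sketched as a short Python helper that assembles the command lines for a reference-based run. This is an illustrative sketch only, not the commands actually run in this study: the file and directory names are hypothetical, the use of `bwa mem` (rather than another BWA algorithm) is an assumption, and all tool options other than input and output paths are left at their defaults.

```python
def build_commands(reference, fastq_files, bam_dir, popmap, out_dir):
    """Return shell command strings for a reference-based Stacks run.

    All paths are hypothetical placeholders; `bwa mem` is an assumed choice
    of BWA algorithm. Commands are only assembled here, not executed.
    """
    # Index the reference genome once so BWA can align against it
    commands = [f"bwa index {reference}"]
    for fastq in fastq_files:
        # Sample name = file name up to the first dot (e.g. s1.fq.gz -> s1)
        sample = fastq.split("/")[-1].split(".")[0]
        bam = f"{bam_dir}/{sample}.bam"
        # Align reads, then sort the alignments by position with Samtools
        commands.append(
            f"bwa mem {reference} {fastq} | samtools sort -o {bam} -"
        )
    # gstacks builds loci and calls SNPs from the sorted BAM files
    commands.append(f"gstacks -I {bam_dir} -M {popmap} -O {out_dir}")
    # populations exports the called SNPs, here in VCF format
    commands.append(f"populations -P {out_dir} -M {popmap} --vcf")
    return commands
```

Each returned string could be printed for inspection or passed to a shell; building the commands separately from running them keeps the sketch easy to check.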
After calling the SNPs, we ran SnpEff (Cingolani et al. 2012) to
determine the location of each SNP relative to annotated genes. We used the
genome annotation and the reference genome of loblolly pine
v.2.01 from TreeGenes
(http://treegenesdb.org/FTP/Genomes/Pita/v2.01/). SnpEff assigns
each SNP in its output file to one of six primary
location categories: intragenic variants, intergenic variants,
upstream SNPs, downstream SNPs, and synonymous and missense variants in the
gene coding sequence. In SnpEff, "intragenic" refers to SNPs in
introns, while "missense" refers to any non-synonymous mutation in the
transcribed region.
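As an illustration of how such categories can be tallied, the sketch below counts the first annotation term reported for each SNP in a SnpEff-annotated VCF file. It assumes the annotations are stored in the `ANN` INFO field, with pipe-separated subfields whose second entry is the effect term (for example `intergenic_region` or `missense_variant`); the function names are hypothetical and this is not the procedure used in this study.

```python
from collections import Counter


def primary_effect(info):
    """Return the effect term of the first ANN entry in a VCF INFO string.

    Returns None when the record carries no ANN annotation.
    """
    for field in info.split(";"):
        if field.startswith("ANN="):
            # Multiple annotations are comma-separated; take the first one
            first_entry = field[len("ANN="):].split(",")[0]
            parts = first_entry.split("|")
            if len(parts) > 1:
                # Combined terms are joined with '&'; keep the first term
                return parts[1].split("&")[0]
    return None


def count_effects(vcf_lines):
    """Count SNPs per effect term, skipping VCF header lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # not a complete VCF record
        counts[primary_effect(cols[7]) or "unannotated"] += 1
    return counts
```

Passing an open file handle, for example `count_effects(open("snps.ann.vcf"))` with a hypothetical file name, would return a `Counter` mapping each effect term to the number of SNPs.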
Many SNPs identified by GBS fall between genes and regulatory regions
(in the intergenic regions) and likely have no direct effect on gene
expression or function. In addition, because of the low amount of
linkage disequilibrium in conifers (Namroud et al. 2008; Isik et al. 2016), any associations identified between such intergenic
SNPs and a phenotype or environment of interest are likely false
positives rather than reflecting linkage between the SNP and a causal
variant. Therefore, we filtered out the intergenic SNPs before
running the association analysis using a Python script
(https://github.com/shumengjun/LFMM).
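The filtering step described above can be sketched as follows. This is a minimal illustration, not the script in the repository cited above: it assumes a SnpEff-annotated VCF file in which annotations are stored in the `ANN` INFO field and intergenic SNPs carry the term `intergenic_region`, and the function names are hypothetical. A SNP is dropped only when every annotation it carries is intergenic; records without any annotation are kept.

```python
INTERGENIC = "intergenic_region"  # assumed SnpEff term for intergenic SNPs


def annotation_terms(info):
    """Collect every effect term found in the ANN field of a VCF INFO string."""
    terms = set()
    for field in info.split(";"):
        if field.startswith("ANN="):
            for entry in field[len("ANN="):].split(","):
                parts = entry.split("|")
                if len(parts) > 1:
                    # Combined terms are joined with '&'
                    terms.update(parts[1].split("&"))
    return terms


def drop_intergenic(vcf_lines):
    """Yield header lines and all records that are not purely intergenic."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep the VCF header unchanged
            continue
        cols = line.rstrip("\n").split("\t")
        terms = annotation_terms(cols[7]) if len(cols) > 7 else set()
        if terms == {INTERGENIC}:
            continue  # every annotation is intergenic: drop the SNP
        yield line
```

Because `drop_intergenic` is a generator, the filtered lines can be written directly to a new file without loading the whole VCF into memory.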