Future work and similarity with other methods
Inbred or haploid genotypic datasets have substantial quality advantages
over heterozygous datasets at comparable sequencing depth. This study
used a minimum depth threshold of 5 for the P1 and P2 datasets. In
theory, this threshold allows 93.75% of truly heterozygous sites to be
called correctly, because all five reads sample the same allele with
probability 2 × 0.5^5 = 6.25% (assuming no amplification bias). In
practice, it caused ~80-90% of the raw data to be discarded (Table 1).
Inbred datasets allow depth thresholds to be relaxed or removed, which
retains much more data. In addition, summarizing heterozygosity by taxon
or by SNP in inbred datasets simplifies the removal of cross-contaminated
DNA samples and homeo-SNPs, respectively.
In this study, dual alignment of reads from interspecific hybrids to
both parental genomes (P1+P2) resulted in effectively inbred datasets
that enabled more rigorous quality control, displayed higher concordance
following downsampling, and provided more robust estimation of
population structure compared to standard alignment against a single
reference genome. Although this study used Beagle imputation for
purposes of comparing different alignment strategies, datasets resulting
from dual alignment could also be imputed using FSFHap, an imputation
method designed for inbred populations (Swarts et al., 2014), whereas P1
and P2 datasets could not. The practical conclusion of this study is
that dual alignment allows interspecific hybrids to be genotyped and
imputed as efficiently and inexpensively as inbreds.
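
The 93.75% figure quoted above for a depth threshold of 5 can be reproduced with a short calculation. The sketch below is illustrative only and is not part of the study's pipeline. It assumes that each read samples either allele of a heterozygous site with equal probability (no amplification bias) and that sequencing error is ignored. The function name `het_detection_rate` is hypothetical.

```python
def het_detection_rate(depth: int) -> float:
    """Probability that both alleles of a truly heterozygous site are
    observed at least once among `depth` reads.

    Assumption: each read carries either allele with probability 0.5
    (no amplification bias) and sequencing error is ignored.
    """
    if depth < 1:
        return 0.0
    # The site is miscalled as homozygous only when every read carries
    # the same allele; there are two such outcomes, each with
    # probability 0.5 ** depth.
    return 1.0 - 2.0 * 0.5 ** depth


if __name__ == "__main__":
    for depth in (1, 2, 5, 10):
        print(f"depth {depth:>2}: {het_detection_rate(depth):.4%}")
```

Under these assumptions a single read can never reveal a heterozygote, and the detection rate rises to 93.75% at a depth of 5, the threshold used for the P1 and P2 datasets.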
The divergence time between parental genomes in this study is estimated
at 38 million years for Pistacia (P. atlantica vs P. integerrima)
(Xie et al., 2014) and 45 million years for Juglans (J. microcarpa vs
J. regia) (Stevens et al., 2018). This study
used 90 bp Illumina reads trimmed to 64 bp for speedier processing
through the TASSEL GBS pipeline (Glaubitz et al., 2014), of which 65%
and 76% mapped uniquely to the Pistacia and Juglans P1+P2
genomes, respectively. Longer reads, which are more likely to span
sequence differences that distinguish the parental genomes, could extend
this strategy to hybrids with lower divergence, and perhaps even to
hybrids between heterotic groups within a species. Alternatively,
strategies that make
use of a “pan-genome”, including the Practical Haplotype Graph
(Bradbury et al., 2022), may achieve a similar result by including
enough representative reference contigs to ensure that all reads align
to a homologous (non-homeologous) sequence. The strategy described here
could also be applied to transcriptome data of hybrids to investigate
allele-specific or species-specific patterns of expression and
co-expression.
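
As a rough illustration of how dual alignment might be carried over to expression data, the sketch below tallies uniquely mapped reads by parental genome for each pair of orthologous genes. It is a minimal sketch under stated assumptions, not a method from this study. It assumes that each read has already been assigned to a gene and to the parental genome ("P1" or "P2") it aligned to, and that a lookup table linking genes to ortholog groups is available. The function `parental_read_fractions` and all identifiers in the example are hypothetical.

```python
from collections import Counter


def parental_read_fractions(alignments, ortholog_of):
    """Return, for each ortholog group, the fraction of reads that
    aligned to the P1 copy rather than the P2 copy.

    alignments  : iterable of (gene_id, parent) pairs for uniquely
                  mapped reads, where parent is "P1" or "P2"
                  (hypothetical input format).
    ortholog_of : dict mapping gene_id to an ortholog group id
                  (hypothetical lookup table).
    """
    counts = Counter()
    for gene_id, parent in alignments:
        group = ortholog_of.get(gene_id)
        if group is None:
            continue  # skip genes with no known ortholog in the other parent
        counts[(group, parent)] += 1

    fractions = {}
    for group in {g for g, _ in counts}:
        p1 = counts[(group, "P1")]
        p2 = counts[(group, "P2")]
        fractions[group] = p1 / (p1 + p2)
    return fractions


if __name__ == "__main__":
    reads = [("a1", "P1"), ("a1", "P1"), ("a2", "P2"), ("b1", "P1")]
    orthologs = {"a1": "A", "a2": "A", "b1": "B"}
    print(parental_read_fractions(reads, orthologs))
```

A fraction near 0.5 would suggest balanced expression of the two parental copies, whereas values near 0 or 1 would point to species-specific expression. A real analysis would also need to account for mapping bias and differences in gene length between the parental genomes.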