When validated SV call sets are lacking, we recommend using a generalist SV discovery program that combines multiple SV discovery algorithms to target a range of SVs at once. These generalist programs combine assembly-based methods with read-depth, read-pair and/or split-read approaches (e.g., GRIDSS, Cameron et al., 2017; Manta, Chen et al., 2016). They not only perform well on their own, but also perform well over a range of SV types (Cameron et al., 2019; Kosugi et al., 2019). A further benefit of generalist programs is that they overcome a challenge faced by many ensemble methods, where distinguishing true variants from false positives is difficult because false positives overlap substantially across methods (Cameron et al., 2019). However, conservation geneticists should be aware that the computational resources required to characterize SVs at the population scale using generalist SV discovery programs are substantial (e.g., for a diploid 1.15 Gb genome with 170 individuals sequenced to ≥25x coverage, 72 physical cores, 460 GB RAM and >3 TB storage were required to implement paired-read, split-read and assembly-based SV discovery algorithms; JRW personal observation).

Multiple reference genomes improve genome-wide structural variant discovery and genotyping

A pangenome is the aggregate characterization of genomic variation present in a group of interest, such as a species or population (e.g., variation between strains of tomato, Alonge et al., 2020). Pangenomes offer a straightforward solution to the challenges associated with SV discovery and genotyping with short-read sequence data. Although originally developed to characterize variation in bacteria (Tettelin et al., 2005), they are commonly used in studies of trait diversity in humans (Pang et al., 2010) and agriculturally significant species (e.g., cattle, goats, soybean, and maize; D. M. Bickhart et al., 2020; Della Coletta, Qiu, Ou, Hufford, & Hirsch, 2021; Golicz et al., 2016; Liu et al., 2020; Low et al., 2020; McHale et al., 2012; Yang et al., 2019). A pangenome has two components: the ‘core’ genomic regions that do not vary among individuals, and the ‘accessory’ genomic regions that do (Bayer, Golicz, Scheben, Batley, & Edwards, 2020; Golicz, Batley, & Edwards, 2016; Hurgobin & Edwards, 2017; Figure 4). In a pangenomic approach, genomes of multiple individuals are assembled de novo using multiple platforms (e.g., long reads, Hi-C, optical mapping; Song et al., 2020; Soto et al., 2020; Weissensteiner et al., 2020; Zhou et al., 2019), followed by pairwise comparisons of whole-genome alignments for SNP and SV discovery (e.g., Cortex, MUMmer, Minimap2; Delcher et al., 1999; Iqbal, Caccamo, Turner, Flicek, & McVean, 2012; H. Li, 2018). Once variant discovery is complete, genome graphs may be constructed to efficiently represent the ‘core’ and ‘accessory’ regions of the pangenome (Eizenga et al., 2020; H. Li, 2018; Rakocevic et al., 2019; Tettelin et al., 2005), which are encompassed in analyses of copy number variation (CNV) and presence-absence variation (PAV) (Della Coletta et al., 2021).
Genome graphs are a powerful method for population-level genotyping and consistently outperform alignment-based genotyping (e.g., Ebler et al., 2020 preprint; Eggertsson, 2017; Iqbal et al., 2012; D. Kim, Paggi, Park, Bennett, & Salzberg, 2019; H. Li, 2018). As a result, pangenomic approaches can resolve complex variants (i.e., multiple overlapping events) that may otherwise go undetected by alignment-based approaches, an omission that hampers the discovery of causal variants (Alonge et al., 2020; McHale et al., 2012).
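
As a minimal illustration of the ‘core’ and ‘accessory’ partition described above, the following Python sketch classifies genomic regions from a presence-absence table. The individual names, region identifiers and the function partition_pangenome are hypothetical and introduced only for this example. It assumes that presence-absence calls have already been obtained (e.g., from whole-genome alignments) and applies the strict definition given here, in which a region is ‘core’ only if it is present in every individual.

```python
from typing import Dict, List, Set


def partition_pangenome(presence: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Split regions into 'core' and 'accessory' sets.

    presence maps each individual to the set of region identifiers
    detected in its assembly (hypothetical input format).
    """
    if not presence:
        return {"core": [], "accessory": []}
    per_individual = list(presence.values())
    # Every region observed in at least one individual.
    all_regions = set().union(*per_individual)
    # Regions observed in every individual (strict core definition).
    core = set.intersection(*per_individual)
    return {"core": sorted(core), "accessory": sorted(all_regions - core)}


# Hypothetical toy data: three individuals, five regions.
presence = {
    "ind_A": {"r1", "r2", "r3"},
    "ind_B": {"r1", "r2", "r4"},
    "ind_C": {"r1", "r2", "r3", "r5"},
}
result = partition_pangenome(presence)
print(result)
```

In this toy example, regions r1 and r2 are shared by all three individuals and are therefore ‘core’, whereas r3, r4 and r5 are ‘accessory’. A real analysis would need to account for assembly gaps and genotyping error before treating an absent region as true presence-absence variation.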