SV discovery and genotyping beyond short-read sequence data

Short-read data are a useful starting point for identifying some SVs. However, short read lengths bias both the type and size of SVs that can be readily detected, leaving many larger and/or complex SVs undiscovered (Figure 3; also see Ho et al., 2020). When characterizing genomic features, there are many sequencing platforms and approaches to choose from, and although each may perform well when addressing specific challenges, each has its own caveats (Table 1).
Two providers feature prominently in long-read sequencing: Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). Since the launch of these technologies in 2011 and 2014, respectively, the long-read sequencing space has been characterized by fast-paced progress and innovation, as demonstrated by the first telomere-to-telomere assembly of the human X chromosome, achieved with ultra-long-read sequencing (Miga et al., 2020). The relative error rates of these two technologies remain somewhat contentious (Dohm, Peters, Stralis-Pavese, & Himmelbauer, 2020; Lang et al., 2020 preprint), but as a general rule, ONT currently provides longer average read lengths than PacBio (Logsdon, Vollger, & Eichler, 2020) at the cost of higher sequence error rates. Despite these challenges, the ability of long-read sequencing technologies to span a significant portion, if not the entire length, of complex regions of the genome in a single read provides a powerful tool for SV discovery and population-level genotyping. When used in conjunction with a high-quality, contiguous, well-annotated reference genome, long reads improve confidence in read mapping genome-wide (see Amarasinghe et al., 2020 for review) and substantially increase precision (the proportion of variant calls that are ‘true’) and recall (the proportion of ‘true’ SVs detected) for both SNPs and SVs (Wenger et al., 2019). In addition, platforms that directly sequence native DNA remove the amplification bias common in many short-read sequencing approaches (Depledge et al., 2019). Furthermore, there are emerging ‘adaptive’ sequencing approaches that have the potential to selectively sequence specific regions of the genome (Payne et al., 2020 preprint). It remains to be seen, however, whether this technology is ready for wide use beyond human clinical applications.
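The precision and recall definitions given above can be made concrete with a minimal Python sketch. This is not the benchmarking procedure of any cited study. The `SV` record, the `matches` rule (same chromosome and SV type, with both breakpoints within a tolerance), and the 500 bp default tolerance are illustrative assumptions, and the coordinates are invented for the example.

```python
from typing import NamedTuple, Sequence, Tuple


class SV(NamedTuple):
    """Hypothetical minimal SV record (not a real VCF parser)."""
    chrom: str
    start: int
    end: int
    svtype: str  # e.g. "DEL", "INS", "INV"


def matches(call: SV, truth: SV, tol: int = 500) -> bool:
    """Assumed matching rule: same chromosome and type, and both
    breakpoints within `tol` bp of the truth-set breakpoints."""
    return (
        call.chrom == truth.chrom
        and call.svtype == truth.svtype
        and abs(call.start - truth.start) <= tol
        and abs(call.end - truth.end) <= tol
    )


def precision_recall(
    calls: Sequence[SV], truth: Sequence[SV], tol: int = 500
) -> Tuple[float, float]:
    """Precision = true calls / all calls; recall = true SVs found / all true SVs.
    Each truth SV can be matched by at most one call."""
    matched = set()
    true_positives = 0
    for call in calls:
        for i, t in enumerate(truth):
            if i not in matched and matches(call, t, tol):
                matched.add(i)
                true_positives += 1
                break
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Invented toy data: four 'true' SVs, three calls, two of which match.
truth_set = [
    SV("chr1", 10_000, 15_000, "DEL"),
    SV("chr1", 80_000, 120_000, "INV"),
    SV("chr2", 5_000, 5_300, "INS"),
    SV("chr3", 200_000, 260_000, "DEL"),
]
call_set = [
    SV("chr1", 10_120, 14_950, "DEL"),   # matches the first truth SV
    SV("chr1", 80_300, 119_800, "INV"),  # matches the second truth SV
    SV("chr2", 40_000, 41_000, "DEL"),   # false positive
]

precision, recall = precision_recall(call_set, truth_set)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In practice the matching rule matters as much as the counts: a looser breakpoint tolerance raises both measures, so values are only comparable across studies when the same rule is applied.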
Structural variants significantly alter genome topology and impact the gene regulatory landscape (Sadowski et al., 2019; Shanta et al., 2020). In light of these impacts, the hierarchical organisation of DNA within the nucleus is of particular interest when investigating how SVs relate to mechanisms of transcriptional regulation. Chromatin conformation capture (3C)-based sequencing approaches enable the investigation of chromatin organisation genome-wide (see Kong & Zhang, 2019 for review) and have identified chromatin signatures associated with gene expression (Lieberman-Aiden et al., 2009; Lupiáñez et al., 2015; Shanta et al., 2020). In addition, there are emerging Nanopore sequencing methods that integrate chromatin conformation capture with long-read sequencing (i.e., Pore-C; Ulahannan et al., 2019 preprint). Unlike short-read library preparation, which introduces amplification bias, long-read sequencing enables contacts to be sequenced without amplification and provides data on chromatin contacts at a range of distances along the linear genome. However, long-read sequence data alone cannot consistently resolve whole chromosomes (Belser et al., 2018). Optical mapping approaches are a useful complement to long-read sequencing, and have enhanced genome assembly outcomes by providing insights into the ‘big picture’ of large-scale genomic variants (as per Weissensteiner et al., 2020). Optical mapping uses light microscopy to image fluorescently labeled DNA molecules and locate specific sequence motifs (such as restriction enzyme cut sites) along them (Schwartz et al., 1993), enabling the characterization of large, complex rearrangements missed by long reads alone (Yuan, Chung, & Chan, 2020). On average, optical maps span ~225 kb, providing information on the physical distance and relationship among genomic features.
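To illustrate how the spacing of labeled motifs carries information about physical distance, the following Python sketch builds a toy in silico label map and compares inter-label distances between two sequences. It is a conceptual illustration only, not the algorithm of any optical mapping platform. The motif, the sequences, and the function names are hypothetical, and real optical maps work at a far coarser scale than this base-pair-level toy.

```python
from typing import List


def label_positions(seq: str, motif: str) -> List[int]:
    """Return the 0-based start position of every occurrence of `motif` in `seq`."""
    positions = []
    i = seq.find(motif)
    while i != -1:
        positions.append(i)
        i = seq.find(motif, i + 1)
    return positions


def inter_label_distances(positions: List[int]) -> List[int]:
    """Distances between consecutive labels: the information an optical map provides."""
    return [b - a for a, b in zip(positions, positions[1:])]


MOTIF = "CTTAAG"  # assumed label motif, chosen for illustration only

# Toy reference: three labels separated by unlabeled sequence.
reference = "A" * 10 + MOTIF + "C" * 20 + MOTIF + "G" * 15 + MOTIF

# Toy sample: identical, except for 30 inserted bases between the first two labels.
sample = "A" * 10 + MOTIF + "C" * 20 + "T" * 30 + MOTIF + "G" * 15 + MOTIF

ref_dist = inter_label_distances(label_positions(reference, MOTIF))
sam_dist = inter_label_distances(label_positions(sample, MOTIF))

# A positive difference between matching intervals suggests an insertion
# in the sample relative to the reference; a negative one, a deletion.
differences = [s - r for r, s in zip(ref_dist, sam_dist)]
print(ref_dist, sam_dist, differences)
```

The label pattern reveals that an interval has grown by 30 bases and where it lies, but not what the inserted sequence is.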
Besides being used to improve the scaffolding of genome assemblies (Howe & Wood, 2015; Zhang, 2015), including those of endangered species (S. Li et al., 2014), optical mapping methods directly enable the identification of both intraspecific and interspecific SVs (Levy-Sakin et al., 2019; Zhihai et al., 2016). The primary current commercial provider of optical mapping technology is Bionano Genomics, whose Saphyr instrument uses a nano-channel microfluidic chip to linearise and image fluorescently labeled ultra-long DNA fragments, generating optical maps at a resolution of 500 bp (Yuan et al., 2020). While optical maps provide information on the physical topology of chromosomes, they do not provide the sequence of an allele. Because long reads and optical maps complement each other, the ideal data set for SV discovery would include both data types (e.g., Soto et al., 2020; Weissensteiner et al., 2020).