SV discovery and genotyping beyond short-read sequence data
Short-read data are a useful starting point for identifying some SVs.
However, short read lengths bias the types and sizes of SVs that can be
readily detected, leaving many larger and/or complex SVs undiscovered
(Figure 3; also see Ho et al., 2020).
When characterizing genomic features, there are many sequencing
platforms and approaches to choose from, and although they may perform
well when addressing specific challenges, each has its own caveats
(Table 1).
Two providers feature prominently in long-read sequencing: Pacific
Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). Since the
launch of these technologies in 2011 and 2014, respectively, the
long-read sequencing space has been characterized by fast-paced progress
and innovation, as demonstrated by the first telomere-to-telomere
assembly of the human X chromosome, achieved with ultra-long-read
sequencing (Miga et al., 2020).
The relative error rates of these two technologies remain somewhat
contentious (Dohm, Peters, Stralis-Pavese, & Himmelbauer, 2020; Lang et
al., 2020 preprint), but as a general rule, ONT currently provides
longer average read lengths than PacBio (Logsdon, Vollger, & Eichler,
2020) at the cost of higher sequence error rates. Despite these
challenges, the ability of long-read sequencing technologies to span a
significant portion, if not the entire length, of complex regions of
the genome in a single read provides a powerful tool for SV discovery
and population-level genotyping. When used in conjunction with a
high-quality, contiguous, well-annotated reference genome, long reads
improve confidence in read mapping genome-wide (see Amarasinghe et al.,
2020 for review) and substantially increase precision (the proportion
of variant calls that are ‘true’) and recall (the proportion of ‘true’
SVs detected) for both SNPs and SVs (Wenger et al., 2019). In addition,
platforms that directly sequence native DNA remove
the amplification bias common in many short-read sequencing approaches
(Depledge et al.,
2019). Furthermore, emerging ‘adaptive’ sequencing approaches have the
potential to selectively sequence specific regions of the genome (Payne
et al., 2020 preprint). It remains to be seen, however, whether this
technology is ready for wide use beyond human clinical applications.
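The definitions of precision and recall given above can be made concrete with a minimal sketch. It assumes that a call matches a 'true' SV when chromosome and SV type agree and both breakpoints fall within a fixed tolerance; the `SV` record, the `matches` helper, the 50 bp tolerance and the example coordinates are all hypothetical choices for illustration, not a benchmarking standard or the method used in any of the cited studies.

```python
from typing import NamedTuple


class SV(NamedTuple):
    """Hypothetical minimal SV record used only for this illustration."""
    chrom: str
    start: int
    end: int
    svtype: str


def matches(call, truth, tolerance=50):
    """Assumed matching rule: same chromosome and type, and both
    breakpoints within `tolerance` bp of the truth breakpoints."""
    return (
        call.chrom == truth.chrom
        and call.svtype == truth.svtype
        and abs(call.start - truth.start) <= tolerance
        and abs(call.end - truth.end) <= tolerance
    )


def precision_recall(calls, truth_set, tolerance=50):
    """Precision: proportion of calls that match a true SV.
    Recall: proportion of true SVs matched by at least one call."""
    true_calls = sum(
        any(matches(c, t, tolerance) for t in truth_set) for c in calls
    )
    found_truth = sum(
        any(matches(c, t, tolerance) for c in calls) for t in truth_set
    )
    precision = true_calls / len(calls) if calls else 0.0
    recall = found_truth / len(truth_set) if truth_set else 0.0
    return precision, recall


# Invented example data: four 'true' SVs and three calls.
truth_set = [
    SV("chr1", 1000, 2000, "DEL"),
    SV("chr1", 5000, 5600, "DUP"),
    SV("chr2", 300, 900, "INV"),
    SV("chr2", 7000, 7400, "DEL"),
]
calls = [
    SV("chr1", 1010, 1990, "DEL"),  # matches the first true SV
    SV("chr2", 310, 880, "INV"),    # matches the third true SV
    SV("chr3", 100, 400, "DEL"),    # no true SV on chr3: false positive
]

precision, recall = precision_recall(calls, truth_set)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this toy data set two of the three calls are 'true' and two of the four 'true' SVs are detected, so the same call set can look quite different depending on which of the two measures is reported.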
Structural variants significantly alter genome topology and impact the
gene regulatory landscape
(Sadowski et al.,
2019; Shanta et al., 2020). In light of these impacts, the hierarchical
organisation of DNA within the nucleus is of particular interest when
investigating mechanisms of transcriptional regulation.
Chromatin conformation capture (3C) based sequencing approaches enable
the investigation of chromatin organisation genome-wide (see Kong &
Zhang, 2019 for review) and have identified the chromatin signature in
gene expression (Lieberman-Aiden et al., 2009; Lupiáñez et al., 2015;
Shanta et al., 2020). In addition, there are emerging advancements in
Nanopore sequencing methods that integrate chromatin conformation
capture with long-read sequencing (i.e., Pore-C; Ulahannan et al., 2019
preprint). Unlike short-read library preparation, which introduces
amplification bias, long-read sequencing provides data on chromatin
contacts at a range of distances along the linear genome and enables
those contacts to be sequenced without amplification.
However, long-read sequence data alone cannot consistently resolve whole
chromosomes (Belser et
al., 2018). Optical mapping approaches are a useful complement to
long-read sequencing approaches, and have enhanced genome assembly
outcomes by providing insights into the ‘big picture’ of large-scale
genomic variants (as
per Weissensteiner et al., 2020). Optical mapping uses light microscopy
to image fluorescently-labeled DNA molecules and locate specific
sequence motifs (such as restriction enzyme cut sites) along them
(Schwartz et al., 1993), enabling the characterization of large, complex
rearrangements missed by long reads alone (Yuan, Chung, & Chan, 2020).
On average, optical maps span ~225 kb, providing
information on the physical distance and relationship among genomic
features. Besides being used to improve the scaffolding of genome
assemblies (Howe &
Wood, 2015; Zhang, 2015), including those of endangered species
(S. Li et al., 2014),
optical mapping methods directly enable the identification of both
intraspecific and interspecific SVs
(Levy-Sakin et al.,
2019; Zhihai et al., 2016). The primary commercial provider of optical
mapping technology is currently Bionano Genomics, whose Saphyr
instrument uses a nano-channel microfluidic chip to linearise and
capture images of fluorescently-labeled ultra-long DNA fragments,
generating optical maps at a resolution of 500 bp (Yuan et al., 2020).
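The idea that an optical map records the physical distance between motif sites, and that an SV shows up as a change in those distances, can be illustrated with a toy in silico sketch. It assumes that label sites can be approximated by exact matches to a motif in a known sequence; the function names, the toy sequences and the choice of the EcoRI recognition site (GAATTC) as the motif are illustrative assumptions, not a description of any platform's labeling chemistry or software. Real optical maps are measured from images, with sizing error and limited resolution, so a difference as small as the one shown here would not be detectable in practice.

```python
def label_positions(sequence, motif):
    """Return the 0-based start position of every occurrence of `motif`."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def label_intervals(positions):
    """Distances between consecutive label sites: the 'map' pattern."""
    return [b - a for a, b in zip(positions, positions[1:])]


MOTIF = "GAATTC"  # EcoRI site, used here purely as an example motif

# Toy reference sequence with three motif sites.
reference = "AA" + MOTIF + "TTTT" + MOTIF + "CCCCCC" + MOTIF
# Toy sample carrying a 5 bp insertion between the first two sites.
sample = "AA" + MOTIF + "TTTT" + "AAAAA" + MOTIF + "CCCCCC" + MOTIF

ref_intervals = label_intervals(label_positions(reference, MOTIF))
sample_intervals = label_intervals(label_positions(sample, MOTIF))

# A non-zero difference flags an interval whose length has changed.
differences = [s - r for r, s in zip(ref_intervals, sample_intervals)]
print(ref_intervals, sample_intervals, differences)
```

Here the first interval is longer in the sample than in the reference while the second is unchanged, which localises the inserted sequence to the region between the first two label sites without revealing what that sequence is.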
While optical maps provide information on the physical topology of
chromosomes, they do not provide sequence information on an allele.
Because long reads and optical maps complement each other, the ideal
data set for SV discovery would include both data types (e.g., Soto et
al., 2020; Weissensteiner et al., 2020).