Main Body
Introduction
Next-generation sequencing (NGS ) technology has fundamentally
altered how scientists identify the genetic basis of disease. Rather
than sequencing and interrogating one gene at a time, clinicians and
researchers now routinely sequence at a genome-wide level and, in turn,
face a new problem: isolating the critical disease-causing, pathogenic
variants from millions of others. While most variants are easily
excluded upon the basis of frequency and gene biology, there are many
variants of unknown/uncertain significance (VUS ) for which it
is not evident whether a variant is pathogenic or benign.
The term VUS is an exceptionally loaded with definitions and guidelines
coming from the American College of Medical Genetics (ACMG )
(Richards et al., 2018), Association for Molecular Pathology
(AMP ) (M. M. Li et al., 2017), and International Agency for
Research on Cancer (IARC ) (Plon et al., 2008). VUS have been
the subject of entire special issues of this journal, Human
Mutation , such as in 2008. The precise meaning of VUS depends upon
context. For example, in clinical testing entire variant classes (such
as synonymous variants) are presumed likely benign at baseline, ensuring
VUS are less frequently reported. Still, VUS can take a significant
psychological and emotional toll on patients and their families when
reported and may lead to confusion and clinical mismanagement.
Importantly, as we will discuss, VUS affect minority populations
disproportionately (Culver et al., 2013; Gelfman et al., 2017; Maurano
et al., 2012; Park et al., 2018; Qian et al., 2021; Vaz-Drago et al.,
2017). Therefore, VUS are an issue with far-reaching impact, both in
research and clinical reporting. At a simple level, the sheer numbers of
intronic, synonymous, and other deep variants contribute to their
overall predictive value as a variant class. Additionally, it is
well-recognized that many important disease-causing variants fall into
these neglected “opaque” areas of variant interpretation.
There is an overwhelming need for better tools capable of elucidating
the functional consequences of rare molecular events and, thereby,
illuminating how genetic variation impacts biology and disease
pathogenesis. Analysis of DNA-level information alone in the context of
clinical phenotypes is unlikely to clarify this information given rare
events’ limited sample size and lack of statistical power. Compared to
genomic analysis – which outlines the repertoire of functions encoded
in DNA that a cell could perform – transcriptomic analysis
enables a deeper understanding of how a cell
is actually functioning. In other words, RNA expression profiling
provides critical insight into how genotype translates into phenotype
(Mantione et al., 2014).
Here we will highlight how gene expression profiling can be used to
reveal the presence of deep intronic and splicing variants. We will
review the emerging role of RNA sequencing as a tool to differentiate
VUS as pathogenic vs. benign, especially in populations of diverse
races, ethnicities, and genetic ancestries as individuals of
non-European ancestry have significantly higher rates of VUS (Petrovski
& Goldstein, 2016). Finally, we will explore the ways in which
transcriptomic analysis can be translated into clinical diagnostics for
cancer (i.e., detection of causative variants, both germline and
somatic) and how it can be applied to address cancer disparities.
Leveraging the
transcriptome
The term “RNA-seq” was coined in 2008 and, around this same time,
several landmark studies hinted toward the potential diagnostic utility
of RNA sequencing data using short-read NGS. One major effort was the
ENCODE Project Consortium, which stated: “to understand the human
genome… and the ways in which its defects can give rise to
disease, we need a more transparent view of the information itencodes ” (Birney et al., 2007). An understanding of human
disease necessitates an understanding of the link between genomic
variation and transcription; this was the goal of the Genotype-Tissue
Expression (GTEx ) project started in 2010 (Consortium, 2013).
Analogous to the 1000 Genomes Project, which aimed to provide a
catalogue of common variation, GTEx aimed to catalogue genotype and
gene-expression correlations across dozens of tissues and hundreds
(eventually thousands) of individuals.
An improved ability to predict the consequences of genetic variation on
transcription looms as one of the largest opportunities and biggest
challenges for addressing VUS. RNA analysis allows for the appreciation
of phenomena like differential expression, allele-specific expression,
alternative splicing, isoform switching, and many other events that
would not be readily apparent in DNA data. Current approaches are
largely limited to annotation of variant classes that are both rare and
predictable for transcript disruption, such as nonsense mutations that
lead to loss of transcript expression. More common variant classes are
confounded by their higher prevalence and lower accuracy in predicting
deleterious impact to gene expression.
Germline variants and somatic mutations can impact RNA species in a
variety of ways, and they are often grouped into categories according to
functional impact (i.e. High, Moderate, Low/Modifier) using annotation
ontology software such as ANNOVAR (Wang et al., 2010; Yang &
Wang, 2015) and SnpEff (Cingolani et al., 2012). Variants falling
into the High functional impact category include nonsense
variants introducing premature stop-codons, stop-loss, 5’ untranslated
region (UTR ) premature start codons, and start-loss variants.
Many of these lead to lowered transcription, but also full removal of
transcripts through nonsense-mediated decay (NMD ) (Chakravorty
& Hegde, 2018). Another relatively rare variant class includes
splice-site disrupting variants within the +/-2 base pair (bp )
region of the exon-intron boundaries, which can impact splicing and
result in exon skipping, exon usage, mutually exclusive/inclusive exons,
and/or intron retention (Woolfe et al., 2010). Variants within theModerate category typically are predicted to impact amino acid
composition, including missense amino changes from non-synonymous
variants and in-frame insertions/deletions.
The remainder of ontology variants are more frequently found per genome
and are often pragmatically filtered out due to their low predictive
value. These include synonymous variants, regulatory region variants,
3-prime (3’) UTR variants, 5-prime (5’) UTR variants, and intronic
variants. It is essential to highlight splice region variants proximal
to exon boundaries. Indeed, as shown in Figure 1A ,
variant alleles and exonic position well beyond the 2bp exon
donor/acceptor window impact function. Variants annotated as missense
and synonymous can alter splicing, exposing so-called cryptic
splice-sites and splice enhancing or silencing elements (Woolfe et al.,
2010). Moreover, deeper within the intron, variants as far out as 18 bp
can impact spliceosome binding at a key motif referred to as the lariat
junction.
Many of these low impact variants are found in higher numbers per genome
and any given variant has a low probability of being functional. Still,
there are numerous examples whereby variants drive disease via cryptic
splice-sites (ex: HBB , IVS2+705T>G for
β-thalasamia7), UTRs (ex: GJB1 in Charcot-Marie-Tooth
neuropathy), and other deep intronic variants (ex: COL4A5IVS6+1873G>A in Alport Syndrome) (Dobkin et al., 1983; King
et al., 2002). Frequently, the mechanism of impact is evident in
transcriptome data. Indeed, novel RNA functions and phenomena have been
elucidated that extend far beyond the traditional protein-coding role
described by the “central dogma.” We are also beginning to better
understand the ways in which messenger RNA (mRNA ) can be
processed – chiefly by alternative splicing, RNA editing, and crosstalk
between the two processes – to create the great variety that
characterizes the human transcriptome (Tang et al., 2020). Examples of
unique phenomena into which gene expression data provides insight, and
which have significant disease implications, include splicing mutations,
deep intronic truncating mutations leading to NMD and loss of mRNA, and
allele-specific expression (ASE ).
Aberrant Splicing Mutations Beyond
2bp
Variants disrupting RNA splicing occur throughout the gene and, without
RNA-based functional evidence, can be annotated into less obvious
categories including as synonymous, intronic, and UTR variants.
Functionally, splicing occurs via a complex of proteins and small
nuclear RNAs called the spliceosome that form complementary RNA-RNA
complexes with target RNAs (Anna & Monika, 2018). The spliceosome
catalyzes the splicing of pre-mRNA into mRNA and circular lariat RNAs,
the latter of which are destroyed (Talhouarne & Gall, 2018). Variants
at the canonical boundary motifs between exons and introns are GT(at the 5’ end) and AG (at the 3’ end; Figure 1A )
are generally prioritized but, as shown in this figure, variants
upstream and downstream can disrupt function at a lower
certainty/frequency (Anna & Monika, 2018). Adding to the mechanistic
complexity is the existence of a variety of spliceosomes (“major” and
“minor”) as well as noncanonical — aka “cryptic” or “pseudo” —
splice site sequences (e.g. GC-AG and AT-AC ). Which splice
site is used largely depends upon the presence of cis-acting regulatory
elements (splicing enhancers vs. suppressors), as well as the physical
structures of a branch site and polypyrimidine tract (these sequences
bind spliceosome proteins) (Anna & Monika, 2018).
In order to minimize false positive rate, candidate variant lists
derived from DNA sequencing typically only take into considerationcanonical changes to boundary motifs (as mentioned earlier:
5’-GT and 3’-AG ) or variants that lie within 100 base
pairs of canonical splice sites (that is to say not deep intronic
variants). This approach is pragmatic, as these variants are both highly
penetrant and rare leading to overall high precision (or positive
predictive value). Without additional functional data, the large number
of variants further down- or up-stream lead to overall low precision.
RNA-seq provides a direct functional measure of splicing, allowing for
fewer variants, with higher inherent accuracy and precision, to be
considered. Furthermore, it provides a direct measure of spliced events.
As shown in Figure 1B , there are at least 7 classes of
splicing events including: (1) exon skipping, (2) mutually-exclusive
exons (i.e. a coordinated set of splicing events where, as a result,
only one of two exon is retained), (3) exon scrambling, (4) intron
retention, (5) alternative promoters and terminators, (6) alternative
donor sites and (7) alternative acceptor sites (aka alternative 5′ and
3′ splice sites). It also allows for consideration of variants that are
typically filtered out including synonymous variants, UTR variants, and
so by supplementation with transcriptomic data. (Chen & Weiss, 2015;
Pohl et al., 2013; Shi et al., 2018)
The emergence of RNA-seq in conjunction with germline DNA sequencing has
led to optimization, development, and training of new bioinformatic
tools for splice-site prediction and splice-site detection. Splice-site
prediction focuses on using advanced algorithms and artificial
intelligence (AI ) methods to predict from DNA variants alone
whether a variant impact splicing. Pragmatically, these approaches
suffer from the sheer number of variants and have higher utility when
RNA-seq data is also readily available, reducing the overall search
space. A useful metric when it comes to assessing alternative splicing
events is “PSI ” or percent spliced in; also known as the
exon-inclusion ratio, PSI indicates how often a given exon occurs in all
isoforms of the gene that contains said exon (Tanner et al., 2021).
Splice variant prediction tools typically employ one of two strategies:
adaptive models or random forest analysis. The essential difference
between these two methods is whether one uses large databases from which
non-pathogenic vs. pathogenic splice variants can be trained. Methods
like those employed by Liu et al (Liu et al., 2016) utilize random
forests to generate probability scores that can be used to estimate the
potential of a 3-12bp window within the lariat RNA junction.DeepSplice is an example of a tool that utilizes deep
convolutional neural networks, as well as paired events to reduce false
positives (Zhang et al., 2018). SpliceAI uses a deep neural
network to model mRNA splicing from noncoding sites, yielding a 10%
rate of pathogenic variant discovery in neurodevelopmental disorders
(Jaganathan et al., 2019; Sanders et al., 2020). These are just a few
emerging algorithms and, in practice, databases serve as vehicles for
querying calculated predictive scores, such as with dbNSFP (Liu
et al., 2020).
Other informatic algorithms directly utilize RNA-seq data in their
analysis, such as LeafCutterMD (Jenkinson et al., 2020) andFRASER (Find RAre Splicing Events in RNA-seq (Mertes et al.,
2021). SpliceSeq is an example of a tool that utilizes “splice
graph” models (Ryan et al., 2012). Tools like SpliceV facilitate
discovery and visualization of splicing events (Ungerleider &
Flemington, 2019). Underlying these approaches lies layers of additional
nuance with variation in alignment strategies (such as variant-aware
aligners) and assembly (Hong et al., 2018). As methods rapidly evolve,
implemented approaches vary substantially across and within labs, which
can adversely impact database resources. Still, it is worth noting that
such splicing analysis not only has diagnostic but also therapeutic
utility; for instance, tumors with identified splicing variants could be
targeted with therapies that inhibit the spliceosome, splicing
regulatory proteins, or aberrant splicing products (Lee & Abdel-Wahab,
2016; Scotti & Swanson, 2016).
Indirect Insights From Allele-Specific Expression &
Nonsense-Mediated
Decay
Another indirect phenomenon into which RNA-seq can provide insight (and
which DNA-only analysis cannot appreciate) is allele-specific expression
(ASE ). The initial definition of ASE was a purely germline one:
preferential expression of one parental allele in a heterozygous
individual (Shao et al., 2019). Fundamentally with RNA-seq, ASE requires
heterozygous proxy variants, often single nucleotide polymorphism
(SNP s), that allow phasing (or at least quantification) of the
relative expression between maternally- and paternally-inherited
chromosomes (Figure 2 ). These proxy-SNPs include common
and rare synonymous, UTR, and missense variants. Importantly, ASE is the
end-result and, in practice, one is frequently inferring loss of a
pathogenic allele by monogenic expression of the wild-type allele.
Processes such as NMD will actively remove transcripts containing
pathogenic mutations.
Interestingly, detection of ASE can lead to discoveries of VUS and
cancer-related genes. One study in individuals with hereditary
pancreatic cancer identified a heterozygous SNP in BRCA2(rs144848) displaying ASE in RNA; the variant was a VUS and the
mechanism behind it was determined to be a truncating mutation leading
to NMD (Tan et al., 2008). NMD is the process whereby a trimeric complex
of proteins degrades mRNA that would otherwise result in
potentially-pathogenic truncated proteins; it has been posited that NMD
also plays a role in the regulation of normal mRNA expression, as
well as regulation of alternative splicing via degradation of splice
variants with premature termination codons (Brogna & Wen, 2009). One
common mechanism in the pathogenesis of cancer is a nonsense, somatic
mutation in a tumor suppressor that leads to NMD and, by means of
haploinsufficiency or dominant negative effects, loss of function.
One key historical example leading to ASE involves measuring
X-chromosome inactivation or “skewing” (a physiological process normal
in females) and genomic imprinting to gain insight into the preferential
selection of variants along the X-chromosome (Shao et al., 2019). In
genetic females, allele-specific expression also reflects the loss of
transcript expression from one allele due to X-inactivation, an
important methylation-driven “dosage compensation mechanism”
(Shvetsova et al., 2019). Historically, X-skewing has been detected
using the HUMARA assay, which takes advantage of differential
methylation at the androgen receptor locus. However, RNA-seq along
expressed X-chromosome genes provides additional insights. One can
directly observe skewing directionality (ex: 95% skewing in X and
preferential expression of wild-type alleles) (Szelinger et al., 2014).
Recognizing that there is natural variability in skewing in healthy
females, information regarding X-inactivation is less discriminatory but
still useful in unexpected ways. For example, in assessing a candidate
variant within the X-linked TAF1 in a boy, our group leveraged
RNA-seq data for the boy’s healthy mother which showed extreme X-skewing
towards the protective wild-type allele and away from our candidate VUS
(Hurst et al., 2018). In other words, RNA-seq was used to evaluate a
model where X-skewing protected the mother. Further sequencing showed
that the de novo event founded with her and was likely selected,
given a history of unsuccessful pregnancies by the boy’s parents. This
example has generalizability and is relevant to other X-linked
disorders, including Alport, Charcot–Marie–Tooth, Fabry, Fanconi,
Fragile X, Hunter, Rett, and Wiskott–Aldrich syndromes (Migeon, 2020).
Outlier Analysis as a Solution to Power
Problems
Case-control studies are feasible for common variants but are far
more difficult to accomplish for rare VUS where the controls
overwhelmingly outnumber the cases. A number of solutions have been put
forth to solve this problem. These include likelihood ratio tests,
burden tests with genetic scores, adaptive burden tests with
data-adaptive weights/thresholds, variance-component tests, exponential
combination tests, normal transformation with trait “winsorization”
(to find a balance between Type I error and statistical power), or a
combination thereof (Auer et al., 2016; Li et al., 2021).
Alternatively, one key solution to address the inherent lack of
statistical power when it comes to analyzing rare variants is to usegene expression outlier analysis. Rare variants are often
associated with extremes of expression, whether over- or
under-expression (X. Li et al., 2017). Variants can be interpreted
through careful integration of DNA data with RNA-seq data from other
patients or from public resources. Cummings et al. (2017) were one of
the first to demonstrate the fact that consideration of such “outlier”
variants in RNA-seq data results in an improved diagnostic rate (35% in
that study, specifically) in patients for whom a molecular diagnosis
cannot be made from DNA-data alone.(Cummings et al., 2017) Gene
expression outlier analysis has often been employed in the field of
cancer genomics to identify cancer drivers for a specific subset of
cancer types or cancer outliers (Alshalalfa et al., 2012; Mori et al.,
2013). Fundamentally, outlier analysis does not actually identify a
mechanism but rather a gene that is fundamentally different from others
expressed in a cohort. As will be discussed later, there is a
fundamental presumption that the comparison cohort is relevant.