The Clinical Utility and Diagnostic Yield of RNA
While analytical validity refers to the sensitivity, specificity,
and accuracy of a diagnostic test in terms of its ability to measure a
biomarker in a lab setting, clinical validity refers to the
accuracy and predictive value of that test when it comes to predicting
a clinical diagnosis. Both terms are distinct from clinical utility,
which refers to a test’s ability to make a difference: that is, its
potential to impact patient quality of care/life by guiding
clinical decision-making (Byron et al., 2016). Here we will outline the
clinical utility of transcriptome analysis for the diagnosis of both
neoplastic and non-neoplastic diseases.
Mendelian Disorders
Transcriptome analysis is a boon to the diagnosis of rare Mendelian
diseases. Historically, genetic counseling has relied upon whole exome
sequencing to identify causative disease variants; however, this
DNA-only approach has left up to 75% of patients without genetic
diagnoses (Stenton & Prokisch, 2020). When integrated with genome
sequencing – and, especially, in situations when said genome sequencing
encompasses both exons and introns – gene expression profiling has been
shown to significantly boost molecular diagnostic rates; yields have
been shown to increase by 10-35% (Lee et al., 2020; Maddirevula et al.,
2020; Stenton & Prokisch, 2020). This is because RNA data both: (1)
puts any variants identified in DNA into context by revealing their
transcript-level consequences (e.g., allele-specific expression due to
nonsense-mediated decay, imprinting, and/or expression of splice
variants), and (2) illuminates phenomena (like gene expression outliers)
that may not pass the threshold of detection in DNA data alone (but that
are crucial to the pathogenesis of a given disease) (Lee et al., 2020).
Gene expression profiling has also improved clinicians’ ability to
diagnose, stratify, and subtype autoimmune diseases, like systemic lupus
erythematosus, as well as degenerative diseases like Age-Related Macular
Degeneration (Alarcón-Riquelme, 2019). Additionally, consideration of
the transcriptomic landscape has shed light on the fact that many of
these diseases are heterogeneous, with a spectrum of causative molecular
events (Morello et al., 2019). RNA-seq is also capable of overcoming the
“bottleneck of variant interpretation” in patients with inborn errors
of metabolism, mitochondriopathies, and/or unsolved muscle disorders,
leading to significantly increased diagnostic yields (Kremer et al.,
2018; Thompson et al., 2020).
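As a rough illustration of how the allele-specific expression mentioned above might be flagged at a heterozygous site, the following minimal Python sketch applies an exact two-sided binomial test to reference and alternate allele read counts. The function name, the example read counts, and the expected 0.5 allelic ratio are illustrative assumptions, not the method of any study cited here; published pipelines typically also account for reference mapping bias and overdispersion.

```python
from math import comb


def allelic_imbalance_p_value(ref_reads: int, alt_reads: int,
                              expected_alt_fraction: float = 0.5) -> float:
    """Exact two-sided binomial p-value for allelic imbalance.

    Hypothetical helper: tests whether the observed alternate-allele read
    count departs from the expected fraction (0.5 for balanced expression).
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no coverage, so no evidence of imbalance

    p = expected_alt_fraction

    def pmf(k: int) -> float:
        return comb(n, k) * p ** k * (1 - p) ** (n - k)

    observed = pmf(alt_reads)
    # Sum the probability of every outcome at most as likely as the observed one.
    total = sum(pmf(k) for k in range(n + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


# Illustrative counts only: 48 reference reads vs. 6 alternate reads.
p_value = allelic_imbalance_p_value(48, 6)
print(f"allelic imbalance p-value: {p_value:.3g}")
```

A small p-value here would only suggest skewed allelic expression; attributing it to nonsense-mediated decay, imprinting, or aberrant splicing requires the additional evidence described above.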
It is important to note that recent studies have shown, particularly in
the case of monogenic neuromuscular disorders, that blood-based
RNA-seq is not sufficient for diagnosis; however, RNA-seq performed on
myotubes generated by trans-differentiation of patient fibroblasts was
capable of identifying a molecular culprit (predominantly splicing
variants) in 36% of patients for whom DNA-only analysis had
failed to do so (Gonorazky et al., 2019). This highlights the fact that
several methodological improvements must be made to hasten the progress
of translating transcriptome analysis from the benchtop to the bedside,
and to enhance diagnostic sensitivity. These include refinement of
ex vivo trans-differentiation of accessible cells into more
disease-relevant cell types (Lee et al., 2020).
Hereditary Cancer
Cancer genomic analysis involves the identification of inherited
(“germline”) risk variants and acquired (“somatic”) mutations in DNA
and RNA (Koeppel et al., 2018). Transcriptome analysis has been shown to
be capable of identifying rare, causative variants by revealing changes
in splicing and gene expression that were undetected by DNA sequencing
(Yuan et al., 2020). Because examples of RNA-seq analysis used in
conjunction with cancer risk prediction are more recent, we will dissect
these papers in greater detail. Before we do, two key distinctions
should be made regarding hereditary cancer studies. First, there is a
general bias towards using RNA-sequencing in conjunction with
panel-based clinical sequencing, to reduce the genomic search space
considerably. Focusing on all oncogenic or tumor-suppressor genes
changes the prevalence of background events and, ultimately, the
precision and/or diagnostic yield. Second, the context of reporting is
distinct from that of Mendelian studies, which typically represent a
search or diagnostic odyssey. Hereditary cancer VUS typically create
unique stress, and negative findings carry some implied interpretation.
A series of papers from 2019 through 2021 illustrate and give further
insights into these distinctions. First, Conner et al. (2019) found
that, by supplementing DNA genetic testing with RNA, heterozygous
duplication events in MSH2 – which were previously classified as
VUS in five individuals with Lynch Syndrome – were able to be
reclassified as pathogenic or likely pathogenic (Conner et al., 2019).
Similarly, Karam et al. (2019) showed that, by supplementing DNA with
RNA genetic testing in cases suspicious for hereditary cancer in which
the variant in question involved a potential splice site alteration, (1)
inconclusive DNA-based results were resolved in 49 of 56 inconclusive
cases (88%) studied, with 26 (47%) being reclassified as
clinically-actionable and 23 (41%) being clarified as benign; and (2) an
estimated 2% of patients receiving paired DNA/RNA testing
would benefit from the addition of RNA through further characterization of
splice-site VUS (Karam et al., 2019). Two other studies found that the
addition of transcriptomic analysis to hereditary cancer testing enabled
60% and 20%, respectively, of splicing VUS to be reclassified as
(likely) pathogenic (Agiannitopoulos et al., 2021; Rofes et al., 2020).
Landrith et al. (2020) performed germline RNA-seq to profile 18 genes
(i.e. APC, ATM, BRCA1, BRCA2, BRIP1, CDH1, CHEK2, MLH1, MSH2,
MSH6, MUTYH, NF1, PALB2, PMS2, PTEN, RAD51C, RAD51D, and TP53)
in patients with suspected hereditary cancer syndromes. The
investigators demonstrated a 9.1% relative increase in the detection of
pathogenic variants afforded by augmenting DNA data with RNA analysis
(Landrith et al., 2020). Deep intronic variants have also been
identified in BRCA1/2, by virtue of RNA analysis, in patients with
familial breast and ovarian cancers (Anczuków et al., 2012; Montalban et
al., 2019).
As is evident from the studies mentioned above, deep intronic
mutations and RNA splicing aberrations are unique mechanisms of
carcinogenesis which, based upon DNA data alone, are still often
classified as VUS (Urbanski et al., 2018). Splicing mutations, which can
be present in both pre-mRNA exons and introns (the latter of which have
historically been harder to detect using traditional DNA analyses), lead
to abnormal mRNA phenomena (e.g. exon skipping, intron inclusion,
cryptic splice site activation) and the production of abnormal proteins
with diagnostic value (Shi et al., 2018). Expression changes in splicing
regulators can be used as biomarkers for cancer diagnosis (e.g.,
hnRNPA2/B1, an RNA-binding protein involved in mRNA splicing, is
a sensitive and specific early-diagnostic marker of lung neoplasms)
(Zhang et al., 2021). RNA-seq has shown utility in diagnosing germline
splicing variants in hereditary cancer genes that were not evident in
DNA analysis (Urbanski et al., 2018). While splicing variants make up
11% of hereditary cancer gene VUS, they make up 55% of those VUS that
are “likely pathogenic” (Parsons et al., 2019).
Larger-scale reports have been published by clinical genetic companies
where RNA-seq was used in conjunction with panel-based studies across
thousands of individuals. Ambry recently released a series of “RNA Case
Studies” that demonstrate the clinical diagnostic utility of
transcriptomic data, particularly for identifying intronic variants
(AmbryGenetics, 2019). One such scenario was the case of a 33-year-old
male, with a personal and family history of colon polyps, for whom no
clinically-significant variants could be detected via DNA-only analysis.
When genetic analysis was supplemented with transcriptomic analysis
(i.e. Ambry’s +RNAinsight® panel), however,
abnormal APC transcripts were detected, prompting further
investigation via targeted Sanger DNA sequencing. This resulted in the
confirmation of a deep intronic, likely pathogenic variant.
Transcriptomic data enabled the patient’s provider to make a genetic
diagnosis of familial adenomatous polyposis (AmbryGenetics, 2019). Other
examples include a likely pathogenic intronic variant that was
identified outside of DNA analytical range in the gene ATM
(c.497-2661A>G), and exon skipping variants in MSH6 leading to Lynch
Syndrome. Ambry’s +RNAinsight® panel, mentioned in the 2 cases above,
analyzes 91 cancer driver genes
and can be paired with most DNA panels; it has been shown to be capable of
reclassifying >70% of VUS (AmbryGenetics, 2021).
Similarly, a recent study by Invitae aimed to exemplify the utility of
RNA analysis for reclassifying splicing VUS (Truty et al., 2021). The
investigators analyzed a large sample consisting of nearly
700k patients from a clinical cohort plus individuals from two large
public datasets (i.e. ClinVar and Genome Aggregation
Database/gnomAD) (Truty et al., 2021). In their clinical
cohort, Invitae found that 5.4% of individuals had at least one
splicing VUS (most of which were identified outside of essential splice
sites), and that splicing variants represented 13% of all variants
classified as (likely) pathogenic or VUS. They estimated that, in the
clinical cohort, RNA analysis would be capable of
clarifying/reclassifying splicing VUSs in 1.7% of cases. In comparison
to the clinical cohort, in ClinVar and gnomAD, Invitae
observed that splicing VUS comprised nearly 5% and 9% of reported
variants, respectively. Invitae concluded that, in all 3 cohorts,
individuals would have a tangible, clinical-diagnostic benefit from RNA
testing (Truty et al., 2021).
Not only can transcriptome characterization classify VUS as (likely)
pathogenic, but it can also clarify variants as benign. For
example, RNA data supported a variant downgrade of a likely
pathogenic splice site variant at a canonical splice site (Shamseldin et
al., 2021). In the case of CDH1 c.387+1G>A, various
clinical laboratories initially reported the variant in multiple
Hispanic/Latino patients as “likely pathogenic” on the basis of the
“+1” position of the variant. This led to the diagnosis of hereditary
diffuse gastric cancer syndrome, a condition requiring complex
management because of its association with a very high risk of early
onset gastric cancer and lobular breast cancer. However, the variant was
studied in more detail because the patients with this variant lacked the
associated phenotype of the condition. The variant was experimentally
demonstrated to result in the activation of a cryptic in-frame donor
splice site, leading to the recommendation by ACMG and AMP that variants
at this position not be considered as likely pathogenic (Maoz et al.,
2016).
In large part, we have limited this review to germline-inherited
variation due to space and scope. However, clearly, RNA-sequencing has
utility in the context of somatic variation, and, in fact, this can be
the basis of treatment decisions. It is worth highlighting that a 2021
study in Oncogene examined somatic variation across over 1,000
pan-cancer, paired whole genomes and transcriptomes to understand the
role of splicing mutations in tumorigenesis. The investigators
identified about 700 somatic intronic mutations; nearly half were within
deep intronic regions and, of those, 38% activated cryptic splice
sites. A subset of the deep intronic mutations altered splicing
enhancers or silencers. The investigators found that intronic mutations
often affected tumor suppressor genes, and that hematological
malignancies, in particular, harbor many deep intronic mutations. Taken
as a whole, this paper suggests considerable insights can be gained well
beyond germline analysis of VUS (Jung et al., 2021).
Limitations & Future Directions
The progress of RNA-based diagnostics is encouraging, especially as new
and translational gene expression profiling techniques emerge (Wang et
al., 2020). Gene expression profiling allows not only for the
identification of fusion transcripts, but also for the detection of
phenomena like differential expression, allele-specific expression (ASE),
alternative splicing, and
the presence of non-coding RNAs (Conner et al., 2019). Both targeted RNA
microarrays and RNA-seq have shown analytical validity when it comes to
diagnostics for pediatric, adolescent/young adult, and adult patients
(Vaske et al., 2019).
Conflicting Lines of Evidence
One fallacy of reasoning – commonly and erroneously applied to the
analysis of variant lists such as variant call format (VCF)
files – is the assumption that the absence of a transcript
variant means that the variant is absent from the specimen. This common
misconception led to the development of genomic VCFs (gVCFs),
which call every position, both variant and wild type/reference.
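The distinction between "no variant reported" and "confidently reference" can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a parser for real VCF or gVCF files: the dictionary of variant calls, the list of reference blocks, the depth threshold of 10, and the function name are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified stand-ins for file contents:
#   variant_calls: position -> description of the called variant
#   ref_blocks:    (start, end, minimum read depth) for reference-only stretches,
#                  loosely modeled on the reference blocks of a gVCF
VariantCalls = Dict[int, str]
RefBlocks = List[Tuple[int, int, int]]


def classify_site(pos: int, variant_calls: VariantCalls,
                  ref_blocks: RefBlocks, min_depth: int = 10) -> str:
    """Return 'variant', 'reference', or 'no_call' for one position."""
    if pos in variant_calls:
        return "variant"
    for start, end, depth in ref_blocks:
        if start <= pos <= end:
            # Only call the site reference if coverage supports that claim.
            return "reference" if depth >= min_depth else "no_call"
    # Not listed anywhere: absence of a record is not evidence of absence.
    return "no_call"


variant_calls = {100: "A>G"}
ref_blocks = [(1, 99, 30), (101, 150, 3)]
for position in (50, 100, 120, 500):
    print(position, classify_site(position, variant_calls, ref_blocks))
```

For RNA-derived calls the same logic applies with an extra caveat: a position may be uncovered simply because the gene or isoform is not expressed in the sampled tissue.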
The only way to move forward with statistical power and confidence is
through collaborative efforts and the creation of diverse and devoted
databases. ClinVar (Rehm et al., 2017) and gnomAD (Karczewski et al.,
2020) are under-appreciated summary-level datasets. gnomAD’s focus on
categorizing rare events was foundational. At
the RNA-level, this approach has not yet been adopted outside of
isolated cases; burgeoning examples are RNAcentral (a database of
non-coding RNAs) (Petrov et al., 2015) and SpliceDB (a database
of canonical and non-canonical mammalian splice sites) (Burset et al.,
2001).
With the clinical implementation of any new “translational”
technology, one must approach variant curation and interpretation of
functional evidence with caution. Interpretation can be more complex
than anticipated; there are many potential pitfalls. For example, Nix et
al. once posited that a partial exon-skipping mutation identified in
BRCA2 was pathogenic; it was later found to occur in many healthy
controls (Mundt et al., 2017).
Differences in RNA-seq Library Preparation & Analysis Methods
Unlike genomic sequencing of DNA, differences in collection methods,
library preparation, tissue sources, etc. massively impact RNA-seq
analysis and interpretation. The first and most apparent variable is the
tissue source for RNA and its relevance to the disease or phenotype. For
example, how well can RNA from whole-blood provide insights into
neurological disorders? GTEx provides an initial framework to evaluate
this question, typically showing >40% of genes expressed at
reasonably high levels, and the studies reviewed in previous sections
frequently faced a similar question (Consortium, 2013). Likely,
customized assays leveraging enrichment may increase this dynamic range
of RNA species, recognizing many genes will not have the expression
needed for interpretation via RNA-seq. Nonetheless, many of the studies
highlighted showed >10% improvement in diagnostic yield
despite such changes.
Without question, the ability to examine rare DNA variation across
thousands of individuals, through resources like gnomAD, has
profoundly influenced the interpretation of genomic variants.
Aggregation of RNA data, even within the same lab, will face significant
challenges that cannot be ignored. As consortiums and labs have
experienced, when RNA-seq data are aggregated across samples, studies, and
library preps, technical variables typically drive the largest proportion
of the variation. Efforts to normalize or adjust for these technical
differences are an active area of research beyond the scope of this
review.
Even still, within consortiums such as PsychENCODE (Psych et
al., 2015) and AMP-AD (Hodes & Buckholtz, 2016), among others,
eliminating technical variation from RNA-seq experiments is challenging,
particularly if one is interested in rare events. To illustrate this
point, we consider the recent release of 4,871 longitudinally-collected
samples from 1,570 clinically-phenotyped individuals from the
Parkinson’s Progression Marker Initiative (PPMI), conducted using random
priming for PaxGene collected whole-blood with paired whole-genome
sequencing (Craig et al., 2021). Forthcoming efforts from TopMED will
utilize the same PaxGene whole-blood protocols but will differ in using
mRNA-seq from poly-A priming. These two methods capture different RNA
species, with random priming also capturing pre-spliced RNA and
non-polyA-tailed transcripts. Algorithms trained on these methods will
fundamentally differ in their core measures, such as percent spliced in
(PSI). Even within
the same dataset, we have observed significant differences in gene/exon
usage that depended on read lengths of paired 100bp vs. a 125bp subset.
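To make the PSI measure concrete, the sketch below computes it in its simplest form, as the fraction of junction-spanning reads that support inclusion of an exon. The function name and the read counts are illustrative assumptions; this simplified ratio ignores the normalization for junction number and effective length that dedicated splicing tools apply, and the two hypothetical libraries are not data from PPMI or TopMED.

```python
from typing import Optional


def percent_spliced_in(inclusion_reads: int, exclusion_reads: int) -> Optional[float]:
    """Simplified PSI: inclusion reads / (inclusion + exclusion reads).

    Returns None when no informative reads cover the event, because PSI is
    undefined (not zero) without coverage.
    """
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None
    return inclusion_reads / total


# Hypothetical counts for the same exon measured with two library preparations.
libraries = {
    "random_priming": (30, 10),
    "polyA_selected": (45, 5),
}
for name, (inc, exc) in libraries.items():
    print(name, percent_spliced_in(inc, exc))
```

Because the estimate depends directly on which reads are captured and counted, thresholds or models tuned on one protocol should not be assumed to transfer to another without re-evaluation.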
While daunting, solutions for aggregating RNA data are emerging, such as
the ARCHS4 aggregation across mouse and human
RNA-seq studies (Lachmann et al., 2018). Other examples include in-house
solutions or those specific to a given group; it becomes a question of
sensitivity. Our group successfully employed outlier analysis to
identify causative variants in a cohort collected over 5 years that was
sequenced by different labs using different methods.
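One simple form of expression outlier analysis is a leave-one-out z-score, in which each sample's expression of a gene is compared against the rest of the cohort. The sketch below is a minimal illustration under stated assumptions, not the procedure our group or any cited study used: the sample names, the log-scale expression values, and the |z| >= 3 cutoff are hypothetical, and the sketch makes no attempt to correct for the batch and library-preparation effects discussed above.

```python
from statistics import mean, stdev
from typing import Dict


def leave_one_out_z(expression: Dict[str, float]) -> Dict[str, float]:
    """z-score of each sample against the mean/SD of all *other* samples."""
    scores: Dict[str, float] = {}
    for sample, value in expression.items():
        others = [v for s, v in expression.items() if s != sample]
        if len(others) < 2:
            continue  # a standard deviation needs at least two reference samples
        sd = stdev(others)
        if sd == 0:
            continue  # no spread in the reference set; z-score undefined
        scores[sample] = (value - mean(others)) / sd
    return scores


# Hypothetical log-scale expression of one gene across a small cohort.
gene_expression = {
    "s1": 10.0, "s2": 10.5, "s3": 9.5, "s4": 10.2, "s5": 9.8,
    "patient": 2.0,
}
z_scores = leave_one_out_z(gene_expression)
outliers = [s for s, z in z_scores.items() if abs(z) >= 3.0]
print(outliers)
```

Excluding the tested sample from the reference distribution keeps a single extreme value from inflating the standard deviation and masking itself.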
Fragmentation of RNA-seq Databases and Standards
Though the RNA-based diagnostics described here have potential, there
are still obstacles that must be overcome before they will be
incorporated into routine clinical practice. These challenges include
the need for scientific rigor, reproducibility, accuracy, precision,
clinical validity, and clinical utility. Standards must be created for
test thresholds and normalized reporting, and databases must be
established (Tahiliani et al., 2020; Wang et al., 2020). These databases
must be designed so as to not fall prey to any logical fallacies (e.g.,
the “marker-positive fallacy”).
Issues of database size, diversity, and representation (both in the
sense of race/ethnicity and cases/controls), population structure, and
cryptic relatedness must be considered (Update., 1996). We must also
acknowledge, and attempt to address, limitations (e.g., the
half-life/stability of RNA) and potential confounders (e.g. temporal
changes in RNA expression, differences in RNA capture from fresh frozen
vs. formalin fixed paraffin embedded samples, and phenomena like clonal
hematopoiesis of indeterminate potential in liquid biopsies) (Wang et
al., 2020).
Investigators must carefully consider the tissue from which they are
isolating RNA given the fact that expression patterns differ across
tissues (and, on the circadian-level, RNA expression can even differ in
the same tissue at different time points) (Maddirevula et al., 2020). It
is important to balance preference for minimally-invasive techniques
with considerations of differential tissue expression. One recent study
found that, when comparing brain vs. blood vs. human B-lymphoblastoid
cell lines (LCLs), LCLs possessed isoform diversity for
neurodevelopmental genes similar to that of brain tissue; LCLs also
expressed these genes more highly compared to blood (Rentas et al.,
2020). The authors of this paper described an RNA-seq pipeline with 90%
sensitivity and claimed that findings in LCLs outperformed those in
blood and had implications for the molecular diagnosis of
>1000 genetic syndromes (Rentas et al., 2020).
Another limitation is the fact that expression quantitative trait loci
(eQTL) databases, like GTEx Portal, are limited to common
variants (i.e. variants with a minor allele frequency >1%). This means
that such datasets are not applicable
toward understanding VUS which, although rare in the general/overall
population, disproportionately impact non-White/non-European groups. RNA
analysis is also limited by the fact that most tools utilize transcripts
defined by a Gene Transfer Format (GTF) file and find it difficult to
annotate the 3′ untranslated region (3′ UTR) (Shenker et al., 2015).
Therefore, there is a critical need for more rigorous, reproducible, and
representative RNA databases and tools.
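The effect of the >1% minor allele frequency cutoff described above can be shown with a short calculation. The sketch is illustrative only: the function names, the allele counts, and the cohort size of 1,000 individuals (2,000 alleles) are assumptions, and the 1% threshold is simply the value quoted in the text.

```python
def minor_allele_frequency(alt_allele_count: int, total_alleles: int) -> float:
    """Frequency of the less common allele at a biallelic site."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    alt_frequency = alt_allele_count / total_alleles
    return min(alt_frequency, 1.0 - alt_frequency)


def passes_common_variant_cutoff(alt_allele_count: int, total_alleles: int,
                                 maf_cutoff: float = 0.01) -> bool:
    """True if the variant is common enough to be tested under the cutoff."""
    return minor_allele_frequency(alt_allele_count, total_alleles) > maf_cutoff


# Hypothetical cohort of 1,000 diploid individuals (2,000 alleles).
for alt_count in (3, 400):
    maf = minor_allele_frequency(alt_count, 2000)
    print(alt_count, f"{maf:.4f}", passes_common_variant_cutoff(alt_count, 2000))
```

A variant seen on 3 of 2,000 alleles falls well below the cutoff and would therefore have no eQTL evidence to draw on, which is the typical situation for a VUS.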
VUS as a Manifestation of Cancer Disparities
One anecdotal trend that we have noticed within our own group and across
collaborative efforts is that RNA data allows for the identification of
previously missed variation particularly in individuals of non-European
ancestry. For example, in Human Mutation we reported a variant
within 3bp of the exon boundary using an outlier approach in individuals
of African ancestry. The molecular consequences of this variant included
exon skipping, altered isoform usage, and loss of canonical isoform
expression – events not evident in DNA data alone (McCullough et al.,
2020). Patients who self-identify as Hispanic/Latinx, Black/African, and
Asian/Pacific Islander experience more advanced stage disease at time of
screening, significantly lower diagnostic yields, and higher rates of
VUS and variant reclassification compared to their European/Caucasian
counterparts (Dutil et al., 2019; Kinney et al., 2018; Kowalski et al.,
2019; Marco-Puche et al., 2019; Ndugga-Kabuye & Issaka, 2019; Roberts
et al., 2020; Slavin et al., 2018; Urbina-Jara et al., 2019).
Individuals from non-European populations will have more private
variation for one of three reasons: (1) they are poorly represented in
reference datasets, (2) they have greater African ancestry, or (3) they
come from a population that has undergone recent expansions (e.g.,
Bangladesh) (Halperin et al., 2017).
A recent study reported by Ambry Genetics found that their
BRCAplus, BreastNext, and CancerNext panels yielded
≈2-3x fewer VUS for Non-Hispanic whites than for minority populations
(AmbryGenetics, 2017). Another study reports VUS frequencies in the
tumor suppressor genes BRCA1/2 to be 4.4% in Caucasians, 8.9%
in African Americans, and 8.0% in Hispanic/Latinos; for larger
hereditary cancer panels, this study reported VUS frequencies of 22.1%
in Caucasians, 30.3% in African Americans, and 24.9% in
Hispanics/Latinos (Appelbaum et al., 2020).
One important distinction to make here is the difference between
race/ethnicity and genetic ancestry. While race and ethnicity are social
constructs, ancestry is a biological/genetic construct that reflects the
biogeographical genetic variation produced by human migrations throughout
history (Batai et al., 2021). An example of how genetic ancestry can
further clarify race/ethnicity-based disparities is the fact that higher
African ancestry in Hispanic/Latinos (who are typically “admixed” with
genetic contributions from African, European, and American Indian aka
Native/Indigenous American ancestries) is associated with more
aggressive breast cancer subtypes and a greater likelihood of receiving
inconclusive VUS during genetic testing (Chapman-Davis et al., 2021;
Dutil et al., 2019; Kinney et al., 2018; Kowalski et al., 2019;
Marco-Puche et al., 2019; Ndugga-Kabuye & Issaka, 2019; Roberts et al.,
2020; Slavin et al., 2018; Urbina-Jara et al., 2019; Virlogeux et al.,
2015). Gene expression profiling may be able to help shed light on and
alleviate these inequities (Frésard et al., 2019; Wai et al., 2020).
Conclusions
VUS cause significant psychological distress to patients and
disproportionately limit the promise of precision medicine for minority
patients (Landry et al., 2018). RNA data provides critical answers to
the question of VUS, particularly in terms of clarifying deep intronic
and splicing variants as pathogenic vs. benign. This necessitates the
development of more rigorous, reproducible, and representative RNA
databases and analytical tools.