Discussion
In this study we optimized and applied a multiplexed long-read sequencing approach that uses high-quality short-read exome data to perform routine phasing of de novo mutations (DNMs). We report phasing results for 77 DNMs from 64 of our 77 patient-parent trios linked to male infertility, corresponding to successful phasing of 71% of the 109 DNMs investigated with this targeted long-read approach. In contrast, only 9 of these DNMs (8%) could be reliably phased from short-read WES data alone.
Short-read exome sequencing has become an increasingly common tool in research and diagnostics of genetic disease, with patient-parent trio-based sequencing now routine for the detection of DNMs. With only 8% of DNMs being phasable in our cohort of 77 patients when using short-read WES alone, it is clear that an alternative approach is needed to determine the parent-of-origin and timing of DNMs. Our method uses long-range PCR with standard optimisation steps to achieve the simplest and quickest large-scale success. PCR is a simple and standard wet-lab practice, providing greater enrichment and target specificity than any alternative target-based approach. Sequencing the target-enriched long DNA strands with ONT allowed us, in most cases, to acquire tens of thousands-fold coverage per target, with many targets run per flow cell, meeting the project's scale demands. To overcome challenges with sequencing error and postzygotic mutations, we used the WES data and Sanger-validated DNMs to polish the variant analysis. This limited computational demand, as no complex algorithms were required, and kept processing quick.
DNMs are known to arise from mutational events occurring during gametogenesis, predominantly during spermatogenesis rather than oogenesis. This is assumed to be associated with the scale of male gamete production and the failure of DNA repair mechanisms, which together increase the opportunity for mutational events to occur (Aitken & Baker, 2020; Evenson et al., 2020; Grégoire et al., 2013; Haldane, 1947; Kong et al., 2012). Previous literature has shown that DNMs occur on the paternal allele approximately 80% of the time (Kong et al., 2012; Goldmann et al., 2016; Yuen et al., 2016). In agreement with this literature, 83% of all phased DNMs in this study were determined to be of paternal origin.
Parent-of-origin and zygosity information adds another layer to our understanding of potential disease-causing variants. This is important when investigating genetic diseases, especially those that likely have complex and varied mechanisms. In our cohort of 77 patients, 51 patients were confirmed to suffer from non-obstructive severe oligospermic or azoospermic phenotypes. In the original publication related to this work (Oud et al., 2022), we showed that 6 out of the 8 likely causative DNMs identified in these patients were of paternal origin (Supplementary Tables 8 and 14, Supplementary Figure 6). This suggests that DNMs with a deleterious effect on the health of an individual can escape negative selection in the paternal germline.
Accurate detection of the DNM allele frequency is critical to differentiate prezygotic from postzygotic mutational events, which is important in clinical settings for estimating the recurrence risk (Almobarak et al., 2020; Scanga et al., 2021). Our approach yielded a highly accurate average allele frequency of 49.6% for the prezygotic mutations, with an SEM of 0.84% (Supplementary Table 13). Though similar accuracy may be achievable with more computationally demanding methods, the strength of our method lies in utilizing the WES data and DNM validation practices that are commonly available. This shows that bioinformatic cleaning and more complex haplotype processing steps are unnecessary, with accurate results achievable through simple DNM and DNM-anchored iSNP selection. In total, 8 of the 77 phased DNMs were classified as postzygotic events (10%), largely in agreement with published estimates of 6.5% to 10% (Acuna-Hidalgo et al., 2015; Ye et al., 2018; Sasani et al., 2019), supporting the validity of our method. Interestingly, while there was a significant correlation between WES and ONT postzygotic base/allele frequencies, 25% of the postzygotic DNMs could not be identified as such from WES DNM base frequencies. This demonstrates the importance of combining phasing analysis with deep-coverage long-read sequencing to further characterise the timing of DNMs. As can be expected for postzygotic DNMs (Girard et al., 2016), we see less paternal bias, although our numbers are small (5 out of 8 postzygotic DNMs are paternal, 62%).
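The allele frequency summary described above can be illustrated with a minimal Python sketch. It computes the fraction of reads supporting each DNM, then the mean and standard error of the mean across the DNMs treated as prezygotic. The read counts, the function names, and the 35% cut-off for flagging candidate postzygotic events are hypothetical and chosen only for illustration; they are not values or rules taken from this study.

```python
from math import sqrt
from statistics import mean, stdev


def allele_fraction(alt_reads, total_reads):
    """Fraction of reads at one site that support the DNM allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def mean_and_sem(values):
    """Mean and standard error of the mean (sample SD / sqrt(n))."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    return mean(values), stdev(values) / sqrt(len(values))


def is_candidate_postzygotic(fraction, cutoff=0.35):
    """Hypothetical rule: flag DNMs whose allele fraction lies well below
    the 50% expected for a heterozygous prezygotic variant."""
    return fraction < cutoff


# Hypothetical (alt_reads, total_reads) pairs, one per DNM.
counts = [(5100, 10000), (4900, 10000), (5000, 10000), (1800, 10000)]
fractions = [allele_fraction(alt, total) for alt, total in counts]

prezygotic = [f for f in fractions if not is_candidate_postzygotic(f)]
avg, sem = mean_and_sem(prezygotic)
print(f"mean allele fraction = {avg:.3f}, SEM = {sem:.4f}")
```

With deep amplicon coverage the sampling noise in each fraction is small, which is why a simple summary of this kind can separate the two classes in principle; a real analysis would also need to account for amplification bias.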
We here use a standard PCR amplicon targeting approach with long-read sequencing, rather than CRISPR-Cas targeting. Although CRISPR-Cas has recently become a method of choice for long-read targeted sequencing (Hafford-Tear et al., 2019; Liu et al., 2019; Gilpatrick et al., 2020; McDonald et al., 2021), the large number of targets and small target sizes in our cohort would make CRISPR-Cas complex and costly. Standard PCR targeting is optimal for routine applications that do not require methylation data, read lengths greater than 10-20 kb, or directly representative read counts (Aird et al., 2011). While the CRISPR-Cas approach can achieve a 10- to 100-fold enrichment of the target region compared with standard low-coverage long-read WGS, it still results in 95.4% off-target sequencing (Gilpatrick et al., 2020). This off-target sequencing significantly limits the number of samples that can be run per flow cell, and only a single sample can be run if demultiplexing is based on the genomic position of the target. In contrast, with the standard amplicon approach used here, dozens of samples were run per flow cell and no off-target mapping was identified. Assuming the optimal CRISPR-Cas design of 2-3 gRNAs, and taking into account the reduced sample number per flow cell, CRISPR-Cas methods also carry a >40-fold increase in cost per target. Nonetheless, CRISPR-Cas target enrichment shows great promise and will likely be the best approach for targets larger than 10-20 kb. Although we did not observe more basecalling error from PCR extension in larger targets, it is worth considering that base errors and bias introduced by PCR could compound the lower accuracy inherent to long-read sequencing. Our data support the importance of minimizing target region sizes when performing PCR-based amplification for targeted sequencing, especially when optimising >100 bespoke primer pairs.
Limiting target sizes will reduce labour-intensive PCR optimisation and, though not observed in our study, may also reduce base error from PCR fidelity issues. We should, however, be mindful that for 11% of the DNMs studied no iSNPs were found within the 5 kb window, so minimizing the target region can also negatively impact phasing. For another 18% of DNMs, the sequencing data were of insufficient quality for phasing purposes, so a balance must clearly be found between sequencing quality and target size.
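Why phasing fails without a nearby iSNP can be shown with a minimal Python sketch. It assumes that an iSNP is a biallelic site at which the child is heterozygous and the trio genotypes identify unambiguously which allele was inherited from the father. The function names, the genotype encoding (alternate-allele counts 0, 1, 2), and the simple majority vote over reads are hypothetical simplifications and do not reproduce the pipeline used in this study.

```python
def transmissible(genotype):
    """Alleles a parent can pass on, given its alternate-allele count."""
    return {0: {"ref"}, 1: {"ref", "alt"}, 2: {"alt"}}[genotype]


def paternal_allele(child, father, mother):
    """Return 'ref' or 'alt' for the allele the child inherited from the
    father, or None when the site is not informative."""
    if child != 1:
        return None  # child must be heterozygous to distinguish haplotypes
    # A heterozygous child carries one ref and one alt allele, so keep only
    # (paternal, maternal) combinations in which the two alleles differ.
    options = [(p, m) for p in transmissible(father)
               for m in transmissible(mother) if p != m]
    if len(options) != 1:
        return None  # ambiguous (both parents het) or Mendelian-inconsistent
    return options[0][0]


def dnm_parent_of_origin(reads, paternal):
    """reads: (carries_dnm, isnp_allele) for each read spanning both sites.
    Returns 'paternal', 'maternal', or None when there is no majority."""
    on_paternal = sum(1 for dnm, allele in reads if dnm and allele == paternal)
    on_maternal = sum(1 for dnm, allele in reads if dnm and allele != paternal)
    if on_paternal == on_maternal:
        return None
    return "paternal" if on_paternal > on_maternal else "maternal"


# Hypothetical trio: child 0/1, father 0/0, mother 0/1 at the iSNP.
pat = paternal_allele(child=1, father=0, mother=1)
# Hypothetical reads: most DNM-carrying reads also carry the paternal allele.
reads = [(True, "ref")] * 90 + [(True, "alt")] * 5 + [(False, "alt")] * 95
origin = dnm_parent_of_origin(reads, pat)
print(pat, origin)
```

When no site in the amplicon passes the informativeness check, `paternal_allele` returns None for every candidate and the DNM cannot be assigned to a parent, however deep the coverage.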
Since ONT released the MinION platform in 2014, there have been extensive advances in both the chemistry and the bioinformatic tools. These have moved raw base accuracy from as low as ~60% (Loman and Watson, 2015) to the current 92-97% with the R9.4.1 flow cell chemistry used in this investigation. It should be noted that further increases in accuracy have also been suggested for more recent flow cell chemistry, such as the R10.4.1 flow cell released this year. Bioinformatic tools, including the variant caller ‘Clair’ used here, have also shown increased confidence in variant calls but are thought to be reaching their limit, with greater confidence requiring substantially different algorithms or improvements in chemistry (Luo et al., 2020). Despite the bioinformatic improvements in base calling and variant calling, we observe that long-read data on long-range PCR products still yield far more false positives than short-read WES data. After filtering ONT variants by read depth and quality scores, our anchored approach removed an additional 50% of the remaining variants on average. Had the false variants that passed the depth and quality filters been included in the phasing process, it is likely that some targets would have been phased incorrectly or not phased at all. Many phasing tools, such as ‘WhatsHap’, carry out phasing on the assumption that the variants within the VCF file are correct, so the removal of false variants is important.
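The principle behind anchored filtering can be sketched in a few lines of Python: a long-read variant call is retained only when the same variant is also supported by the short-read trio data or is itself a validated DNM. The tuple representation of a variant, the function name, and the example coordinates are hypothetical; the sketch illustrates the general idea under these assumptions rather than the exact implementation used in this study.

```python
def anchor_filter(ont_calls, wes_variants, validated_dnms):
    """Split ONT calls into (kept, removed) according to whether their
    (chrom, pos, ref, alt) key is present in the trusted anchor set."""
    trusted = set(wes_variants) | set(validated_dnms)
    kept = [v for v in ont_calls if v in trusted]
    removed = [v for v in ont_calls if v not in trusted]
    return kept, removed


# Hypothetical variants, written as (chrom, pos, ref, alt).
wes_variants = [("chr1", 1000, "A", "G"), ("chr1", 2500, "C", "T")]
validated_dnms = [("chr1", 1800, "G", "A")]
ont_calls = [
    ("chr1", 1000, "A", "G"),   # supported by WES: kept
    ("chr1", 1234, "T", "C"),   # ONT only: removed as a likely false positive
    ("chr1", 1800, "G", "A"),   # the validated DNM: kept
    ("chr1", 2200, "A", "AT"),  # ONT only: removed
]

kept, removed = anchor_filter(ont_calls, wes_variants, validated_dnms)
print(f"kept {len(kept)} of {len(ont_calls)} ONT calls")
```

Only the `kept` list would then be written to the VCF passed to a phasing tool, so that the tool's assumption of correct input variants is better satisfied.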
Our study provides an approach for accurate phasing and parent-of-origin calling of DNMs in a set of 77 patients. To our knowledge, this is the first time that phasing of DNMs has been investigated on this scale using long-range PCR targeted ONT sequencing in which each sample has its own unique target. We optimized the method for efficiency and streamlined the laboratory and computational pipelines for processing large numbers of DNMs for detailed phasing analysis. We incorporated additional short-read patient-parent trio sequencing data and Sanger-validated DNMs, which are commonly available from DNM discovery pipelines like ours. This approach enabled us to improve DNM phasing and postzygotic calling. This data-supported and anchored phasing approach can be of great use in both research and diagnostic settings where DNMs are routinely studied and interpreted.