Best place for figure 1

Liftover to enable compatible genetic variant description formats

The MECP2 genetic variant descriptions from the different sources were made compatible, and therefore comparable, by expressing them in HGVS nomenclature against the same reference sequence. This is the first step towards making the data interoperable. For this, we used the reference sequence for chromosome 23 (X), NC_000023.11, which is part of the current human genome reference assembly (GRCh38). Genomic descriptions were used so that variations both inside and outside the gene region (exonic, intronic, upstream, and downstream) were included. This re-description of all variants in HGVS nomenclature on the same reference build (liftover) was performed with the Mutalyzer position converter web tool (https://mutalyzer.nl/) (Wildeman, van Ophuizen, den Dunnen, & Taschner, 2008). Mutalyzer can convert between different reference sequences and reference sequence types (e.g. complete genomic regions, NC, and mRNA, NM) but requires nomenclature-compliant input. Variant descriptions that were incomplete or incorrectly formatted, but that provided enough information to be corrected, were corrected manually before conversion.
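As an illustration of the kind of format check that precedes conversion, the following minimal sketch flags descriptions that are not genomic HGVS descriptions on NC_000023.11. It is not the procedure used in this study: the regular expression covers only a small subset of the HGVS grammar, the function name `needs_manual_review` is hypothetical, and the positions in the usage example are made up.

```python
import re

# Reference sequence named in the text (chromosome X, GRCh38).
REFERENCE = "NC_000023.11"

# Simplified pattern (assumption): substitutions, deletions, duplications and
# insertions only. Anything else, including valid but unsupported HGVS forms,
# is conservatively flagged for review.
GENOMIC_HGVS = re.compile(
    r"^(?P<ref>NC_\d+\.\d+):g\."
    r"(?P<start>\d+)(?:_(?P<end>\d+))?"
    r"(?P<change>[ACGT]>[ACGT]|del|dup|ins[ACGT]+)$"
)


def needs_manual_review(description: str) -> bool:
    """Return True if the description is not recognised as a genomic
    HGVS description on the expected reference sequence."""
    match = GENOMIC_HGVS.match(description.strip())
    if match is None:
        return True
    return match.group("ref") != REFERENCE


# Illustrative inputs with made-up positions.
for text in ["NC_000023.11:g.154031326G>A", "c.473C>T"]:
    print(text, "->", "review" if needs_manual_review(text) else "ok")
```

A check of this kind only separates descriptions that can be passed to a converter directly from those that need attention; it does not validate the reference base or perform the liftover itself.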

Creation of phenotype annotated collections

Genetic variants were assigned, based on their linked phenotype information, to one of three categories: 1. RTT causing (identified as a disease-causing variant according to the requirements of the source database), 2. benign (found in a healthy control subject), and 3. unknown evidence (only pathogenicity prediction scores provided by the database). These lists were collected and used for further analysis.
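The assignment rule can be sketched as follows. This is a minimal illustration and not the code used in the study: the two boolean inputs are hypothetical stand-ins for the phenotype evidence recorded in each source, and the function returns a set because the same variant may carry both kinds of evidence across sources.

```python
from enum import Enum


class Category(Enum):
    RTT_CAUSING = "RTT causing"
    BENIGN = "benign"
    UNKNOWN = "unknown evidence"


def assign_category(disease_causing: bool, seen_in_healthy_control: bool) -> set:
    """Map phenotype evidence (hypothetical boolean flags) to categories."""
    categories = set()
    if disease_causing:
        categories.add(Category.RTT_CAUSING)
    if seen_in_healthy_control:
        categories.add(Category.BENIGN)
    # With neither kind of evidence, only prediction scores are available.
    if not categories:
        categories.add(Category.UNKNOWN)
    return categories
```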

Data FAIRification

We made the prepared genetic variant and phenotype data more Findable, Accessible, Interoperable, and Reusable for humans and computers, following the FAIR guiding principles (Wilkinson et al., 2016). The data were made machine-readable (in RDF format) using a semantic data model (see below) and a general-purpose FAIRifier tool (Thompson, Burger, Kaliyaperumal, Roos, & Bonino da Silva Santos, 2020) based on the OpenRefine data cleaning and wrangling tool (http://openrefine.org/) and an RDF plugin (https://github.com/stkenny/grefine-rdf-extension). Similarly, machine-readable metadata (information about the data) were generated using the Metadata Editor (Thompson et al., 2020). The machine-readable metadata were published on a FAIR Data Point (Bonino da Silva Santos et al., 2016; https://github.com/FAIRDataTeam/FAIRDataPoint-Spec), available via http://purl.org/biosemantics-lumc/rettbase/fdp. The FAIR Data Point metadata provide URIs that resolve to the RDF and CSV files for each of the nine sources on Figshare (https://doi.org/10.6084/m9.figshare.c.4769153.v1).
To convert the prepared data to RDF, we applied and extended the semantic data model of a genetic variant described in Horst et al. (2015). The model is available on GitHub (https://github.com/LUMC-BioSemantics/rett-variant) and describes the important data elements of the datasets: 1) the genetic variant (HGVS nomenclature, start and end position of the variation, and genome build), and 2) the phenotype information that describes whether a variant is thought to be RTT causing, benign, or unknown.
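To make the listed data elements concrete, the sketch below serialises one variant record as N-Triples using only the Python standard library. It does not reproduce the published model: the `http://example.org/rett-sketch/` namespace, all predicate names, and the example values are hypothetical placeholders, and the actual conversion in this work was done with the FAIRifier tool rather than with code like this.

```python
from dataclasses import dataclass

# Hypothetical namespace; the real model uses its own vocabulary.
EX = "http://example.org/rett-sketch/"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


@dataclass(frozen=True)
class VariantRecord:
    hgvs: str          # genomic HGVS description
    start: int         # start position of the variation
    end: int           # end position of the variation
    genome_build: str  # e.g. "GRCh38"
    phenotype: str     # "RTT causing", "benign" or "unknown"


def to_ntriples(record_id: str, record: VariantRecord) -> str:
    """Serialise one record as five N-Triples statements."""
    subject = f"<{EX}variant/{record_id}>"
    lines = [
        f'{subject} <{EX}hgvs> "{record.hgvs}" .',
        f'{subject} <{EX}start> "{record.start}"^^<{XSD_INTEGER}> .',
        f'{subject} <{EX}end> "{record.end}"^^<{XSD_INTEGER}> .',
        f'{subject} <{EX}genomeBuild> "{record.genome_build}" .',
        f'{subject} <{EX}phenotype> "{record.phenotype}" .',
    ]
    return "\n".join(lines)


# Made-up example record.
example = VariantRecord("NC_000023.11:g.154031326G>A", 154031326, 154031326,
                        "GRCh38", "RTT causing")
print(to_ntriples("v1", example))
```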

Downstream analysis

Network analysis of data distribution in RTT databases

To analyse the distribution of MECP2 variations across the RTT databases, a network was created in which the nodes represent databases and the node size represents the number of available MECP2 variations. The thickness of the line connecting two databases indicates how many MECP2 variations they share. The network visualization and analysis software Cytoscape (Shannon et al., 2003) was used for this purpose.
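The node sizes and edge weights of such a network can be derived from per-database variant sets, as in the following minimal sketch. The function name `build_network` and the database names and variant identifiers in the usage example are hypothetical; the resulting tables could then be loaded into a tool such as Cytoscape.

```python
from itertools import combinations


def build_network(variants_by_database):
    """Return (nodes, edges) for a database-overlap network.

    nodes: database name -> number of variants (node size)
    edges: (database_a, database_b) -> number of shared variants (line thickness)
    """
    nodes = {name: len(variants) for name, variants in variants_by_database.items()}
    edges = {}
    for a, b in combinations(sorted(variants_by_database), 2):
        shared = len(variants_by_database[a] & variants_by_database[b])
        if shared:  # databases without shared variants are not connected
            edges[(a, b)] = shared
    return nodes, edges


# Toy input with made-up identifiers.
toy = {
    "db_a": {"v1", "v2", "v3"},
    "db_b": {"v2", "v3", "v4"},
    "db_c": {"v5"},
}
nodes, edges = build_network(toy)
print(nodes)
print(edges)
```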

Variant annotation and characterization by genomic features

To characterize all the collected MECP2 variants, we developed an automated analysis pipeline for variant annotation. Starting from the HGVS-corrected variants, we combined custom scripts with the HGVS conversion tool from https://github.com/counsyl/hgvs to generate VCF files for annotation; the automated pipeline is available at https://github.com/mbosio85/HGVSparse. We then annotated the variants with the Ensembl Variant Effect Predictor (VEP; McLaren et al., 2016) v94 using the GRCh38 assembly, selecting all available features plus optional plugins to estimate variant pathogenicity in both coding and splicing regions (i.e., PolyPhen (Adzhubei et al., 2010), SIFT (Sim et al., 2012), MetaLR (Dong et al., 2015), CADD (Kircher et al., 2014), FATHMM-MKL (Shihab et al., 2015) from dbNSFP, and dbscSNV scores (Liu, Wu, Li, & Boerwinkle, 2016)).
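For the simplest variant type, the HGVS-to-VCF step can be sketched as below. This is an illustration only and not the pipeline's code: it handles single-nucleotide substitutions on NC_000023.11, whereas deletions, duplications, and insertions require the reference sequence to build anchored VCF alleles, which is what a dedicated conversion tool provides. The chromosome label "X" and the example position are assumptions.

```python
import re

# Single-nucleotide substitutions on the chromosome X reference only.
SUBSTITUTION = re.compile(r"^NC_000023\.11:g\.(\d+)([ACGT])>([ACGT])$")


def substitution_to_vcf_fields(description: str):
    """Convert a genomic HGVS substitution to (CHROM, POS, ID, REF, ALT).

    For a substitution, the 1-based HGVS genomic position equals the VCF POS.
    """
    match = SUBSTITUTION.match(description)
    if match is None:
        raise ValueError(f"not a supported substitution: {description!r}")
    position, ref, alt = match.groups()
    return ("X", int(position), ".", ref, alt)


# Made-up example variant, written as a tab-separated VCF data line prefix.
fields = substitution_to_vcf_fields("NC_000023.11:g.154031326G>A")
print("\t".join(str(field) for field in fields))
```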
The resulting VEP-annotated data were processed with R scripts, available at https://gitlab.bsc.es/mbosio85/rtt_summary_plot, to compare RTT-causing and benign variants as subsets and to generate summary statistics for them. The scripts allow the two classes to be compared and visualized in terms of any of the available VEP annotation features (e.g. variant frequency in the population, estimated variant consequence, and conservation score of the genomic location). Using these scripts, we compared the RTT-causing and benign datasets by pathogenicity scores, impact (i.e. the estimated consequence of each variant on the protein sequence), variant frequency, and genomic location. Because a few variations appear as both RTT causing and benign, we represented this subset of variants as a third class (“both”) in all visualizations.
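The handling of the third class can be sketched as follows. The original analysis was done with the R scripts cited above; this Python sketch is only an illustration, with hypothetical function names, made-up variant identifiers, and a generic numeric score standing in for any VEP annotation feature.

```python
from statistics import median


def label_classes(rtt_causing, benign):
    """Label each variant; variants present in both sets become 'both'."""
    labels = {}
    for variant in rtt_causing | benign:
        if variant in rtt_causing and variant in benign:
            labels[variant] = "both"
        elif variant in rtt_causing:
            labels[variant] = "RTT causing"
        else:
            labels[variant] = "benign"
    return labels


def median_score_by_class(labels, scores):
    """Median of a numeric annotation per class; unscored variants are skipped."""
    grouped = {}
    for variant, label in labels.items():
        if variant in scores:
            grouped.setdefault(label, []).append(scores[variant])
    return {label: median(values) for label, values in grouped.items()}


# Toy input with made-up identifiers and scores.
labels = label_classes({"v1", "v2"}, {"v2", "v3"})
print(median_score_by_class(labels, {"v1": 30.0, "v2": 12.0, "v3": 1.0}))
```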
Finally, we focused on exonic missense variants and used the VEP information about the amino acid change and its position within the MECP2-e2 transcript to visualize the distribution of variations across protein domains and conserved regions, as described in Lombardi, Baker, & Zoghbi (2015). This allowed us to characterize in finer detail how RTT-causing and benign variants are distributed differently across MECP2 domains.
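Counting missense variants per domain and class amounts to an interval lookup on the protein position, sketched below. The domain table is a placeholder with invented names and boundaries, not the MECP2 domain coordinates used in the study, and the function names are hypothetical; the real boundaries would be substituted from the cited domain definitions.

```python
from collections import Counter

# Placeholder domain table: (name, first residue, last residue), 1-based and
# inclusive. These values are invented for illustration only.
EXAMPLE_DOMAINS = [
    ("domain_1", 1, 100),
    ("domain_2", 101, 250),
]


def domain_for_position(position, domains=EXAMPLE_DOMAINS):
    """Return the name of the domain containing a protein position."""
    for name, first, last in domains:
        if first <= position <= last:
            return name
    return "outside annotated domains"


def count_by_domain(missense_variants, domains=EXAMPLE_DOMAINS):
    """Count (domain, class) pairs.

    missense_variants: iterable of (protein_position, variant_class) tuples.
    """
    counts = Counter()
    for position, variant_class in missense_variants:
        counts[(domain_for_position(position, domains), variant_class)] += 1
    return counts


# Toy input with made-up positions.
toy = [(50, "RTT causing"), (60, "benign"), (120, "RTT causing"), (300, "benign")]
for key, value in sorted(count_by_domain(toy).items()):
    print(key, value)
```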