2.2 Core-processing
The input files are processed by five executables, which we refer to collectively as “MeStudio core”. These components match the nucleotide motifs to the genomic sequence and map each match to the corresponding category, as extracted from the annotation file. Categories are defined as follows: i) protein-coding genes with accordant (sense) strand (CDS), ii) protein-coding genes with discordant (antisense) strand (nCDS), iii) regions that fall between annotated genes (true intergenic, tIG), and iv) regions upstream of the reading frame of a gene, with accordant strand (US) (Figure 1B). The current implementation uses a naive matching algorithm to map motif sequences to the reference genome. During the matching stage, each replicon or chromosome is loaded into memory and both strands are scanned for the presence of the motif sequences, which may contain ambiguity characters. The resulting binary files are then processed by a further executable, which is called according to the task at hand. MeStudio core cross-references the positions of methylated bases, expressed relative to the start of the reference sequence, with the features described above, producing GFF3 files that serve as input for the final analysis stage. This is a computationally expensive part of the pipeline, in which multiple nested loops and calculations are performed. Integrating one motif on a four-contig genome (6,973,268 bp, 23,433 GANTC motif matches) took 27.116 s on a single AMD Opteron 6380 processor (2.5 GHz).
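The matching stage can be illustrated with a minimal sketch. It is not the MeStudio core implementation: it is written in Python for readability, the function names (`naive_matches`, `scan_both_strands`, `reverse_complement`) are hypothetical, and it assumes that the ambiguity characters follow the standard IUPAC nucleotide code. The reverse strand is scanned by matching the reverse complement of the motif against the forward sequence, so all positions are reported in forward-strand coordinates.

```python
# IUPAC nucleotide codes mapped to the bases each one may stand for
# (assumption: the motif ambiguity characters follow this code).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement table that also covers the ambiguity codes (e.g. R <-> Y).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of a sequence or motif."""
    return seq.translate(COMPLEMENT)[::-1]


def naive_matches(sequence, motif):
    """Return 0-based start positions where the motif matches the sequence.

    Naive scan: every window is compared base by base, giving
    O(len(sequence) * len(motif)) comparisons in the worst case.
    """
    hits = []
    m = len(motif)
    for start in range(len(sequence) - m + 1):
        if all(sequence[start + k] in IUPAC[motif[k]] for k in range(m)):
            hits.append(start)
    return hits


def scan_both_strands(sequence, motif):
    """Return sorted (start, strand) pairs in forward-strand coordinates."""
    sequence = sequence.upper()
    motif = motif.upper()
    forward = [(pos, "+") for pos in naive_matches(sequence, motif)]
    # A reverse-strand occurrence is a forward match of the
    # reverse-complemented motif.
    reverse = [
        (pos, "-")
        for pos in naive_matches(sequence, reverse_complement(motif))
    ]
    return sorted(forward + reverse)
```

For a palindromic motif such as GANTC, every occurrence is reported once per strand, whereas a non-palindromic motif yields distinct positions on the two strands. For instance, `scan_both_strands("GAACGTTC", "GAAC")` returns `[(0, "+"), (4, "-")]`.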