2.2 Core-processing
The processing of the input files is handled by five executables which
we refer to as “MeStudio core”. These components match the nucleotide
motifs to the genomic sequence and map them to the corresponding
categories, which are extracted from the annotation file. Categories are
defined as follows: i) protein-coding genes, accordant (sense) strand
(CDS), ii) protein-coding genes, discordant (antisense) strand (nCDS),
iii) regions that fall between annotated genes (true intergenic, tIG),
iv) regions upstream of the reading frame of a gene, on the accordant
strand (US) (Figure 1B). The current implementation uses a naive matching algorithm
to map motif sequences to the reference genome. During the matching
stage, each replicon or chromosome is loaded into memory and both strands are
scanned for occurrences of the motif sequences, which may contain
ambiguity characters. The resulting binary files are then processed by
another executable, which is invoked for the task at hand. MeStudio core
cross-references the positions of methylated bases, given relative to
the start of the reference sequence, with the previously described
features, producing GFF3 files that serve as input for the final
analysis stage. This is a computationally
expensive part of the pipeline, in which multiple nested for loops and
calculations are performed. Integrating one motif on a four-contig
genome (6,973,268 bp, 23,433 GANTC motif matches) took 27.116 s on a
single AMD Opteron 6380 processor (2.5 GHz).
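
The matching stage described above can be illustrated with a minimal Python sketch. It is not the MeStudio implementation: the function names (naive_match, scan_both_strands, reverse_complement) are hypothetical, positions are reported as 0-based starts on the forward strand, and we assume that the ambiguity characters follow the standard IUPAC nucleotide codes (for example, the N in GANTC standing for any base).

```python
# Illustrative sketch only; names and conventions here are assumptions,
# not taken from the MeStudio source code.

# Standard IUPAC nucleotide codes: each motif symbol maps to the bases it allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement table that also covers the ambiguity codes (e.g. R <-> Y, B <-> V).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of a (possibly ambiguous) sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def naive_match(sequence, motif):
    """Return 0-based start positions where motif matches sequence.

    Naive O(len(sequence) * len(motif)) scan: every window is compared
    symbol by symbol against the set of bases each motif symbol allows.
    """
    sequence = sequence.upper()
    motif = motif.upper()
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        if all(sequence[start + i] in IUPAC[m] for i, m in enumerate(motif)):
            hits.append(start)
    return hits


def scan_both_strands(sequence, motif):
    """Return sorted (start, strand) pairs in forward-strand coordinates.

    Reverse-strand occurrences are found by matching the reverse
    complement of the motif against the forward strand.
    """
    forward = [(pos, "+") for pos in naive_match(sequence, motif)]
    reverse = [(pos, "-")
               for pos in naive_match(sequence, reverse_complement(motif))]
    return sorted(forward + reverse)
```

For instance, naive_match("AAGATTCTTGACTCAA", "GANTC") returns [2, 9]. Because GANTC is its own reverse complement, scan_both_strands reports each of these positions once per strand. The scan costs on the order of the genome length times the motif length for each motif and strand.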