Calculation of sequence similarities and probability distributions
As a measure of sequence similarity, we computed pairwise LCS
(Longest Common Subsequence) scores between sequences in the
within-species and within-genus datasets using the sumatra program
(Mercier et al. 2013). Methodological comparisons showed that
this algorithm provides an excellent balance between performance and
computational efficiency (Jackson et al. 2016, Kopylova et al. 2016, Bhat
et al. 2019). As sumatra provides pairwise scores for all possible
pairs of sequences, the similarity scores resulting from the
within-species dataset were filtered in R (R Core Team 2020) to keep
only those representing similarities between sequences of the same
species, while the scores resulting from the within-genus dataset were
filtered to keep only those representing similarities between different
species of the same genus.
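
The scoring and filtering steps described above can be illustrated with a minimal Python sketch. It is not the sumatra implementation or the R code used in this study: the function names, the record layout (sequence identifier, genus, species epithet, sequence) and the example sequences are hypothetical, and dividing the LCS length by the length of the longer sequence is an assumed normalisation chosen for illustration only.

```python
from itertools import combinations


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_score(a: str, b: str) -> float:
    """LCS length divided by the longer sequence length (assumed normalisation)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def pairwise_scores(records):
    """Score every unordered pair of (seq_id, genus, species, sequence) records."""
    rows = []
    for (id1, g1, s1, seq1), (id2, g2, s2, seq2) in combinations(records, 2):
        rows.append({
            "id1": id1, "id2": id2,
            "genus1": g1, "genus2": g2,
            "species1": s1, "species2": s2,
            "score": lcs_score(seq1, seq2),
        })
    return rows


def keep_within_species(rows):
    """Keep pairs whose two sequences belong to the same species."""
    return [r for r in rows
            if r["genus1"] == r["genus2"] and r["species1"] == r["species2"]]


def keep_within_genus(rows):
    """Keep pairs from different species of the same genus."""
    return [r for r in rows
            if r["genus1"] == r["genus2"] and r["species1"] != r["species2"]]


# Hypothetical toy records, for illustration only.
records = [
    ("s1", "Aus", "bus", "ACGTACGT"),
    ("s2", "Aus", "bus", "ACGTTCGT"),
    ("s3", "Aus", "cus", "ACGAACGA"),
    ("s4", "Dus", "eus", "TTGTACGG"),
]
rows = pairwise_scores(records)
within_species = keep_within_species(rows)
within_genus = keep_within_genus(rows)
```

In this toy example the six possible pairs reduce to one within-species comparison (s1 and s2) and two within-genus comparisons (s1 and s3, s2 and s3); every pair involving s4 is discarded because it spans two genera.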