Calculation of sequence similarities and probability distributions
As a measure of sequence similarity, we computed pairwise longest common subsequence (LCS) scores between the sequences in the within-species and within-genus datasets using the sumatra program (Mercier et al. 2013). Methodological comparisons have shown that this algorithm provides an excellent balance between performance and computational efficiency (Jackson et al. 2016, Kopylova et al. 2016, Bhat et al. 2019). Because sumatra returns scores for all possible pairs of sequences, the resulting similarity scores were filtered in R (R Core Team 2020). For the within-species dataset, we kept only scores between sequences of the same species; for the within-genus dataset, we kept only scores between sequences of different species belonging to the same genus.
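
As an illustration of the two steps described above (scoring all pairs, then filtering by taxonomy), the following minimal Python sketch computes an LCS-based similarity and applies both filters. It is not the sumatra implementation or the R code used in this study: the function names, the toy records, and the choice to normalize the LCS length by the longer sequence are assumptions made for the example, and a single toy dataset stands in for the two separate datasets.

```python
from itertools import combinations


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-wise dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """LCS length divided by the longer sequence length.

    Assumption: this normalization is chosen for illustration only and is
    not necessarily the one applied by sumatra.
    """
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


# Hypothetical toy records: (sequence id, genus, species, sequence)
records = [
    ("s1", "Aus", "Aus bus", "ACGTACGT"),
    ("s2", "Aus", "Aus bus", "ACGTTCGT"),
    ("s3", "Aus", "Aus cus", "ACGAACGA"),
    ("s4", "Dus", "Dus eus", "TTGTACGG"),
]

# Score every possible pair, mirroring an all-against-all comparison.
all_pairs = [
    (x, y, lcs_similarity(x[3], y[3])) for x, y in combinations(records, 2)
]

# Within-species filter: both sequences belong to the same species.
within_species = [(x[0], y[0], s) for x, y, s in all_pairs if x[2] == y[2]]

# Within-genus filter: same genus, different species.
within_genus = [
    (x[0], y[0], s) for x, y, s in all_pairs if x[1] == y[1] and x[2] != y[2]
]
```

In this toy set, the within-species filter keeps only the s1/s2 pair, and the within-genus filter keeps the two pairs that combine s3 with s1 or s2; every pair involving s4 is discarded because no other sequence shares its genus.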