Hierarchical clustering of sequences
Clustering of values from a symmetrical matrix was performed using the UPGMA algorithm and the SciPy library31,32. The UPGMA algorithm uses the Euclidean distance matrix as input. At each stage of the process, the minimum value corresponding to the nearest clusters (A and B) must be found in this matrix. Clusters A and B should be merged into a new cluster C. The columns and rows corresponding to clusters A and B should be replaced with a new column and row from cluster C. This procedure should be repeated until all clusters are merged. During clustering, the UPGMA algorithm builds a rooted-tree (dendrogram). The length of the branches of this dendrogram corresponds to the Euclidean distance between the two nearest clusters.
To separate clusters, you need to set the cophenetic distance. The cophenetic distance is the height of the dendrogram where two branches, including two objects, merge into one branch33. In this work, a cophenetic distance of 70% of the final confluence height on the dendrogram was used.