Hierarchical clustering of sequences
Clustering of values from a symmetrical matrix was performed using the
UPGMA algorithm and the SciPy library31,32. The UPGMA
algorithm uses the Euclidean distance matrix as input. At each stage of
the process, the minimum value corresponding to the nearest clusters (A
and B) must be found in this matrix. Clusters A and B should be merged
into a new cluster C. The columns and rows corresponding to clusters A
and B should be replaced with a new column and row from cluster C. This
procedure should be repeated until all clusters are merged. During
clustering, the UPGMA algorithm builds a rooted-tree (dendrogram). The
length of the branches of this dendrogram corresponds to the Euclidean
distance between the two nearest clusters.
To separate clusters, you need to set the cophenetic distance. The
cophenetic distance is the height of the dendrogram where two branches,
including two objects, merge into one branch33. In
this work, a cophenetic distance of 70% of the final confluence height
on the dendrogram was used.