Figure captions
Figure 1. Symmetric matrix of the average PID values. The matrix contains 60 sub-clusters of Hsp60 sequences from 19 phyla. The X- and Y-axis items “Sub-cluster” are represented in the following format “Phylum #sub-cluster (number of sequences in a sub-cluster)”. Sub-clusters of Viruses have no phylum labels. The Y-axis “Kingdom” represents sub-clusters united by a higher taxonomic rank (Kingdom). The black frames and the X-axis “Cluster” show four clusters and their numbers (Roman numerals), which were obtained by clustering 60 sub-clusters. Clustering was performed using the UPGMA algorithm.
Figure 2. Heat map showing normalized average PID values between Hsp60 sequences belonging to sub-clusters Arthropoda #2, Arthropoda #3, and Nematoda #1 and Hsp60 sequences belonging to each of 19 phyla. The average PID values were normalized using the min-max normalization method. The 19 phyla were sorted by NCBI Taxonomy.
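As an illustration of the min-max normalization named in this caption, the following is a minimal sketch. The function name and the choice to rescale one list of values at a time are assumptions made here; the caption does not state whether normalization was applied per row, per column, or over the whole matrix.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: the rescaling is undefined.
        # Returning zeros is an assumption of this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical average PID values (illustrative numbers, not data from the study).
example = min_max_normalize([40.0, 55.0, 70.0])
```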
Figure 3. The average amino acid composition of the Hsp60 sequences from 19 phyla: a - Heat map displaying normalized average PID values between 19 phyla of Hsp60 sequences; b - The average amino acid composition of the Hsp60 sequences for each of the 19 phyla compared to the corresponding proteomic values. In Figure 3a, the average values were normalized using the min-max normalization method, and the “Summary” line presents the average normalized amino acid composition of Hsp60 across the 19 phyla. In Figure 3b, the amino acid profiles show the average amino acid composition of Hsp60 for each of the 19 phyla relative to the average amino acid composition of the respective proteomes, and the “Summary” line shows the average amino acid profile of Hsp60 across the 19 phyla. The color scale is as follows: Higher/Lower - the amino acid content in Hsp60 is higher/lower than in the proteomes, respectively; Comparable - the amino acid content in Hsp60 is comparable to the average proteomic value. The groups were sorted according to NCBI Taxonomy. Amino acids were sorted by the average amino acid composition of 19220 Hsp60 sequences.
Figure 4. The average nucleotide composition of the Hsp60 genes from 17 phyla: a - The average total GC content at each codon position of Hsp60 sequences and the corresponding genomes; b - The average content of GC1, GC2, and GC3 in Hsp60 genes. Phyla were sorted by the average total GC content of Hsp60 sequences. Student’s t-test was used to compare the average GC content of the Hsp60 sequences with the average GC content of the corresponding genomes. The difference between two independent samples of GC values was considered statistically significant if the p-value was less than 0.05. Statistically indistinguishable average GC values are marked with “ns” (non-significant).
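The GC1, GC2, and GC3 quantities referred to in this caption can be sketched as follows. This is a simplified illustration, not the procedure used in the study: the function name is hypothetical, the sequence is assumed to be an in-frame coding sequence, and all codons are counted, whereas a strict "third synonymous position" measure would exclude codons without synonymous alternatives.

```python
def gc_by_codon_position(cds):
    """Return (GC1, GC2, GC3) as fractions for an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence is shorter than one codon")
    counts = [0, 0, 0]
    for i in range(usable):
        if cds[i] in "GC":
            counts[i % 3] += 1  # i % 3 is the position within the codon
    return tuple(c / n_codons for c in counts)


# Toy sequence (ATG GCC AAA), used only to show the call.
gc1, gc2, gc3 = gc_by_codon_position("ATGGCCAAA")
```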
Figure 5. Neutrality plots (GC1,2 vs. GC3) for Hsp60 genes from 17 phyla. The GC1,2 values represent the average GC content at the first and second codon positions (GC1 and GC2), while the GC3 values represent the GC content at the third synonymous codon position. The solid line represents the linear regression of GC1,2 versus GC3; the correlation is described by the correlation coefficient R and its p-value. R reflects the strength of the impact of GC3 on GC1,2, and the p-value characterizes the significance of R. When the p-value of R is less than 0.05, changes in GC3 are considered to affect the GC1,2 values. When the p-value is greater than 0.05, changes in GC3 are considered random and the R coefficient is not meaningful, i.e. the GC3 and GC1,2 values are not correlated. The slope ε of the regression line indicates the neutrality of the codon usage. Neutrality values were determined as ε × 100 (%). Slope values, ranging from 0 to 1, were calculated using least-squares regression analysis. The dashed line is the complete neutrality plot, which reflects the complete equilibrium of the nucleotide composition of the gene/genome with directional mutation pressure. The equilibrium point Ep was defined as the intersection of the neutrality plot (regression line) and the complete neutrality plot. The Ep value reflects the GC3 content of the gene/genome when the mutation frequencies (AT→GC and GC→AT) are equal. The direction of the mutational pressure, indicating an imbalance between the frequencies of the AT→GC and GC→AT mutations, was determined as follows: an average GC content less than the Ep value reflects AT mutational pressure; an average GC content greater than the Ep value reflects GC mutational pressure. Phyla were sorted by the average total GC content of Hsp60 genes.
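The quantities described in this caption (slope ε, correlation coefficient R, equilibrium point Ep, and the direction of mutational pressure) can be sketched as below. The function names are hypothetical, and the sketch assumes that the complete neutrality line is the diagonal GC1,2 = GC3, so that Ep is the GC3 value where the regression line crosses that diagonal. The p-value of R is not computed here.

```python
from math import sqrt


def neutrality_regression(gc3, gc12):
    """Least-squares fit of GC1,2 on GC3; returns (slope, intercept, r, ep)."""
    n = len(gc3)
    mean_x = sum(gc3) / n
    mean_y = sum(gc12) / n
    sxx = sum((x - mean_x) ** 2 for x in gc3)
    syy = sum((y - mean_y) ** 2 for y in gc12)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gc3, gc12))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    # Assumed complete neutrality line: y = x. Solving x = slope * x + intercept
    # gives the equilibrium point; it is undefined when the lines are parallel.
    ep = intercept / (1 - slope) if slope != 1 else None
    return slope, intercept, r, ep


def pressure_direction(mean_gc, ep):
    """Apply the caption's rule: mean GC below Ep -> AT, above Ep -> GC."""
    if mean_gc < ep:
        return "AT"
    if mean_gc > ep:
        return "GC"
    return "balanced"  # label for the boundary case is an assumption


# Toy values chosen so that the fit is exact (not data from the study).
slope, intercept, r, ep = neutrality_regression(
    [0.2, 0.4, 0.6, 0.8], [0.42, 0.44, 0.46, 0.48]
)
neutrality_percent = slope * 100  # neutrality as defined in the caption
```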
Figure 6. Neutrality plot for Hsp60 genes of Chordata.
Figure 7. Nc-plots of codon usage bias in Hsp60 genes from the 17 phyla. Gray scatter plots represent ENC values versus GC3 content for Hsp60 genes from 17 phyla. The black bell-shaped curves represent the expected effective number of codons (ENCexp), i.e. the ENC values predicted if codon usage bias in the Hsp60 gene is influenced only by GC3 content (GC content at the third synonymous position of codons). Phyla were sorted by the average total GC content of Hsp60 genes.
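The bell-shaped ENCexp curve mentioned in this caption is commonly drawn with Wright's (1990) approximation, ENCexp = 2 + s + 29 / (s² + (1 − s)²), where s is the GC3 fraction. The sketch below assumes this formula; the caption itself does not state which expression was used, and the function name is hypothetical.

```python
def expected_enc(gc3):
    """Expected effective number of codons under GC3-only constraint.

    Uses Wright's approximation; gc3 is a fraction between 0 and 1.
    """
    s = gc3
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)


# Points along the curve, e.g. for overlaying on an ENC-versus-GC3 scatter plot.
curve = [(i / 100, expected_enc(i / 100)) for i in range(101)]
```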
Figure 8. Clustering of ENC values for Hsp60 genes from Chordata. Cluster #1 includes the ENC values of Hsp60 genes of Mammalia, Aves, Reptilia, and Amphibia. Cluster #2 includes ENC values of Hsp60 genes of Fish. Clustering was performed using the value of the GC3 content corresponding to the equilibrium point Ep, which was determined earlier (see GC-content and mutation pressure for codon usage).
Figure 9. Nc-plot of the average codon usage bias in the Hsp60 genes of 17 phyla. The plot space was divided into six regions using the ENC and GC3 thresholds. The ENC thresholds, reflecting the level of Hsp60 gene expression, were as follows: ENC < 40 for genes with high expression; 40 < ENC ≤ 55 for moderately expressed genes; ENC > 55 for genes with low expression. The GC3 thresholds, reflecting the direction of the mutational pressure, were as follows: GC3 < 0.5 represents AT mutation pressure; GC3 > 0.5 represents GC mutation pressure. The average ENC values were grouped according to the resulting regions: Group 1 (Apicomplexa, Firmicutes, and Bacteroidetes); Group 2 (Chlamydiae, Streptophyta, Nematoda, Mollusca, Cyanobacteria, and Chordata); Group 3 (Euryarchaeota, Arthropoda, and Euglenozoa); Group 4 (Ascomycota, Proteobacteria, Basidiomycota, Chlorophyta, and Actinobacteria). The black bell-shaped curve represents the expected effective number of codons (ENCexp), i.e. the ENC values predicted if the codon bias in the Hsp60 gene is influenced only by the GC content at the third synonymous position of codons (GC3). The horizontal dashed lines (ENC = 40 and ENC = 55) indicate the ENC thresholds for determining the codon usage bias and gene expression level. The vertical dashed line indicates a GC3 content of 0.5. The position of an ENC value relative to this line indicates the direction of the mutation pressure affecting the Hsp60 genes (depicted by arrows).
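The thresholds in this caption amount to a simple two-way classification, sketched below. The function name and labels are hypothetical. Two boundary cases are not defined in the caption and are handled by assumption: ENC = 40 is assigned to the moderate class, and GC3 = 0.5 is labeled as having no dominant direction. The mapping from regions to Groups 1-4 is not reproduced, since the caption does not state it.

```python
def classify_enc_gc3(enc, gc3):
    """Return (expression_level, pressure_direction) from the caption's thresholds."""
    if enc < 40:
        expression = "high"
    elif enc <= 55:
        expression = "moderate"  # includes ENC == 40 (assumption)
    else:
        expression = "low"

    if gc3 < 0.5:
        pressure = "AT"
    elif gc3 > 0.5:
        pressure = "GC"
    else:
        pressure = "none"  # GC3 == 0.5 is not covered by the caption (assumption)
    return expression, pressure


# Illustrative values only, not data from the study.
label = classify_enc_gc3(47.2, 0.31)
```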
Figure 10. Symmetric matrix of p-values of the t-test between the ENC values of the Hsp60 genes from 17 phyla. Pairs of phyla whose Hsp60 ENC values are statistically indistinguishable (t-test p-value greater than 0.05) are marked in black. Pairs of phyla whose Hsp60 ENC values are statistically different (t-test p-value less than 0.05) are marked in white. The black frames and the “Cluster” X-axis represent the four clusters and their numbers (Roman numerals). Clustering was carried out using the UPGMA algorithm.
Figure 11. Average values of relative synonymous codon usage (RSCU) for Hsp60 genes from 17 phyla. The RSCU values for each of the 17 phyla were divided into two main groups according to the type of base at the third synonymous position of the codon: a - A/T-ending codons; b - G/C-ending codons. Phyla were divided into three groups by the direction of the mutational pressure (see Table 2) and sorted by the average total GC content of Hsp60 genes. The codons were sorted by the average RSCU value across the 17 phyla.
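For reference, RSCU is conventionally defined as the observed count of a codon divided by the mean count of all synonymous codons for the same amino acid. The sketch below follows that definition under stated assumptions: the standard genetic code is used, stop codons and single-codon amino acids (Met, Trp) are omitted, and the function and variable names are hypothetical rather than taken from the study.

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order (assumption: no alternative code applies).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for amino acids with synonymous codons present."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Group sense codons into synonymous families by amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    result = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0 or len(codons) == 1:
            continue  # amino acid absent, or no synonymous alternatives
        for c in codons:
            # observed count / (family total / family size)
            result[c] = counts[c] * len(codons) / total
    return result


# Toy sequence with four alanine codons (GCT x3, GCC x1).
values = rscu("GCTGCTGCTGCC")
```

Splitting the result by the last base of each codon key (A/T versus G/C) would give the two panels described in the caption.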
Figure 12. Summary patterns of relative synonymous codon usage for Hsp60 genes under different mutational pressures. The average RSCU values of Hsp60 genes from phyla with AT, AT/GC, and GC mutational pressure were divided into two main groups according to the type of base at the third synonymous position of the codon: a - A/T-ending codons; b - G/C-ending codons. The codons were sorted by the average RSCU value across all 17 phyla.