Case studies: phylogeny and nucleotide sequence variation of mzl-USCOs
In all four case studies, most interspecific and many intraspecific nodes of the inferred phylogenetic trees had high (i.e., > 90) branch support and showed few topological differences between datasets obtained by different USCO extraction methods (Fig. 4; Figures S10–13). We also detected few topological differences between trees inferred from concatenation supermatrices and trees inferred by using a multispecies coalescent approach on gene trees.
In the Anopheles gambiae complex, the topology of interspecific nodes in all USCO-based trees (Figure S10) was identical to the published one inferred by applying the maximum likelihood optimality criterion on aligned WGS data (Fontaine et al., 2015), except that we found neither A. gambiae nor A. coluzzii to be monophyletic. However, the topology of Fontaine et al. (2015) differed from the species tree inferred by the same authors from the X chromosome data only. According to Fontaine et al. (2015), the X chromosome-derived tree more likely represents the true phylogeny of the group, because the remainder of the genome exhibits extensive signatures of introgression. The USCO-derived topology suggested the monophyly of all species except A. gambiae and A. coluzzii. Monophyly of the latter was also not found in the study by Fontaine et al. (2015) when analyzing SNPs extracted from WGS data with the neighbor-joining tree inference method. Only in the tree obtained from concatenated data containing mzl-USCOs extracted with Orthograph/OrthoDB v. 9 were both species found to be reciprocally monophyletic. NMDS plots that visualized the similarity in SNPs showed nearly all species as clearly distinct clusters irrespective of the applied USCO extraction method (Figure S14). The only exceptions were A. gambiae and A. coluzzii, which together formed a single cluster. Our model-based clustering analyses using STRUCTURE also showed all species, with the exception of A. coluzzii and A. gambiae, as separate clusters with some levels of admixture (Figure S11; Supplementary Text).
In the Drosophila nasuta complex, our analyses inferred most species to be monophyletic (Figure 4; Figure S11). These findings are largely consistent with those reported by Mai et al. (2019). (Sub)species that had not been inferred as monophyletic in our phylogenetic analyses were also not resolved when applying NMDS or STRUCTURE (Figure 4). Otherwise, all (sub)species were clearly distinguishable from each other (Supplementary Text).
Regarding Heliconius butterflies, our phylogenies inferred from analyzing mzl-USCOs largely agreed with the phylogeny published by Martin et al. (2013) (Figure S12). We found only a few topological differences between analyses that were based on different data extraction approaches and/or phylogenetic reconstruction methods (see Supplementary Text for details). STRUCTURE (Pritchard et al., 2000) and NMDS revealed clusters that were largely consistent with the topology of the phylogenetic trees, with a few exceptions described in the Supplementary Text. Analyses based on the datasets from the three USCO extraction approaches gave very similar results (Fig. 4; Figures S14, 15). Even when we allowed STRUCTURE to find more clusters than there are known (sub)species in the analyzed sample by specifying a K value higher than 5, the clustering never supported more than five clusters, and individuals were always assigned to clusters with a probability of more than 90%. A small amount of admixture was detected between sympatric populations (e.g., those of Heliconius melpomene and H. timareta in Peru).
In Darwin’s finches, the alignment completeness of extracted mzl-USCOs was very low (Table 2). The incompleteness of the Darwin’s finch datasets was likely caused by low sequence coverage (< 10x) and, in consequence, poor assembly quality. Therefore, for the analysis of sequence variation, we included not only SNPs present in all individuals (as in the other case studies) but also SNPs absent in fewer than five individuals. Possibly due to the large amount of missing data in the alignments, the inferred phylogenetic trees differed in many details from each other and from the original maximum-likelihood tree based on WGS data (Lamichhaney et al., 2015). Consequently, NMDS plots of SNP similarity also did not allow us to visually distinguish between species within the genus Camarhynchus or between most species within Geospiza, except for G. difficilis and G. septentrionalis. However, differentiation between genera was clearly visible. SNP clustering with STRUCTURE likewise did not allow us to distinguish species of Camarhynchus from each other or to distinguish some species of Geospiza from each other (Supplementary Text).
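
The relaxed SNP inclusion rule used for Darwin’s finches can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function name `snp_columns`, the set of missing-data symbols, and the toy alignment are assumptions made for illustration only. The sketch keeps an alignment column if the called bases show at least two states and if no more than `max_missing` individuals lack data at that site. Setting `max_missing=0` corresponds to the strict rule applied in the other case studies, and `max_missing=4` to "absent in fewer than five individuals".

```python
from typing import Dict, List

# Assumption: these characters denote missing or uncalled data in the alignment.
MISSING = set("N-?")


def snp_columns(alignment: Dict[str, str], max_missing: int = 0) -> List[int]:
    """Return 0-based indices of variable columns with at most
    `max_missing` individuals lacking data at that column."""
    seqs = list(alignment.values())
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")

    keep = []
    for i in range(length):
        column = [s[i].upper() for s in seqs]
        called = [base for base in column if base not in MISSING]
        n_missing = len(column) - len(called)
        # A column counts as a SNP only if the called bases show >= 2 states.
        if n_missing <= max_missing and len(set(called)) > 1:
            keep.append(i)
    return keep


# Hypothetical toy alignment of three individuals (not data from this study).
toy = {
    "ind1": "ACGTN",
    "ind2": "ACCTA",
    "ind3": "AT-TG",
}

strict = snp_columns(toy, max_missing=0)   # SNPs present in all individuals
relaxed = snp_columns(toy, max_missing=1)  # tolerate one individual without data
```

In this toy example, the strict rule retains only the second column, whereas the relaxed rule additionally retains the two columns in which a single individual lacks data. Relaxing the threshold increases the number of usable SNPs at the cost of more missing data per site.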