Case studies: efficiency of USCO extraction
Our results confirmed that mzl-USCOs can easily be retrieved and extracted from published WGS datasets in sufficient number to be useful to infer reliable phylogenies on all systematic levels, from subspecies to phylum. We found the quantity of recovered mzl-USCOs as well as the number of retrieved alignment positions to be similar to those obtained with approaches in which USCOs were obtained by DNA target enrichment (Table 2; see also Dietz et al., 2023). Similar to the results when assembling raw reads from hybrid enrichment libraries (Dietz et al., 2023), results of gaining USCOs from WGS data were heavily influenced by the data extraction method used and the quality of the genomic data (see also Dietz et al. 2023). These factors might pose limits to the possibilities of how genomic USCO sequence data can be generated and combined in future systematic research (see below).
Independently from how Metazoa-level USCO markers were obtained, either through target enrichment (Dietz et al., 2023) or by extraction from WGS raw reads (this study), it is crucial for understanding the robustness and ability to standardize species inference with Metazoa-level USCO markers to clarify if and how differently generated mzl-USCOs can be simultaneously analyzed. This knowledge is important to evaluate the sustainability of the marker system, particularly for the case that a marker-specific database is to be created. In this context, orthology assessment is crucial, as it is used to identify and define the fundamental entities that subsequent tools rely on. Due to the increasing quality and number of available published genome assemblies, the data available from successive versions of OrthoDB is continuously evolving towards a more comprehensive taxon coverage. Consequently, the set of genes defined as mzl-USCOs (i.e., those present in single-copy in at least 90% of known genomes of a given group; in our case, Metazoa) is expected to change, but also to converge over time. This change over time is reflected by the number of single-copy genes present in at least 90% of the genomes of Metazoa in different versions of OrthoDB: it has changed from 978 in OrthoDB v. 9, to 954 in OrthoDB v. 10, to 1,268 in OrthoDB v. 11 (Kuznetsov et al., 2023). This change is likely driven by the steady addition of genomes considered for orthology prediction in OrthoDB (330 in v. 9, 448 in v. 10; 812 in v. 11). While the turnover in which genes are considered mzl-USCOs is currently still high, we expect changes regarding the set of genes flagged as mzl-USCOs becoming smaller with increased knowledge of the evolution of gene families.
Although genes identified as mzl-USCOs were not entirely identical between the different versions of OrthoDB used here, we observed that these changes had minimal impact on the overall size of USCO nucleotide alignments, on alignment completeness, on the topology of inferred phylogenetic trees, on nucleotide sequence variation (SNP) clustering, and on species delimitation results, as long as a sufficiently large number of orthologs are involved and if incomplete alignment sections are removed.
Comparing the results of BUSCO-based to the Orthograph-based extraction of USCOs, we find that the former yields generally longer nucleotide sequences per gene. This is probably because, besides the conserved region of a gene identified by the HMMs, BUSCO also includes flanking regions with a length of 5-20 kpb (Simão et al. 2015). Within these regions, the full gene is then determined by using the gene prediction of the Augustus pipeline (Stanke & Waack, 2003; Stanke et al., 2006) to extend the gene model beyond the conserved region specified by the HMM. In contrast, Orthograph searches only for the conserved region for which the HMM was created. This effect is particularly important since the mzl-USCO HMMs contain only the gene region conserved across all Metazoa. Full length coding regions could also be extracted with Orthograph by using HMMs created specific tor the individual groups, although these would then be not necessarily homologous across all groups anymore.
The additional nucleotide sequence information of data based on BUSCO software seems to impact the phylogenetic signal neither positively nor negatively. Consequently, the results of phylogenetic tree reconstruction, SNP clustering, and species delimitation analysis hardly differ between BUSCO- and Orthograph-based extraction approaches. However, our results show that, even with extensive filtering and manual curation, BUSCO- and Orthograph-derived USCO data should not be analyzed together. The resulting phylogenies obtained with mixed data may be severely misleading. Data pruning improves the results, but large discrepancies may remain, especially when using approaches that analyze concatenated alignments rather than coalescent-based approaches that rely on gene trees for phylogenetic inference. Therefore, we recommend extracting mzl-USCOs in one consistent way. The underlying problem is that alignments of systematically different sequences pose a problem to currently used alignment programs. The ongoing change in sets of genes being classified as USCOs could become a problem in terms of data overlap for the sustainability of the marker system, as datasets generated with older and newer OrthoDB versions might not be fully comparable. Here the overlap between Metazoa-level orthologs from OrthoDB v. 9 and OrthoDB v. 10 was relatively moderate (only 580 out of 978 resp. 954 genes; 59% resp. 61%). This has an impact especially on the possibility to combine future USCO data with already available mzl-USCOs generated with DNA target enrichment using baits which are based on earlier and/or different OrthoDB versions. However, we expect that future versions of OrthoDB will have an increasing amount of overlap of mzl-USCOs between versions, since genes that are present and single copy in 90% of the genomes should converge as soon as all taxonomic groups are covered evenly in OrthoDB.