Case studies using mzl-USCOs from whole genome sequences: data extraction
To investigate the usefulness of mzl-USCOs for resolving species boundaries in recent radiations, and to assess the practicability of the data extraction and assembly pipelines that we developed and applied, we analyzed mzl-USCOs obtained from raw reads of WGS data sets of species from four well-studied radiations: Heliconius butterflies, Darwin’s finches, Anopheles mosquitoes, and Drosophila fruit flies (Table 1; Table S2). Each of these four case studies included multiple specimens of each species. The WGS raw reads were downloaded from NCBI. To assemble the genomic raw reads into individual USCOs, we extracted mzl-USCOs (Eberle et al., 2020; Dietz et al., 2023) from one fully assembled and annotated genome per study group (Table 1) and then used each extracted gene as a reference onto which the raw reads of each individual were mapped (see below).
One rationale for prioritizing USCOs over other nuclear genomic markers (Eberle et al., 2020) is that they allow us to build a comprehensive database in which USCO data from different taxonomic groups are stored. These data can be obtained at different times (i.e., with different ortholog sets) and with different data extraction approaches (e.g., DNA target enrichment, WGS; Eberle et al., 2020; Dietz et al., 2023). To evaluate the data yield and the ability to resolve species-level relationships with different extraction approaches and genome reference systems (Zdobnov et al., 2017; Kriventseva et al., 2019), mzl-USCO nucleotide sequences were extracted from the reference genomes of the four case studies with three different methods. In the first approach, exonic nucleotide sequences of USCOs were extracted from the assembled genomes with the BUSCO program v. 4.0.6 (Simão et al., 2015; Manni et al., 2021) in genome mode with the Metazoa dataset from OrthoDB v. 10 (Kriventseva et al., 2019); the result is hereafter referred to as the BUSCO data set. In the second approach, Orthograph v. 0.7.1 (Petersen et al., 2017) was used with HMMs from OrthoDB v. 9 (Zdobnov et al., 2017); the result is hereafter referred to as the OrthoDB v. 9 data set. For this, we downloaded the official gene sets (OGS) of all species included in the Metazoa OrthoDB v. 9 dataset from the OrthoDB site and the HMMs and information files for that dataset from the BUSCO website (https://busco-archive.ezlab.org/v3/). We used these files to create an SQLite database with Orthograph, which was then used together with the HMMs from BUSCO to extract the respective USCO nucleotide sequences from the coding sequences (CDS) of each taxon’s OGS, running Orthograph with its default settings. Our methodology was thus identical to the one used in approach A2 by Dietz et al. (2023) to assemble USCO raw reads retrieved via DNA target enrichment.
The third approach was identical to the second, except that we used OrthoDB v. 10 (https://busco.ezlab.org/busco_v4_data.html) instead of OrthoDB v. 9; the result is hereafter referred to as the OrthoDB v. 10 data set.
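For the BUSCO-based approach, loci that BUSCO reports as complete and single-copy are the natural candidates for reference sequences. The following minimal Python sketch illustrates how such loci could be selected from a BUSCO full table. It assumes the tab-separated layout written by BUSCO v. 4 (comment lines starting with "#", the BUSCO identifier in the first column, and the status "Complete", "Duplicated", "Fragmented", or "Missing" in the second). The function name and the example identifiers are hypothetical, and this is not the script used in the present study.

```python
def single_copy_ids(lines):
    """Return BUSCO identifiers whose status is 'Complete' (single-copy).

    `lines` is any iterable of text lines from a BUSCO full table.
    Assumed layout: tab-separated, '#' comment lines, identifier in
    column 1 and status in column 2.
    """
    ids = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and the commented header block.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # 'Duplicated', 'Fragmented' and 'Missing' loci are discarded.
        if len(fields) >= 2 and fields[1] == "Complete":
            ids.append(fields[0])
    return ids


# Hypothetical example table (identifiers and coordinates are invented).
example = [
    "# Busco id\tStatus\tSequence\tGene Start\tGene End\tStrand\tScore\tLength",
    "100at33208\tComplete\tchr1\t100\t900\t+\t512.3\t260",
    "200at33208\tDuplicated\tchr2\t50\t700\t-\t300.1\t210",
    "200at33208\tDuplicated\tchr3\t10\t660\t+\t298.7\t210",
    "300at33208\tMissing",
    "400at33208\tComplete\tchr4\t5\t405\t-\t410.0\t130",
]
selected = single_copy_ids(example)
```

Restricting the reference to such loci avoids mapping reads against paralogous copies, which would otherwise inflate apparent heterozygosity in the consensus sequences.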
In all three approaches, nucleotide sequences of single-copy USCOs extracted from the respective genome were used as a reference against which raw reads were mapped with bwa v. 2.1 (Li & Durbin, 2009) using the software’s default settings, except that the minimum seed length was set to 30. Diploid consensus sequences, in which heterozygous sites were represented by IUPAC ambiguity codes, were generated with samtools v. 1.10 (Li et al., 2009) and bcftools v. 1.10.2 (https://github.com/samtools/bcftools). Because the reads were mapped to the reference sequence by bwa, the resulting consensus sequences were already aligned to one another and no further alignment was necessary. Phylogenetic analyses were performed with IQ-TREE v. 2.1.2 (Minh et al., 2020) using a supermatrix of the concatenated nucleotide sequences (positions with missing data or gaps were not removed at this point). The substitution model and partitioning schemes were chosen as described above, and 50 replicate analyses were performed for each dataset. Using the same method, we also inferred a tree from the nucleotide sequence alignment of each individual USCO and used the resulting gene trees as input for a multispecies coalescent analysis with ASTRAL v. 5.6.1 (Zhang et al., 2018). All trees were rooted with the outgroup taxa used in the respective original studies from which the data were taken (Table 1).
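To make the representation of heterozygous sites concrete, the following minimal Python sketch shows how a diploid consensus with IUPAC ambiguity codes can be derived from a reference sequence and a set of genotype calls. It is an illustration only: the function name and the input format (a dictionary of 0-based positions and allele pairs) are assumptions, only single-nucleotide variants are handled, and it does not reproduce the samtools and bcftools commands used in this study.

```python
# IUPAC codes for one base (homozygous) or an unordered pair (heterozygous).
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}


def diploid_consensus(reference, genotypes):
    """Build a diploid consensus sequence for one locus.

    reference: reference nucleotide sequence of the locus.
    genotypes: dict mapping a 0-based position to a pair of called
        alleles, e.g. {1: ("C", "T")}. Positions without a call keep
        the reference base. (Hypothetical input format.)
    """
    seq = list(reference.upper())
    for pos, alleles in genotypes.items():
        # A homozygous call collapses to one base, a heterozygous
        # call to the matching two-base ambiguity code.
        seq[pos] = IUPAC[frozenset(a.upper() for a in alleles)]
    return "".join(seq)


# Position 1 is heterozygous C/T; position 3 is homozygous for G.
consensus = diploid_consensus("ACGT", {1: ("C", "T"), 3: ("G", "G")})
```

Because every consensus sequence keeps the coordinates of its reference, sequences of different individuals for the same locus have equal length and can be concatenated into a supermatrix without realignment, provided that insertions and deletions are not introduced into the consensus.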