Phylogenetic analysis of metazoan genomes
We performed phylogenetic analyses with the Metazoa-level USCO
nucleotide sequences to assess their reliability in recovering
phylogenies and classifications. To this end, we analyzed all
orthologous nucleotide sequences of each mzl-USCO gene from all genome
assemblies in which more than half of the loci were recovered as being
complete and single-copy. Nucleotide and amino acid sequences of USCOs
were taken from the output of the BUSCO software. Amino acid sequences
were aligned with MAFFT v. 7.305b (Katoh & Standley, 2013) using the
L-INS-I algorithm. Poorly aligned regions were identified and removed
from the amino acid alignments with ALISCORE v. 2.0 (Misof & Misof,
2009; Kück et al., 2010) and ALICUT v. 2.31 (available from: https://github.com/PatrickKueck/AliCUT),
and outlier sequences were identified and removed with OliInSeq v. 0.9.3
(https://github.com/cmayer/OliInSeq).
Multiple nucleotide sequence alignments based on the amino-acid
alignments were inferred with pal2nal v. 14.1 (Suyama et al., 2006), and
all third codon positions were excluded with a custom Perl script
(Supplementary Material). Maximum-likelihood analyses were performed
with IQ-TREE v. 2.1.2 (Minh et al., 2020) on multiple sequence
alignments of individual genes and on concatenated multiple sequence
alignments of all genes, using either amino-acid sequence data or
nucleotide sequence data with third codon positions removed. For
both the concatenated nucleotide dataset and the concatenated amino-acid
dataset, the best fitting substitution model and partitioning scheme
were inferred with ModelFinder (Chernomor et al., 2016; Kalyaanamoorthy
et al., 2017) and PartitionFinder (Lanfear et al., 2014) as implemented
in IQ-TREE using the full list of models and the IQ-TREE option -m
MFP+MERGE. Data blocks in the partition merging steps were the USCO
genes. For analyzing the nucleotide dataset, we applied the inferred
substitution model and partitioning scheme and performed 50 replicate
maximum likelihood tree searches from random starting trees. For the
amino-acid dataset, we performed a single maximum likelihood tree
search, because the computational cost of replicate searches would have
been out of proportion to the expected benefit. Branch support was
estimated from 1,000 ultrafast bootstrap replicates (UFBoot; Hoang et
al., 2018) as well as approximate likelihood ratio tests (aLRT) using
nearest neighbor interchange (NNI) as the tree rearrangement method.
Among all replicates, the tree with the highest likelihood was then
selected. The individual gene trees were further
used for a coalescent-based tree analysis with ASTRAL v. 5.6.1 (Zhang et
al., 2018) applying the program’s default settings.
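The exclusion of third codon positions described above can be illustrated with a minimal Python sketch. It is not the authors' Perl script (which is provided in the Supplementary Material). It assumes a codon alignment held in memory as a mapping from taxon name to aligned sequence, with an alignment length that is a multiple of three, as expected from a pal2nal-style codon alignment. The function names and the example data are hypothetical.

```python
def drop_third_codon_positions(seq: str) -> str:
    """Return the sequence without every third character (codon position 3).

    Assumes the sequence is codon-aligned, i.e. its length is a multiple
    of three and gaps occur in whole-codon blocks.
    """
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    # Keep the first two characters of each codon and skip the third.
    return "".join(seq[i:i + 2] for i in range(0, len(seq), 3))


def strip_alignment(alignment: dict[str, str]) -> dict[str, str]:
    """Apply the third-position filter to every sequence of an alignment."""
    return {name: drop_third_codon_positions(seq)
            for name, seq in alignment.items()}


# Hypothetical toy alignment, for illustration only.
example = {
    "taxon_A": "ATGGCC---TAA",
    "taxon_B": "ATGGCTAAATAA",
}
stripped = strip_alignment(example)
```

In this toy case each sequence shrinks from 12 to 8 columns, and the whole-codon gap in taxon_A is reduced to two gap characters, so the sequences remain aligned.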
Sequence overlap in multiple sequence alignments was examined using the
concatenated alignment containing all taxa. Using a custom script
(Supplementary Material), we calculated the overlap for each pair of
individuals, defined as the number of alignment positions with data in
both individuals, divided by the number of alignment positions with data
in at least one of the two individuals.
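The overlap measure defined above can be sketched in Python as follows. This is an illustration of the definition, not the authors' script. Which characters count as "no data" is an assumption here (gap "-" and "?"); ambiguity codes such as N or X could be added through the `missing` parameter. The function names are hypothetical.

```python
from itertools import combinations

# Assumption: these characters mark alignment positions without data.
DEFAULT_MISSING = frozenset("-?")


def pairwise_overlap(seq_a: str, seq_b: str,
                     missing: frozenset = DEFAULT_MISSING) -> float:
    """Positions with data in both sequences, divided by positions
    with data in at least one of the two sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal alignment length")
    both = 0
    either = 0
    for a, b in zip(seq_a, seq_b):
        a_has = a not in missing
        b_has = b not in missing
        if a_has and b_has:
            both += 1
        if a_has or b_has:
            either += 1
    # Two sequences without any data are assigned an overlap of zero.
    return both / either if either else 0.0


def all_pairwise_overlaps(alignment: dict[str, str]) -> dict[tuple[str, str], float]:
    """Overlap for every unordered pair of individuals in the alignment."""
    return {
        (name_a, name_b): pairwise_overlap(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(alignment, 2)
    }


# Hypothetical toy alignment, for illustration only.
toy = {"ind_1": "ATG--CA?", "ind_2": "A--TGCAT", "ind_3": "ATGTGCAT"}
overlaps = all_pairwise_overlaps(toy)
```

For ind_1 and ind_2, three of the eight positions have data in both sequences and all eight have data in at least one, giving an overlap of 3/8.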