Figure captions:
Fig. 1. Histogram showing the number of mzl-USCO gene pairs
analyzed in this study which occur on the same chromosome in a given
proportion of the examined taxa. The histogram shows that the proportion
of genomes in which a gene pair occurs on the same chromosome is
typically rather small.
Fig. 2. Distribution of median distances between neighboring
mzl-USCO genes, in nucleotides divided by genome size. Left: based on
real USCO data across all taxa; right: based on a random selection of
protein-coding genes for each taxon. Lines connect dots belonging to the
same taxon.
Fig. 3. Phylogenetic signal and systematic correlation of
distances between neighboring USCOs with major metazoan lineages. A: PC
axes 1 (left tree) and 2 (right tree) from a PCA on frequencies of size
classes of Metazoa-level USCO distances, mapped onto the Metazoa
phylogeny based on concatenated amino acid sequences. B: Plot of axes 1
and 2 from the same PCA, showing a clustering of major metazoan lineages
(Protostomia and Deuterostomia with unfilled and filled color symbols,
respectively).
Fig. 4. Data yield and results of analyses on mzl-USCOs
extracted from Drosophila WGS reads when applying three different
USCO extraction methods. A: Number of mzl-USCOs recovered per number of
specimens; B: ASTRAL trees based on generated USCO datasets; C: Outcome
of SNP clustering analyses with STRUCTURE; D: NMDS plots of SNP
similarity.
Fig. 5. Species delimitation in the four case studies using
the programs tr2 and SODA, applied to each dataset from the three
extraction methods. Colored boxes indicate that inferred species
entities match currently recognized morphospecies.
Figure S1. Proportion of pairwise sequence overlap in the
concatenated alignment of USCO loci between pairs of chromosome-level
annotated metazoan genomes.
Figure S2. Maximum likelihood phylogenetic tree based on
concatenated amino acid USCO sequences of all analyzed chromosome-level
annotated genomes of Metazoa. Numbers above branches are support values
from approximate likelihood ratio tests and ultrafast bootstrapping.
Figure S3. Maximum likelihood phylogenetic tree based on
concatenated nucleotide USCO sequences (codon positions 1 and 2) of all
analyzed chromosome-level annotated genomes of Metazoa. Numbers above
branches are support values from approximate likelihood ratio tests and
ultrafast bootstrapping.
Figure S4. Multispecies coalescent-based phylogenetic tree
based on gene trees of amino acid USCO sequences of all analyzed
chromosome-level annotated genomes of Metazoa. Numbers above branches
are local posterior probabilities.
Figure S5. Multispecies coalescent-based phylogenetic tree
based on gene trees of nucleotide USCO sequences (codon positions 1 and
2) of all analyzed chromosome-level annotated genomes of Metazoa.
Numbers above branches are local posterior probabilities.
Figure S6. Quotient of median distance between neighboring
mzl-USCOs to the median distance between neighboring randomly selected
annotated protein-coding genes, mapped onto the Metazoa phylogeny based
on concatenated amino acid sequences.
Figure S7. Axes 1 and 2 of a PCA on frequencies of size classes
of distances between neighboring Metazoa-level USCOs mapped onto the
Metazoa phylogeny based on concatenated amino acid sequences (detailed
version with taxon names of analyzed chromosome-level genomes).
Figure S8. Number of mzl-USCOs recovered per number of
specimens when applying different USCO extraction methods.
Figure S9. Proportion of pairwise sequence overlap in the
concatenated alignment of USCO loci between pairs of specimens within
each case study (Anopheles, Drosophila, Heliconius,
Darwin’s finches) analyzed in the present investigation, sorted by
extraction method (BUSCO, Orthograph + OrthoDB v. 9, Orthograph +
OrthoDB v. 10).
Figure S10. Phylogenetic trees of Anopheles species
inferred with concatenated USCO nucleotide sequences (above) and with
the multispecies coalescent (below) generated with different USCO
extraction methods.
Figure S11. Phylogenetic trees of Drosophila species
inferred with concatenated USCO nucleotide sequences (above) and with
the multispecies coalescent (below) generated with different USCO
extraction methods.
Figure S12. Phylogenetic trees of Heliconius species
inferred with concatenated USCO nucleotide sequences (above) and with
the multispecies coalescent (below) generated with different USCO
extraction methods.
Figure S13. Phylogenetic trees of Darwin’s finches inferred
with concatenated USCO nucleotide sequences (above) and with the
multispecies coalescent (below) generated with different USCO extraction
methods.
Figure S14. NMDS plots showing similarities between specimens
inferred with SNP data of mzl-USCOs for the four study groups based on
datasets generated with different data extraction methods.
Figure S15. Diagrams of STRUCTURE clustering results inferred
with SNP data of mzl-USCOs for the four study groups based on datasets
generated with different data extraction methods.
Figure S16. ML trees of concatenated multiple nucleotide
sequence alignments of 580 genes classified as mzl-USCOs in both OrthoDB
versions v.9 and v.10 and extracted with three methods from Anopheles
genomic data. Trees, from left to right, are based on:
1) all data, 2) data after excluding alignment positions with missing
data and gaps (gaps excluded), 3) a manually corrected alignment
(corrected), and 4) a manually corrected alignment with additional
exclusion of alignment positions with missing data and gaps (corrected +
gaps excluded).
Figure S17. Coalescent-based trees inferred in the Anopheles
case study with data from the three USCO extraction
approaches aligned in a single dataset using only those 580 genes
classified as mzl-USCOs in both OrthoDB v.9 and v.10. Trees, from left
to right, are based on: 1) all data, 2) data after excluding alignment
positions with missing data and gaps (gaps excluded), 3) a manually
corrected alignment (corrected), and 4) a manually corrected alignment
with additional exclusion of alignment positions with missing data and
gaps (corrected + gaps excluded).
Figure S18. ML trees of concatenated multiple nucleotide
sequence alignments of 580 genes classified as mzl-USCOs in both OrthoDB
v.9 and v.10 and extracted with three methods from Drosophila genomic
data. Trees, from left to right, are based on: 1) all data, 2)
data after excluding alignment positions with missing data and gaps
(gaps excluded), 3) a manually corrected alignment (corrected), and 4) a
manually corrected alignment with additional exclusion of alignment
positions with missing data and gaps (corrected + gaps excluded).
Figure S19. Coalescent-based trees inferred in the Drosophila
case study with data from the three USCO extraction
approaches aligned in a single dataset using only those 580 genes
classified as mzl-USCOs in both OrthoDB v.9 and v.10. Trees, from left
to right, are based on: 1) all data, 2) data after excluding alignment
positions with missing data and gaps (gaps excluded), 3) a manually
corrected alignment (corrected), and 4) a manually corrected alignment
with additional exclusion of alignment positions with missing data and
gaps (corrected + gaps excluded).
Figure S20. ML trees of concatenated multiple nucleotide
sequence alignments of 580 genes classified as mzl-USCOs in both OrthoDB
v.9 and v.10 and extracted with three methods from Heliconius genomic
data. Trees, from left to right, are based on: 1) all data, 2)
data after excluding alignment positions with missing data and gaps
(gaps excluded), 3) a manually corrected alignment (corrected), and 4) a
manually corrected alignment with additional exclusion of alignment
positions with missing data and gaps (corrected + gaps excluded).
Figure S21. Coalescent-based trees inferred in the Heliconius
case study with data from the three USCO extraction
approaches aligned in a single dataset using only those 580 genes
classified as mzl-USCOs in both OrthoDB v.9 and v.10. Trees, from left
to right, are based on: 1) all data, 2) data after excluding alignment
positions with missing data and gaps (gaps excluded), 3) a manually
corrected alignment (corrected), and 4) a manually corrected alignment
with additional exclusion of alignment positions with missing data and
gaps (corrected + gaps excluded).
Figure S22. ML trees of concatenated multiple nucleotide
sequence alignments of 580 genes classified as mzl-USCOs in both OrthoDB
v.9 and v.10 and extracted with three methods from genomic data of
Darwin’s finches. Trees, from left to right, are based on: 1) all data,
2) data after excluding alignment positions with missing data and gaps
(gaps excluded), 3) a manually corrected alignment (corrected), and 4) a
manually corrected alignment with additional exclusion of alignment
positions with missing data and gaps (corrected + gaps excluded).
Figure S23. Coalescent-based trees inferred in the Darwin’s
finches case study with data from the three USCO extraction approaches
aligned in a single dataset using only those 580 genes classified as
mzl-USCOs in both OrthoDB v.9 and v.10. Trees, from left to right, are
based on: 1) all data, 2) data after excluding alignment positions with
missing data and gaps (gaps excluded), 3) a manually corrected alignment
(corrected), and 4) a manually corrected alignment with additional
exclusion of alignment positions with missing data and gaps (corrected +
gaps excluded).
Figure S24. Results of species delimitation using tr2 and SODA
in each case study and applying each of the three data extraction
approaches.
Table S1. Metazoan genomes assembled to chromosome level
included in this study, with numbers of single-copy mzl-USCO genes found
in these genomes with the BUSCO software, number of chromosomes, genome
size, median distance between neighboring USCOs, median distance between
neighboring randomly chosen annotated protein-coding genes, logarithms
of those two distances, the distances divided by genome size, the
quotient between these distances, p-value based on 10,000 replicates for
the mzl-USCO distance being smaller, adjusted evenness values for
chromosome length, number of coding genes, number of mzl-USCOs,
chi-square values for distribution of mzl-USCOs compared to chromosome
length and to number of coding genes and p-values derived from the
chi-square tests.
Table S2. NCBI accession numbers of the raw reads from
individuals analyzed in the four taxonomic case studies.