Systematics with mzl-USCOs from whole genomes: data recovery of
different extraction methods
Reference genomes of each of the four study groups contained at least
90% of the mzl-USCOs in exactly one copy. We found no consistent
differences in the number of detected mzl-USCOs across the analyzed
individuals (Figure S8) irrespective of what software we used to
identify mzl-USCOs and their copy numbers. In all four study groups, all
mzl-USCOs present in the reference genomes were recovered in at least
some target individuals, and in all specimens, except some Darwin’s
finches, the majority of mzl-USCOs were recovered (Figure S8).
The concatenated multiple nucleotide sequence alignments of mzl-USCOs
extracted with the BUSCO software were more than a million sites long;
the corresponding supermatrices of USCO nucleotide sequences extracted
with Orthograph were on average about 30% shorter (Table 2). The
Orthograph/bwa-based approach consistently missed some mzl-USCOs in
some specimens: across all specimens, fewer mzl-USCOs were recovered
when using Orthograph for target gene identification than when using
BUSCO (Figure S8). Total
alignment completeness at the nucleotide level exceeded 90% in all
study groups, except in Darwin’s finches with a completeness of
45–52%. Alignment completeness of Orthograph-based datasets was
slightly lower than that of BUSCO-based datasets (Figure S9). The
number of SNP sites exceeded 5,000 in all studied taxonomic groups
except Darwin’s finches, and it was generally much smaller in the
Orthograph-derived datasets than in the BUSCO-derived ones (Table 2).
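The two summary statistics reported above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study, and both definitions are assumptions made for illustration: alignment completeness is taken here as the fraction of supermatrix cells holding an unambiguous nucleotide (A, C, G, or T), and a SNP site as an alignment column with at least two distinct unambiguous nucleotides. The function names and the toy alignment are hypothetical.

```python
from typing import Dict

# Assumption: only unambiguous bases count as data; gaps, N, and other
# ambiguity codes are treated as missing.
NUCLEOTIDES = set("ACGT")


def alignment_completeness(alignment: Dict[str, str]) -> float:
    """Fraction of alignment cells that hold an unambiguous nucleotide."""
    total = sum(len(seq) for seq in alignment.values())
    if total == 0:
        return 0.0
    filled = sum(
        base.upper() in NUCLEOTIDES
        for seq in alignment.values()
        for base in seq
    )
    return filled / total


def count_snp_sites(alignment: Dict[str, str]) -> int:
    """Number of columns with at least two distinct unambiguous nucleotides."""
    seqs = list(alignment.values())
    if not seqs:
        return 0
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")
    count = 0
    for column in zip(*seqs):
        observed = {base.upper() for base in column} & NUCLEOTIDES
        if len(observed) >= 2:
            count += 1
    return count


# Hypothetical toy supermatrix: three individuals, six aligned sites.
toy = {
    "ind1": "ACGTAC",
    "ind2": "ACGTTC",
    "ind3": "AC--NC",
}
print(f"completeness: {alignment_completeness(toy):.3f}")
print(f"SNP sites: {count_snp_sites(toy)}")
```

Under these definitions, a specimen with many unrecovered mzl-USCOs contributes long runs of missing cells, which lowers completeness and can also reduce the number of columns in which a second allele is observed.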