3.1 Genome sequencing and assembly
To survey the genome of S. peregrina, 46.8 Gb of raw
Illumina data were produced, of which 40.4 Gb of clean data were retained.
A total of 28,804,585,532 17-mers were counted from the clean short
reads. The 17-mer depth distribution deviated from a single Poisson-like
peak and instead showed two distinct peaks, suggesting high
heterozygosity. Given that high heterozygosity is also common in other
insects, the second peak was taken as the main peak. Based on
the main-peak 17-mer depth of 61, the genome size was estimated to be
~472 Mb, with a high heterozygosity of
~3.0% (Additional file 1: Table S3 and Figure S5)
(Vurture et al. 2017). We also estimated heterozygosity
by SNP calling with GATK v. 4.1.5.0 (Walker et al. 2018). The resulting heterozygosity ratio was
~1.65%, which is lower than the estimate from the 17-mer analysis (Additional file 1: Table S4).
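As a minimal sketch of the arithmetic behind the k-mer based estimate above, genome size can be approximated as the total number of k-mers divided by the depth of the main peak. The function name and the conversion to Mb are illustrative assumptions and not part of the authors' pipeline; only the two input values are taken from the text.

```python
def estimate_genome_size(total_kmers: int, main_peak_depth: int) -> float:
    """Approximate genome size (bp) as total k-mer count / main-peak depth."""
    if main_peak_depth <= 0:
        raise ValueError("main_peak_depth must be positive")
    return total_kmers / main_peak_depth


# Values reported in the text for the 17-mer analysis.
genome_size_bp = estimate_genome_size(28_804_585_532, 61)
genome_size_mb = genome_size_bp / 1e6
print(f"Estimated genome size: ~{genome_size_mb:.0f} Mb")
```

With the reported count and peak depth, this simple ratio reproduces the ~472 Mb estimate. Dedicated tools fit a model to the full k-mer spectrum and may apply further corrections.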
We generated 58.54 Gb of raw PacBio data. After quality control, 57.83
Gb of subreads were retained for genome assembly. The average length and
the N50 of subreads were 8.55 kb and 13.90 kb, respectively (Additional
file 1: Table S5). The initial genome assembly was 554.66 Mb in size,
with a contig N50 of 3.79 Mb across 2,031 contigs
(Additional file 1: Table S6). The final de novo genome
assembly was 560.31 Mb in size, with a contig N50 of 3.84 Mb, a longest
contig of 20.90 Mb and 2,031 contigs (Table 1,
Additional file 1: Table S6). BUSCO assessment indicated that
the genome assembly contained 97.9% complete
BUSCOs, including 97.1% single-copy BUSCOs, with only 1.4% missing
BUSCOs (Additional file 1: Table S7).
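For readers unfamiliar with the N50 statistic reported above, the following sketch shows one common way to compute it: sort sequence lengths in descending order and return the length at which the running total first reaches half of the summed length. The function name and the toy lengths are hypothetical and are not data from this assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig (or read) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # N50 is the length at which half of the total is first covered.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical toy contig lengths (arbitrary units), not from the assembly.
toy_contigs = [2, 3, 4, 5, 6]
print(n50(toy_contigs))
```

In the toy example the total length is 20, and the two longest contigs (6 and 5) together cover more than half of it, so the N50 is 5.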
A total of 159.4 Gb of raw Hi-C data were produced, consisting of
1,063,074,766 paired-end reads (Additional file 1: Table S8). After
quality control, 153.8 Gb of clean data were obtained, representing
96.45% of the paired-end reads (Additional file 1: Table S9), which
were used as input for the Juicer and 3d-DNA Hi-C analysis and
scaffolding pipelines. In total, 548.19 Mb of sequence was
anchored onto six pseudochromosomes, accounting for
97.76% of the draft genome assembly (Fig. 1). This number is
consistent with the karyotype of six chromosomes based on cytological
observation in S. peregrina (Agrawal et al. 2010)
(Fig. 2a, Additional file 1: Table S10). Although the
assembled genome is more than twice the size of the D. melanogaster genome,
the six pseudochromosomes could largely be aligned
against the D. melanogaster genome (Fig. 2b). BUSCO assessment
indicated that the Hi-C genome assembly
contained 98.2% complete BUSCOs, including 97.4% single-copy BUSCOs, with
only 0.8% duplicated BUSCOs (Additional file 1: Table S11).
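The BUSCO percentages above can be cross-checked with simple arithmetic, since BUSCO reports complete genes as the sum of single-copy and duplicated genes. The sketch below is only an illustration: the function names and the tolerance are assumptions, and the duplicated fraction derived for the contig-level assembly is implied by the rounded figures rather than reported in the text.

```python
def busco_is_consistent(complete_pct: float, single_copy_pct: float,
                        duplicated_pct: float, tol: float = 0.05) -> bool:
    """Check complete ~= single-copy + duplicated, within rounding tolerance."""
    return abs(complete_pct - (single_copy_pct + duplicated_pct)) <= tol


def implied_duplicated(complete_pct: float, single_copy_pct: float) -> float:
    """Duplicated percentage implied by complete minus single-copy."""
    return round(complete_pct - single_copy_pct, 1)


# Hi-C assembly: all three values are reported in the text.
hic_ok = busco_is_consistent(98.2, 97.4, 0.8)

# Contig-level assembly: duplicated is not reported; this value is derived
# from rounded percentages and is therefore only approximate.
contig_duplicated = implied_duplicated(97.9, 97.1)

print(hic_ok, contig_duplicated)
```

The tolerance is there because the published percentages are rounded to one decimal place, so an exact equality test would be too strict.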