3.1 Genome sequencing and assembly
To survey the genome of S. peregrina, 46.8 Gb of raw Illumina data were produced, of which 40.4 Gb of clean data were retained. A total of 28,804,585,532 17-mers were counted from the clean short reads. The 17-mer depth distribution deviated from a single Poisson-like curve and showed two peaks, suggesting high heterozygosity. Given that high heterozygosity is also common in other insects, the second peak was selected as the main peak. Based on the peak 17-mer depth of 61, the genome size was estimated to be ~472 Mb, with high heterozygosity (~3.0%) (Additional file 1: Table S3 and Figure S5) (Vurture et al. 2017). Moreover, we performed a heterozygosity analysis using SNP calling implemented in GATK v. 4.1.5.0 (Walker et al. 2018). The resulting heterozygosity ratio was ~1.65%, which is lower than that obtained from the 17-mer analysis (Additional file 1: Table S4).
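The genome size estimate above is consistent with the standard k-mer relation, in which genome size is approximately the total k-mer count divided by the depth of the main peak. The following is a minimal sketch of that arithmetic using the two values reported in the text; the function name is illustrative, and the calculation is assumed to be this simple ratio with no error-k-mer correction, which may differ from the procedure actually applied.

```python
def estimate_genome_size(total_kmers: int, peak_depth: int) -> float:
    """Estimate genome size (bp) as total k-mer count / main-peak depth.

    Assumption: the plain ratio is used, without removing low-depth
    (likely erroneous) k-mers first.
    """
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return total_kmers / peak_depth


# Values reported in the text for the 17-mer analysis.
total_kmers = 28_804_585_532
peak_depth = 61

size_bp = estimate_genome_size(total_kmers, peak_depth)
print(f"Estimated genome size: {size_bp / 1e6:.1f} Mb")  # Estimated genome size: 472.2 Mb
```

Under this assumption the ratio reproduces the reported ~472 Mb estimate.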
We generated 58.54 Gb of raw PacBio data. After quality control, 57.83 Gb of subreads were retained for genome assembly. The average length and the N50 of the subreads were 8.55 kb and 13.90 kb, respectively (Additional file 1: Table S5). The initial genome assembly was 554.66 Mb in size, with a contig N50 of 3.79 Mb and 2,031 contigs (Additional file 1: Table S6). The final de novo genome assembly was 560.31 Mb in size, with a contig N50 of 3.84 Mb, a longest contig of 20.90 Mb and 2,031 contigs (Table 1, Additional file 1: Table S6). Completeness assessment indicated that the genome assembly covered 97.9% complete BUSCOs and 97.1% single-copy BUSCOs, with only 1.4% missing BUSCOs (Additional file 1: Table S7).
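For readers unfamiliar with the N50 statistic reported above, the following is a minimal sketch of how it is conventionally computed: sequences are sorted from longest to shortest, and N50 is the length of the sequence at which the cumulative length first reaches half of the total. The function name and the toy lengths are hypothetical and are not taken from the assembly described here.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # First sequence at which the cumulative length reaches half the total.
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical toy contig lengths (arbitrary units), for illustration only.
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))  # 8
```

In the toy example the total length is 54, and the three longest sequences (10 + 9 + 8 = 27) reach half of it, so the N50 is 8.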
A total of 159.4 Gb of raw Hi-C data were produced, consisting of 1,063,074,766 paired-end reads (Additional file 1: Table S8). After quality control, 153.8 Gb of clean data were obtained, containing 96.45% of clean paired-end reads (Additional file 1: Table S9), which were used as input for the Juicer and 3d-DNA Hi-C analysis and scaffolding pipelines. Finally, sequences with a total length of 548.19 Mb were anchored onto six pseudochromosomes, accounting for 97.76% of the draft assembled genome (Fig. 1), which is consistent with the karyotype of six chromosomes based on cytological observation in S. peregrina (Agrawal et al. 2010) (Fig. 2a, Additional file 1: Table S10). Although the assembled genome is more than twice the size of that of D. melanogaster, the six pseudochromosomes in the assembled genome could largely be aligned against the D. melanogaster genome (Fig. 2b). Completeness assessment indicated that the Hi-C genome assembly covered 98.2% complete BUSCOs and 97.4% single-copy BUSCOs, with only 0.8% duplicated BUSCOs (Additional file 1: Table S11).