4 | DISCUSSION
High-quality genome sequences are critical for biological research on chromosomal structure and gene rearrangement, among other topics. Despite recent advances in sequencing technologies, many genome assemblies have not yet reached the desired level of quality. Assembling large or complex genomes remains challenging. Moreover, current technologies, such as long-read sequencing and mate-pair sequencing, cannot be used to generate high-quality genome assemblies for some rare or extinct species, because the available DNA of these species is either degraded or ancient. Therefore, in silico mate-pair assembly may still be useful, especially for species for which only degraded DNA or ancient samples are available.
The phylogenetic distance to the target species, the quality and completeness of the reference genome, and its overall synteny and transposable element content all affect the final quality of target genome assemblies. Thus, not all references are appropriate for genome assembly of a given target species. We therefore tested multiple references at different phylogenetic distances from the target species when constructing the genome assemblies of C. batrachus, T. bimaculatus, and T. buxtoni, using in silico mate-pair libraries generated from each reference separately. Overall, a reference from the same genus as the target species is better for making in silico mate pairs than more divergent references. In addition to phylogenetic distance, the quality of the reference genome also affected the target genome assembly. For example, the number of in silico mate pairs generated from the B. grunniens genome (a different genus within the same subfamily) to assemble the genome of T. buxtoni was higher than that generated from T. scriptus or T. strepsiceros (same genus). The genome of B. grunniens had an N50 of 114 Mb, which was much larger than that of T. scriptus (890 kb) or T. strepsiceros (511 kb). Nevertheless, the number of complete BUSCO genes in the target genome assembled using B. grunniens as the reference was only slightly higher than that obtained using a congener as the reference. Thus, the quality and completeness of the reference influence the final assembly, but to a lesser extent than the phylogenetic distance between the reference species and the target.
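Because the comparison above relies on N50 as a measure of contiguity, a minimal sketch of how N50 can be computed from a list of contig or scaffold lengths may be helpful. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Toy lengths (hypothetical, for illustration only).
    print(n50([2, 3, 4, 5, 6]))
```

A larger N50 indicates that more of the assembly sits in long sequences; it says nothing about correctness, which is why it is reported here alongside BUSCO completeness.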
Misassemblies, a common issue in genome assembly, are mainly caused by sequencing or assembler errors. In de novo assembly based on long sequence reads, polishing with short reads is often used to improve the base-pair accuracy of assemblies (Rice & Green, 2019). Misassemblies in reference-guided genome assemblers or scaffolders are inevitable because of unknown discrepancies in synteny and transposable element content between the references and the target species. This issue is particularly severe for assemblers designed around a single reference, which limits the wider use of reference-guided assembly algorithms and tools. Reducing misassemblies in final genome assemblies is therefore an important problem for genomic studies. We optimized the in silico mate-pair method by searching for conserved in silico mate pairs that reduce final misassemblies, under the assumption that conserved mate pairs would display more consistent synteny in the target species. We found that using three or more references (conserved at the family or order level) reduced the number of misassemblies dramatically, but only by sacrificing contiguity and the number of complete genes. However, using two references from the same genus as the target species balanced the contiguity, accuracy, and gene completeness of the final assemblies. By contrast, the original in silico mate-pair method using one reference resulted in more complete genes but also in more misassemblies. Closer examination of these extra genes indicated that many did not exist in the “true” genome or were erroneous.
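The idea of retaining only conserved mate pairs can be illustrated with a short sketch. The criterion used here is an assumption made purely for illustration: a pair counts as conserved when both of its reads are identical in the libraries derived from the required number of references. The function name `conserved_mate_pairs`, the `min_references` parameter, and the reference labels are hypothetical, and the actual criterion used by the method may differ.

```python
from collections import Counter


def conserved_mate_pairs(pairs_by_reference, min_references=None):
    """Return mate pairs supported by at least `min_references` references.

    `pairs_by_reference` maps a reference label to an iterable of
    (read1, read2) sequence tuples. ASSUMPTION: a pair is "conserved"
    when the identical tuple occurs in enough references; this is a
    simplification, not the published criterion.
    """
    counts = Counter()
    for pairs in pairs_by_reference.values():
        # Count each pair at most once per reference.
        counts.update(set(pairs))
    needed = len(pairs_by_reference) if min_references is None else min_references
    return {pair for pair, n in counts.items() if n >= needed}


if __name__ == "__main__":
    # Hypothetical toy libraries from two references.
    libraries = {
        "ref_a": [("ACGT", "TTGA"), ("GGCA", "CATG")],
        "ref_b": [("ACGT", "TTGA"), ("GGCA", "CCCC")],
    }
    print(conserved_mate_pairs(libraries))
```

Requiring support from more references makes the filter stricter, which mirrors the trade-off described above: fewer conflicting links and hence fewer misassemblies, but also fewer links available to join contigs.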
The amount of sequence data from aDNA samples has grown steadily since high-throughput sequencing was first applied to ancient human remains (Rasmussen et al., 2010), with over 2000 ancient samples now recorded (Brunson & Reich, 2019). In addition to the limitations of aDNA sequences, such as read length and contamination, data processing and analysis algorithms lag behind current sequencing speeds and costs. This impedes paleogenomics, particularly the recovery of full nuclear genomes. Genome assembly from aDNA data relies on the alignment of sequencing reads to a linear reference genome, which selects for endogenous DNA sequences. Thus, we simulated aDNA sequences and used these for genome assembly via different methods. The results suggested that the optimized in silico mate-pair method performed better than the use of aDNA reads alone or the original in silico mate-pair method. It was also more accurate than the assembler RaGOO, possibly because RaGOO is designed around a single reference.
Scaffolding with in silico mate pairs is a simple method that incorporates long-range distance information from a reference genome into a de novo genome assembly through the generation of in silico mate-pair libraries. It is essentially a novel reference-guided approach, since other chromosome scaffolders, such as Chromosomer (Tamazian et al., 2016), MeDuSa (Bosi et al., 2015), AlignGraph (Bao et al., 2014), and RaGOO (Alonge et al., 2019), exploit distance information from the genome of a closely related organism to order and extend scaffolds or contigs after the de novo assembly process. By contrast, in silico mate-pair libraries capture distance information before the assembly process and can be used with any genome assembler that accepts mate-pair sequences as input. The contiguity of a genome assembly may be improved by applying in silico methods or other reference-guided approaches. However, some reference-guided scaffolders rely heavily on paired-end or long-read information, making them unsuitable for single-end reads. In addition, a large proportion of these reference-guided scaffolders are designed around a single reference, resulting in many misassemblies in the draft genomes. Finally, all reference-guided genome assemblers and scaffolders share one limitation: only the regions conserved between the target species and the references are resolved, while the sequence between the conserved regions remains unknown.
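To make the idea of an in silico mate-pair library concrete, the following minimal sketch cuts read pairs from a reference sequence at a fixed insert size. It is an illustration under stated assumptions, not the published pipeline: the function names, the `step` parameter, the forward-reverse orientation of the two reads, and the toy sequence are all hypothetical, and a given assembler may expect a different orientation or a different file format.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def in_silico_mate_pairs(reference, read_len, insert_size, step):
    """Yield (read1, read2) pairs taken from both ends of reference windows.

    Each window spans `insert_size` bases; read1 is the first `read_len`
    bases and read2 is the reverse complement of the last `read_len`
    bases (ASSUMED forward-reverse orientation). Windows whose reads
    contain N are skipped because they carry no usable sequence.
    """
    if 2 * read_len > insert_size:
        raise ValueError("insert_size must be at least twice read_len")
    if step < 1:
        raise ValueError("step must be a positive integer")
    for start in range(0, len(reference) - insert_size + 1, step):
        fragment = reference[start:start + insert_size]
        left, right = fragment[:read_len], fragment[-read_len:]
        if "N" in left or "N" in right:
            continue
        yield left, reverse_complement(right)


if __name__ == "__main__":
    # Toy reference (hypothetical); real libraries would use kilobase-scale inserts.
    toy_reference = "AAAACCCCGGGGTTTT"
    for pair in in_silico_mate_pairs(toy_reference, read_len=4, insert_size=12, step=4):
        print(pair)
```

Because the pairs are derived before assembly, they can be written out as ordinary read files and passed to an assembler alongside the target reads, which is what distinguishes this approach from post-assembly scaffolders.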