2.2 Reference genome sequencing and assembly
We used a combination of long-fragment sequencing, short-insert library
sequencing for error correction and gapfilling, and chromatin
conformation capture (Hi-C) to generate chromosome-level semelparous
mammal reference genomes. High-molecular-weight (HMW) DNA extracted from
the testis of the A. flavipes individual ‘AdamAnt’ was used to
generate long-read (PacBio) sequencing data by Annoroad Gene Technology
(Beijing, China). Paired-end (2 × 100 bp) BGISEQ-500 data were generated
from cerebrum, liver, heart, and lung tissue from the same individual by
BGI-Qingdao. A total of 323.85 Gb (~100×) of A.
flavipes PacBio reads were error-corrected using the correction module
of Canu v1.7 (Koren et al., 2017). The corrected subreads were used
for initial draft assembly with Wtdbg2 v1.2.8 (Ruan & Li, 2020). To
reduce base errors, the assembly was ‘polished’ with Pilon v1.23
(Walker et al., 2014) using 151.43 Gb (~50×) of 100 bp paired-end
BGISEQ-500 reads, which were mapped to the initial PacBio assembly using
Minimap2 v2.10 (Li, 2018) and SAMtools v1.9 (Li et al., 2009).
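The approximate sequencing depths quoted above follow from dividing the total sequenced bases by the estimated genome size. The minimal sketch below reproduces that arithmetic, assuming the 3.2 Gb k-mer-based genome size estimate reported in this section; the function name `sequencing_depth` is hypothetical and is used only for illustration.

```python
def sequencing_depth(total_bases_gb: float, genome_size_gb: float) -> float:
    """Return the average sequencing depth (fold coverage).

    Depth is the total number of sequenced bases divided by the genome size.
    Both values must be in the same unit (here, gigabases).
    """
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb / genome_size_gb


# Values taken from the text; 3.2 Gb is the k-mer-based genome size estimate.
GENOME_SIZE_GB = 3.2
pacbio_depth = sequencing_depth(323.85, GENOME_SIZE_GB)
short_read_depth = sequencing_depth(151.43, GENOME_SIZE_GB)

print(f"PacBio depth:     {pacbio_depth:.1f}x")
print(f"Short-read depth: {short_read_depth:.1f}x")
```

Under these assumptions the depths come out at roughly 101× and 47×, consistent with the rounded ~100× and ~50× figures.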
Genome sizes were estimated by k-mer frequency analysis (Liu et
al., 2013). Briefly, 100 bp paired-end WGS reads were used as input to
GCE (Genomic Character Estimator) v1.0.0 (Marcais & Kingsford,
2011) to obtain the k-mer frequency distribution, and the genome size was
estimated using the equation ‘Genome size = k-mer number / k-mer depth’,
where ‘k-mer number’ is the total number of k-mers and ‘k-mer depth’
is the peak depth, that is, the depth at which the largest number of
k-mers occurs. Genome length was
estimated on the basis of total scaffold length of the assembly. Using
the frequency distribution of 17-mers of short paired-end reads
(Figure S1), the A. flavipes genome was estimated to be
3.2 Gb.
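The k-mer equation above can be illustrated with a minimal sketch. It is not the GCE implementation. It assumes a k-mer histogram given as a mapping from depth to the number of distinct k-mers observed at that depth, and it uses a hypothetical `min_depth` cut-off to exclude low-depth k-mers that usually come from sequencing errors. The toy histogram is synthetic and is not data from this study.

```python
def estimate_genome_size(histogram: dict[int, int], min_depth: int = 5) -> tuple[float, int]:
    """Estimate genome size as (k-mer number) / (k-mer depth).

    histogram maps a depth to the number of distinct k-mers seen at that depth.
    Depths below min_depth are ignored (assumed to be error k-mers).
    Returns (estimated genome size, peak depth).
    """
    usable = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # k-mer depth: the depth at which the largest number of distinct k-mers occurs.
    peak_depth = max(usable, key=usable.get)
    # k-mer number: the total number of k-mer occurrences (depth x distinct count).
    kmer_number = sum(depth * count for depth, count in usable.items())
    return kmer_number / peak_depth, peak_depth


# Synthetic toy histogram: an error peak at depth 1-2 and a main peak at depth 30.
toy_histogram = {1: 5000, 2: 800, 28: 50, 29: 80, 30: 100, 31: 80, 32: 50}
genome_size, peak = estimate_genome_size(toy_histogram)
print(f"peak depth = {peak}, estimated genome size = {genome_size:.0f} bp")
```

In practice the histogram would come from a k-mer counter run on the short reads. Dedicated tools such as GCE also model heterozygosity and repeats, which this sketch ignores.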
Assembly quality was assessed using BUSCO (Benchmarking Universal
Single-Copy Orthologs) v5.0.0_cv1 (Seppey, Manni, & Zdobnov, 2019),
employing the gene predictor AUGUSTUS v3.2.1 (Stanke & Waack, 2003) and
the 9,226-gene BUSCO mammalian lineage data set (mammalia_odb10).
Although gene-centric, the BUSCO score is a good predictor of genome
completeness (Seppey et al., 2019).