3 Material and methods
Insect strain
The R. pedestris bean bugs sequenced in this study were originally collected in a soybeanZhengqing field in Suzhou, China in 2019. The insects were reared on soybean plants (strain: Wandou 27) at 26 ± 0.5 °C with humidity of 50 ± 5% under a 16/8 h (light/dark) photoperiod. Prior to DNA/RNA extraction, the colony was inbred following a sib-sib mating protocol for three generations as previously described (Xue et al. 2015).
Genome sequencing and assembly
Genomic DNA was isolated from three virgin female adults using the DNeasy Blood & Tissue Extraction Kit (Qiagen Inc., Valencia, CA, USA) based on the manufacturer’s protocol. After the determination of the DNA quality and quantity, a paired-end sequencing library (150 bp in length) was constructed and sequenced using the Illumina NovaSeq6000. In addition, a 20 kb library was constructed and sequenced on 17 cells using the SequelTM Sequencing Plate 1.2 on the Pacific Biosciences Sequel platform at Berry Genomic Corporation in Beijing, China.
De novo genome assembly was performed using Canu-SMARTdenovo methods (Schmidt et al. 2017). Correction was performed by Canu with the default parameters (Koren et al. 2017). The resulting reads were assembled using SMARTdenovo (-k 21 -J 3000 -t 20) (Hu et al. 2020). Nextpolish was used for genome polishing (Hu et al. 2020).
Hi-C sequencing and chromosome-length genome assembly
Hi-C libraries were constructed from a pool of fifty 2nd instar nymphs according to the method previously described (Lieberman-Aiden et al. 2009). Briefly, the samples were fixed with 1% formaldehyde for 10 min at room temperature. Then, a 2.5 M-glycine solution was added to obtain a final concentration of 0.2 M. The sample was incubated for 5 min at room temperature and then on ice for 15 min to stop cross-linking completely. After centrifugation, the cross-linked sample was sent to Annoroad Gene Technology Co. Ltd, Beijing, China for sequencing.
The genomic contigs were scaffolded using the 3D de novo assembly (3D-DNA) pipeline (Dudchenko et al. 2017). Briefly, the Hi-C reads were aligned to the draft genome assembly using Juicer; a 3D-DNA analysis was run to generate a candidate assembly; the candidate assembly was reviewed using Juicebox Assembly Tools (JBAT); and a high-quality chromosome-length genome assembly was generated after JBAT review.
Chromosomic staining
The salivary glands were dissected from male and female R. pedestris in a phosphate-buffered saline (PBS) solution (137 mM NaCl, 2.68 mM KCl, 8.1 mM Na2HPO4 and 1.47 mM KH2PO4 at pH 7.4) under a stereomicroscope (COIC, Chongqing, China). The samples were fixed in 4% paraformaldehyde (Sangon Biotech, Shanghai, China) for 1 h. After washing in PBS 3 times, the samples were triturated mechanically, and 20 μl DAPI solution (Abcam, Cambridge, USA) was added to stain the chromosomes. Fluorescence images were examined using a Leica confocal laser-scanning microscope (Leica, Heidelberg, Germany).
Transcriptomic sequencing
Third generation transcriptomic sequencing was performed by Oxford Nanopore technology (ONT). Four samples, including eggs, nymphs, male, and female R. pedestris were collected. Oxford PromethION 2D amplicon libraries were generated according to the Nanopore community protocol using library preparation kit SQK-LSK109 (Oxford Nanopore, Oxford, UK) and sequenced on R9 flow cells to generate fast5 files. All generated fast5 reads were then basecalled in Guppy v3.2.10 with the default options to yield fastq files (Bolognini et al. 2019). The fastq reads for each sample were filtered using Nanofilt v2.5.0 with options -l 100 -q 7 (De Coster et al. 2018). The full length reads were detected using Pychopper with arguments -b primer.fa -i raw.fq -t 200 -s 98 (https://github.com/nanoporetech/pychopper). Pinfish was run with default parameters, and finally yielded polished consensus reads. Second generation Illumina sequencing was performed as we previously described (Huang et al. 2020). Insect samples from 6 different tissues and 37 different development stages were analyzed.
Genome annotation
Genome annotation was performed using the LoReAn annotation pipeline (Cook et al., 2019). Briefly, the transcriptomic data generated above were aligned to the genome using the Program to Assemble Spliced Alignments (PASA) (Haas et al. 2008) and the Genomic Mapping and Alignment Program (GMAP) (Wu and Watanabe 2005). For the protein sequence map, GeneWise (Birney et al. 2004) was employed. For ab initio gene prediction, SNAP (Korf 2004), Augustus (Stanke et al. 2006), MAKER2 (Holt and Yandell 2011), GeneMark-ET (Lomsadze et al. 2014), and BRAKER1 (Hoff et al. 2016) were employed separately. The Evidence Modeler (EVM) (Haas et al. 2008) was used to combine the outputs of previous tools to generate the combined annotation model. PASA was used to update the gene models by identifying untranslated regions (UTRs) to generate a final annotation.
Sex chromosome identification
To identify the potential sex chromosome, paired-end sequencing libraries were constructed for male and female adult R. pedestris . The resulting DNA reads were separately mapped to the genomic scaffolds using Bowtie2 (Longmead and Salzberg 2012). Then, Sequence Alignment/Map tolls (SAMtolls) was used to remove the low-quality mapped reads, and the mapped reads per million (MRPM) of each chromosome in males and females were calculated (Li et al. 2009). The sex chromosomes were determined according to the female/male ratio of MRPM, as previously described (Ma et al. 2020).
Weighted gene co-expression network analysis (WGCNA)
The WGCNA was performed based on the WGCNA package in R (3.2.2.) (Langfelder and Horvath 2008). The parameters were set as follows: min module size = 50 and ME miss thread = 0.15. Only the enrichment index >0.85 was considered to be significantly clustered.
Phylogenetic analysis and divergence time estimation
Protein data from 12 hemipteran insects (Bemisia tabaci ,Acyrthosiphon pisum , Diaphorina citri , Homalodisca vitripennis , Laodelphax striatellus , Nilaparvata lugens ,Gerris buenoi , Halyomorpha halys , Oncopeltus fasciatus , Cimex lectularius , Rhodnius prolixus , andTriatoma rubrofasciata ) and Drosophila melanogaster were retrieved from the NCBI database, GigaDB database, VectorBase (Table S1). After filtering redundant alternative splicing events, the protein dataset containing non-redundant transcripts was used to find the homologous pairs of sequences by the all-vs-all BLASTp algorithm with an E-value of <1e-5. The BLASTp result was then converted into a normalized similarity matrix and processed using OrthoMCL v2.0.9 (Li et al., 2003). Protein families were identified by Markov chain clustering (MCL-14-137).
A Phylogenetic tree was constructed using single copy orthologs in each species (1:1:1 genes identified by OrthoMCL analysis), and D. melanogaster was used to root the tree. Sequence alignment was performed by MAFFT (v7.310) (Katoh and Standley 2013), and conserved amino-acid sites were identified by TrimAl (v1.2rev59) (Capella-Gutiérrez et al. 2009). We used Model founder in IQ-TREE (v1.6.10) to determine the best model, and the phylogenetic tree was constructed using the maximum likelihood method with 1000 bootstraps (Nguyen et al. 2015). The divergence time was estimated based on fossil collection records from the Paleobiology Database (www.paleobiology.org) using the MCMCtree program in PAML v4.9e (Yang 1997).