Reference genome
To create the chromosome-scale reference genome, blood from one adult female common tern was collected in 100% EtOH and stored at -80°C. Four sequence datasets were generated following the VGP v1.5 pipeline (Rhie et al. 2021): 67.91x Pacific Biosciences (PacBio) continuous long reads (CLR); 698.35x Bionano Genomics optical maps; 169.30x 10X Genomics linked-reads; and 79.62x Arima Hi-C Illumina reads.
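The coverage values above are sequencing depths, i.e. the total number of sequenced bases divided by the genome size. A minimal sketch of that arithmetic follows; the function names and the numbers in the usage example are hypothetical and are not taken from this study.

```python
def sequencing_depth(total_bases: int, genome_size: int) -> float:
    """Return mean depth of coverage (the 'x' value) for one dataset.

    total_bases: sum of the lengths of all reads (or mapped molecules).
    genome_size: haploid genome or assembly size in bases.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def bases_needed(target_depth: float, genome_size: int) -> float:
    """Return the number of bases required to reach a target depth."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return target_depth * genome_size


if __name__ == "__main__":
    # Hypothetical example: 60 Gb of reads over a 1.2 Gb genome.
    print(sequencing_depth(60_000_000_000, 1_200_000_000))
```

The same ratio applies to each of the four data types, although for optical maps the numerator is the total length of labeled molecules rather than sequenced bases.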
Briefly, 30 µg of high-molecular-weight (HMW) DNA was isolated from the whole blood sample using the agarose plug protocol of the Bionano Prep Blood and Cell Culture DNA Isolation Kit (cat no. RE-130-10), modified for avian nucleated erythrocytes. Lysates were embedded into agarose plugs, followed by Proteinase K and RNase A treatments and 1X TE drop dialysis purification. To create the PacBio data, DNA was sheared to a fragment length of approximately 40 kb using a 26G blunt end needle (PacBio protocol PN 101-181-000 Version 05). We used 10 µg of this fragmented DNA to generate a large-insert PacBio library using the Pacific Biosciences Express Template Prep Kit v1.0 (#101-357-000). The library was then size selected (>15 kb) using the BluePippin system (Sage Science). The resulting PacBio library was sequenced on 10 PacBio 1M v3 SMRT Cells (#101-531-000) on a Sequel instrument with the sequencing kit 3.0 (#101-427-500), a 10-hour movie time, and a 2-hour pre-extension time. Unfragmented HMW DNA was used to generate a linked-read library on the 10X Genomics Chromium (Genome Library Kit & Gel Bead Kit v2 PN-120258, Genome Chip Kit v2 PN-120257, i7 Multiplex Kit PN-120262). We sequenced this 10X library on an Illumina NovaSeq S4 lane with 150 bp paired-end reads. Ultra-high-molecular-weight (uHMW) DNA was labeled for Bionano Genomics optical mapping using the Bionano Prep Direct Label and Stain (DLS) Protocol (30206E), and one flow cell was run on the Saphyr instrument. Hi-C libraries were generated with the Arima Genomics v1.0 2-enzyme protocol (P/N: A510008) according to the manufacturer’s instructions and sequenced on an Illumina HiSeq X.
The resulting four data types were processed using the VGP v1.5 pipeline (Rhie et al. 2021), which includes: assembling PacBio contigs using FALCON v2018.31.08-03.06 and FALCON-Unzip v6.0.0.47841; purging false haplotype duplications with purge_haplotigs v1.0.3+ 1.Nov. 2018; scaffolding with 10X linked reads using scaff10x v4.1.0; scaffolding with Bionano optical maps using Bionano Solve DLS v3.2.1; scaffolding with Hi-C data using Salsa HiC v2.2; filling gaps and polishing for base call accuracy with CLR reads using Arrow (smrtanalysis v6.0.0.47841); and polishing with Illumina short reads using longranger align v2.2.2 and freebayes v1.3.1. The resulting assembly was then manually curated to fix remaining errors, using gEVAL and Hi-C and linked-read short-read mapping profiles as described in Howe et al. (2021). BUSCO v4.1.4 with the bird lineage dataset (aves_odb10) was used to assess assembly completeness. The reference genome was submitted to NCBI under accession number GCA_009819605.1 as part of the Vertebrate Genomes Project (VGP) (https://vgp.github.io/genomeark). Because PacBio long reads can sequence through GC-rich regions, our reference genome includes GC-rich promoter regions (Rhie et al. 2021; Kim et al. 2021). In contrast, reference genomes assembled from short reads (e.g., Illumina reads) exhibit GC bias, whereby GC-rich regions such as promoters can be incorrectly assembled or missing entirely (Kim et al. 2021).
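To illustrate what "GC-rich" means in practice, the sketch below computes GC content in fixed windows along a sequence and flags windows above a cutoff. It is not part of the VGP pipeline or of the analyses in Kim et al. (2021); the function names, the window size, and the 0.6 threshold are assumptions chosen only for illustration.

```python
from typing import Iterator, List, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G/C among unambiguous A/C/G/T bases.

    Ambiguous bases such as N are ignored; returns 0.0 if none remain.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def gc_windows(
    seq: str, window: int = 1000, step: int = 1000
) -> Iterator[Tuple[int, int, float]]:
    """Yield (start, end, gc_fraction) for windows along seq.

    Coordinates are 0-based, end-exclusive. The final window may be
    shorter than `window` if the sequence length is not a multiple of it.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    for start in range(0, len(seq), step):
        end = min(start + window, len(seq))
        yield start, end, gc_fraction(seq[start:end])


def gc_rich_windows(
    seq: str, window: int = 1000, step: int = 1000, threshold: float = 0.6
) -> List[Tuple[int, int, float]]:
    """Return windows whose GC fraction is at or above `threshold`.

    The 0.6 default is a hypothetical cutoff, not a value from the source.
    """
    return [w for w in gc_windows(seq, window, step) if w[2] >= threshold]


if __name__ == "__main__":
    # Toy sequence: 20 AT-only bases followed by 20 GC-only bases.
    toy = "ATAT" * 5 + "GCGC" * 5
    for start, end, gc in gc_windows(toy, window=20, step=20):
        print(start, end, gc)
```

Running such a scan over promoter coordinates in a long-read and a short-read assembly of the same species would be one simple way to see where GC-rich sequence is absent from the short-read assembly.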