Reference genome
To create the chromosome-scale reference genome, blood of one adult
female common tern was collected in 100% EtOH and stored at -80°C. Four
sequence datasets were generated following the VGP 1.5 pipeline (Rhie et al. 2021): 67.91x Pacific Biosciences (PacBio) continuous long
reads (CLR); 698.35x Bionano Genomics optical maps; 169.30x 10X Genomics
linked-reads; and 79.62x Arima Hi-C Illumina reads.
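The fold-coverage figures above follow the usual definition of mean depth: total sequenced bases divided by genome size. A minimal sketch of that arithmetic is given below; the function names and the example numbers are hypothetical illustrations, not values from this study.

```python
def fold_coverage(total_bases: int, genome_size: int) -> float:
    """Mean depth of coverage: sequenced bases divided by genome size (bp)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def bases_needed(target_coverage: float, genome_size: int) -> float:
    """Total bases required to reach a target mean depth."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return target_coverage * genome_size


if __name__ == "__main__":
    # Hypothetical values for illustration only (not from this study).
    example_genome_size = 1_000_000_000   # 1 Gb genome
    example_total_bases = 50_000_000_000  # 50 Gb of sequence
    print(fold_coverage(example_total_bases, example_genome_size))
    print(bases_needed(60.0, example_genome_size))
```

The same ratio applies to each data type; only the unit being summed differs (read bases for sequencing data, molecule lengths for optical maps).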
Briefly, 30µg of High Molecular Weight DNA (HMW DNA) was isolated from
the whole blood sample using a modified (for avian nucleated
erythrocytes) agarose plug protocol of the Bionano Prep Blood and Cell
Culture DNA Isolation Kit (cat no. RE-130-10). Lysates were embedded
into agarose plugs, followed by Proteinase K and RNase A treatments and
1X TE drop dialysis purification. To generate the PacBio data, DNA was
sheared using a 26G blunt end needle (PacBio protocol PN 101-181-000
Version 05) to a fragment length of approximately 40 kb. We
used 10 µg of this fragmented DNA to generate a large-insert PacBio
library using the Pacific Biosciences Express Template Prep Kit v1.0
(#101-357-000). The library was then size selected (>15 kb)
using the BluePippin system (Sage Science). The resulting PacBio library
was sequenced on 10 PacBio 1M v3 SMRT Cells (#101-531-000) on a Sequel
instrument with sequencing kit 3.0 (#101-427-500), using a 10-hour
movie with a 2-hour pre-extension time. Unfragmented HMW DNA was used to
generate a linked-read library on the 10X Genomics Chromium (Genome
Library Kit & Gel Bead Kit v2 PN-120258, Genome Chip Kit v2 PN-120257,
i7 Multiplex Kit PN-120262). We sequenced this 10X library on an
Illumina NovaSeq S4 lane with 150 bp paired-end reads. Ultra-high
molecular weight (uHMW) DNA was labeled for Bionano Genomics optical
mapping using the Bionano Prep Direct Label and Stain (DLS) Protocol
(30206E), and one flow cell was run on the Saphyr instrument. Hi-C
libraries were generated with the Arima Genomics v1.0 2-enzyme protocol
(P/N: A510008), following the manufacturer’s instructions, and
sequenced on an Illumina HiSeq X.
The resulting four data types were processed using the VGP v1.5 pipeline
(Rhie et al. 2021), which includes: assembling PacBio contigs
using FALCON v2018.31.08-03.06 and FALCON-Unzip v6.0.0.47841;
purging false haplotype duplications with purge_haplotigs v1.0.3+
(1 Nov. 2018); scaffolding with 10X linked reads using scaff10x v4.1.0;
scaffolding with Bionano optical maps using Bionano Solve DLS v3.2.1;
scaffolding with Hi-C data using Salsa HiC v2.2; filling in gaps and
polishing for base call accuracy with CLR reads using Arrow
(smrtanalysis v6.0.0.47841); and polishing with Illumina short reads
using longranger align v2.2.2 and freebayes v1.3.1. The resulting
assembly was then manually curated to fix any errors, using gEVAL and
Hi-C short read linked-read mapping profiles as described in Howe et al.
(2021). BUSCO v4.1.4 with the bird lineage dataset (aves_odb10)
was used to assess assembly completeness. The reference genome was
submitted to NCBI under accession number GCA_009819605.1,
as part of the Vertebrate Genomes Project (VGP)
(https://vgp.github.io/genomeark). Our reference genome resolves
GC-rich promoter regions because PacBio long reads can sequence
through them (Rhie et al. 2021; Kim et al. 2021).
Reference genomes consisting of short-read assemblies (e.g., using
Illumina reads) exhibit GC bias, where GC-rich regions such as promoters
could be incorrectly assembled or even missing (Kim et al. 2021).
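To make the notion of GC-rich regions concrete, the sketch below computes GC fraction in fixed windows along a sequence, which is one simple way to locate stretches where GC bias might be expected. It is an illustrative helper, not part of the VGP pipeline; the function names, window size, and example sequence are assumptions chosen for the example.

```python
from typing import List, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G/C among unambiguous bases (A, C, G, T); 0.0 if none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def windowed_gc(seq: str, window: int, step: int) -> List[Tuple[int, float]]:
    """Return (start, GC fraction) for each full window along seq."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence for illustration: an AT-rich half followed by a GC-rich half.
    toy = "ATATATATGCGCGCGC"
    for start, gc in windowed_gc(toy, window=8, step=8):
        print(start, gc)
```

Windows with a GC fraction well above the genome-wide average are the kind of region that short-read assemblies tend to under-represent.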