Genotyping
To obtain genomic data consistent with our previous study, we processed the samples as described in Zhang et al.(Zhang et al., 2018). After double digestion with restriction enzymes MseI and HaeIII , all DNA samples were genotyped by high-throughput sequencing using an Illumina HiSeq 4000 sequencer (Illumina, San Diego, CA, USA) and the protocol provided by the manufacturer. To improve mapping, in-house scripts were used to remove low-quality reads from the data set. Reads were excluded if they (i) contained adapter sequences, (ii) if ≥10% of nucleotides were unidentified (N), or (iii) if>50% bases had low phred quality scores (<5). The remaining high quality paired-end reads were mapped to the Gallus gallus 5.0 reference genome using the Burrows-Wheeler Alignment tool (BWA) (v0.7.8) (Li and Durbin, 2009) with default parameters. PCR duplicates were removed using SAMtools rmdup (v1.3.1) (Li et al., 2009).
The aligned BAM files for the 361 chickens were used to detect variants at the population scale using the SAMtools suite (v1.3.1), including BCFtools, with parameters as described in our previous study (Zhang et al., 2018). Single nucleotide variants (SNVs) within 5 bp of an indel were removed. SNPs and INDELs were annotated with ANNOVAR v2013– 08-23 (ANNOVAR, RRID:SCR 012821) (Wang et al., 2010), using gene annotations obtained from the Ensembl database . In the annotation step, SNPs and INDELs were classified into eight categories based on genomic locations, including exonic regions (synonymous, nonsynonymous, stop gain, and stop loss), splicing sites, intronic regions, 5’ and 3’ UTRs, upstream and downstream regions, and intergenic regions. The dbSNP database was used to identify novel genetic variations.
SNPs located on non-chromosomes were removed. Data were also excluded if they met any of the following four criteria: (1) individuals with missing genotype data for more than 5% of the typed SNPs (call rate ≥ 0.95). (2) variants with missing call rates ≤ 0.01. (3) SNPs with very low minor allele frequencies (MAF ≥ 0.01). (4) SNPs with frequencies that deviate significantly from Hardy-Weinberg Equilibrium (P-value > 10e-6). Removal of low-quality SNPs helped to avoid false-positives and also enhanced the ability to identify loci associated with traits and estimate effective genomic diversity.
Population structure analysis The neighbor-joining (NJ) tree was constructed using the neighbor-joining method in MEGA v7.0 (Kumar et al., 2016) and was visualized in FigTree v1.4.4 (Rambaut, 2018). Population stratification was analyzed by complete linkage clustering of individuals using genome-wide SNP data in PLINK (Purcell et al., 2007). A principal component analysis (PCA) (Price et al., 2006) was conducted using PLINK, and scatter plots were generated using R v3.5.3 . Population structure was analyzed using ADMIXTURE v1.3.0 (Alexander et al., 2009), which applies a likelihood model-to large whole-genome SNP genotype datasets. The number of populations (K) varied from K = 2 to 9 to obtain the maximum likelihood estimates for inference of population structure. Cross-validation was performed to provide a low cross-validation error, which made the optimal K value more apparent. The parameter standard errors were estimated using 1000 bootstrap replicates. The cross-validation plot was generated using R v3.5.3 .
Genomic diversity assessment within populations Allelic richness (Ar), proportion of polymorphic markers (Pn), expected heterozygosity (He), and observed heterozygosity (Ho) were used to investigate genome-wide genomic diversity within 12 subpopulations. Ar was calculated using ADZE v.1.0 (Szpiech et al., 2008). Pn, He, and Ho were calculated using PLINK v1.9 (Purcell et al., 2007).
Evaluation of inbreeding coefficient (F) Two metrics were used to estimate levels of inbreeding in the conserved chicken populations.
The FES inbreeding coefficient is based on the mating system. The relative change in average inbreeding (∆F) was obtained by linear regression of the average annual inbreeding coefficient over time, Ft = 1- (1-∆F)t, where t represents the generation. The increment of hypothetical inbreeding (∆F) is different for different conservation retention modes. For random mating, random selection\(F=\frac{1}{8Nm}+\frac{1}{8Nf}\), and for random mating within families\(,\ F=\frac{3}{32Nm}+\frac{1}{32Nf}\), where Nf and Nm represents number of dams and sires, respectively.
The FROH inbreeding coefficient is based on runs of homozygosity (ROH). The FROH statistic, introduced by McQuillan et al. (McQuillan et al., 2008), was calculated as follows: FROH = LROH/LAUT, where LROH is the total length of all ROH in the genome of an individual, and LAUT is the specific length of the autosomal genome covered by SNPs.
Calculation of nucleotide diversity The nucleotide diversity (π) for each population was calculated using VCFtools v0.1.14 (Danecek et al., 2011), based on whole genome SNPs.
Linkage disequilibrium decay Genome-wide LD was evaluated between in situ and ex situ groups. The average LD of a pair of SNPs in a 300 kb sliding window was estimated using Haploview (Barrett et al., 2004), and the LD decay curves were generated using R v.3.5.3 and Adobe Illustrator CC 2018.
Estimation of population differentiation using FSTThe fixation index (FST ), a measure of population differentiation and population structure (Weir and Cockerham, 1984), was estimated using VCFtools v0.1.14 (Danecek et al., 2011) with a 100 kb window and 10 kb step size.
Effective population size Different methods can be used to compute the effective population size (Ne). When based on the number of parents, Ne was calculated with the formula Ne\(=\frac{4NmNf}{Nm+Nf}\), where Nf and Nm represent number of dams and sires, respectively (Groeneveld et al., 2009). Ne can also be based on the rate of inbreeding using the formula Ne\(=\frac{1}{2*F}\)(Groeneveld et al., 2009). Here, we used NeEstimator v.2.01(Do et al., 2014a) to implement the linkage disequilibrium (LD) approach of Waples and Do (Waples and Do, 2008) to estimate effective population size. Ne estimates for each subpopulation were calculated as the average of the estimates for macrochromosomes (gga1-gga5) (Axelsson et al., 2005).
Runs of homozygosity (ROH) To investigate recent inbreeding and the distribution of homozygosity, we identified ROH based on the autosomal SNPs using PLINK v1.9 (Purcell et al., 2007). The analysis was conducted using the default parameter -homozygosity , and the following criteria were also selected: (1) a sliding window of 50 SNPs across the genome; (2) one heterozygous and five missing calls were allowed per window to account for genotyping error; (3) the minimum number of consecutive SNPs included in a run of homozygosity was set to 50 and the minimum length for a run was set to 100 kb; (4) the required minimum SNP density to define a run was 1 SNP per 50 kb; (5) the maximum distance between two consecutive SNPs in a run was 1000 kb (Zhang et al., 2018).
Differences in genome-wide homozygosity between in situ andex situ populations were tested for statistical significance with three measures: numbers of runs of homozygosity (NSEG); total length of runs (KB); average length of runs (KBAVG).
Adaptation analysis To analyze the underlying genetic mechanisms of adaptation between the in situ and ex situconserved populations, we employed multiple statistical tests to identify genomic regions harboring footprints of positive selection between the groups. We used FST (Akey et al., 2002; Holsinger and Weir, 2009; WRIGHT, 1949), Pi (nucleotide diversity) (Nei and Li, 1979; Wang et al., 2016), and XP-EHH (cross-population extended haplotype homozygosity) (Sabeti et al., 2007). A sliding window approach (100 kb windows sliding in 10 kb steps) was applied to quantify the polymorphism levels, using pairwise nucleotide variation as a measure of variability (θπ) and genetic differentiation (FST ) between populations. Genome signatures with significantly high FST values corresponding to the top 5% of values), and θπ ratios in the top 5% of values (θπ,in situ /θπ, ex situ ) were classified as extensively diversified. XP-EHH scores were calculated with Selscan (Szpiech and Hernandez, 2014) with default parameters to compare whole genome SNPs in all three chicken breeds between in situ and ex situconserved populations.The scores for each SNP were then frequency-normalized over all chromosomes using the script norm, provided with Selscan.
Breed characteristics Variants and genes that underlie phenotypic changes in the domestic chicken likely evolved rapidly after domestication. Based on the genomic variation data obtained in this study, we identified regions that may have evolved rapidly in domestic chickens, and thus might have contributed to trait differences amongst breeds. Comparing these regions with data from the Chicken QTL database, we obtained candidate genomic regions, identified as “genomic conservation units”, and used them as markers for breed-specific characteristics.
Typically, regions or loci that have evolved rapidly and have experienced selection exhibit specific signatures of variation, including high population differentiation, significantly reduced nucleotide diversity levels, and long-range haplotype homozygosity (Sabeti et al., 2006). To detect such regions and identify the “genomic conservation unit” for a given chicken breed, we first calculatedFST values to measure population differentiation using a non-overlapping window approach with VCFtools (Danecek et al., 2011). Then, we calculated the statistic\(di=\sum_{j\neq i}^{\infty}\left(\frac{\text{FST}^{\text{ij}}-E[\text{FST}^{\text{ij}}]}{sd[\text{FST}^{\text{ij}}]}\right)\)(Akey et al., 2010)
for each SNP, where \(\text{FST}^{\text{ij}}\) and\(sd[\text{FST}^{\text{ij}}\)] represent the expected value and standard deviation of FST between breeds i and j calculated from 28 autosomes. Finally, Di was averaged over SNVs in non-overlapping 100 kb windows, and we empirically selected the significantly high FST values (the most extreme 5% occupying the right tail) as candidate signals. As a further measure of selection, Pi was calculated for each population with non-overlapping 100 kb windows using VCFTools. Then, the Pi ratio for the three breeds was calculated by the formula: Pi ratio = log2(Pi1/ (Pi2 + Pi3)/2).
Genome Annotation and Functional enrichment analysis We used the Ensembl Gallus gallus BioMart webtools to retrieve genes associated with selected genomic regions identified using the methods described above. Retrieved regions were compared to the Animal QTL Database (Hu et al., 2013) (http://www.animalgenome.org/QTLdb) to identify candidate regions or genes associated with interesting phenotypic or economic traits. Functional enrichment analyses for Gene Ontology (GO) terms, metabolic pathways, and InterPro domains were performed using the R “clusterprofiler” package (Yu et al., 2012). All chicken genes that were annotated in Ensembl were used as a background set. P values (i.e., EASEscore), indicating significance of the overlap between various gene sets, were calculated using Benjamini-corrected modified Fisher’s exact test. Only terms with a P value less than 0.05 were considered as significant.