Genotyping
To obtain genomic data consistent with our previous study, we processed
the samples as described in Zhang et al.(Zhang et al., 2018). After
double digestion with restriction enzymes MseI and HaeIII ,
all DNA samples were genotyped by high-throughput sequencing using an
Illumina HiSeq 4000 sequencer (Illumina, San Diego, CA, USA) and the
protocol provided by the manufacturer. To improve mapping, in-house
scripts were used to remove low-quality reads from the data set. Reads
were excluded if they (i) contained adapter sequences, (ii) if ≥10% of
nucleotides were unidentified (N), or (iii) if>50% bases had low phred
quality scores (<5). The remaining high quality paired-end reads were
mapped to the Gallus gallus 5.0 reference genome using the
Burrows-Wheeler Alignment tool (BWA) (v0.7.8) (Li and Durbin, 2009) with
default parameters. PCR duplicates were removed using
SAMtools rmdup (v1.3.1) (Li et al.,
2009).
The aligned BAM files for the 361 chickens were used to detect variants
at the population scale using the SAMtools suite (v1.3.1), including
BCFtools, with parameters as described in our previous study (Zhang et
al., 2018). Single nucleotide variants (SNVs) within 5 bp of an indel
were removed. SNPs and INDELs were annotated with ANNOVAR v2013– 08-23
(ANNOVAR, RRID:SCR 012821) (Wang et al., 2010), using gene annotations
obtained from the Ensembl database . In the annotation step, SNPs and
INDELs were classified into eight categories based on genomic locations,
including exonic regions (synonymous, nonsynonymous, stop gain, and
stop loss), splicing sites, intronic
regions, 5’ and 3’ UTRs, upstream and downstream regions, and intergenic
regions. The dbSNP database was used
to identify novel genetic variations.
SNPs located on non-chromosomes were removed. Data were also excluded if
they met any of the following four criteria: (1) individuals with
missing genotype data for more than 5% of the typed SNPs (call rate ≥
0.95). (2) variants with missing call rates ≤ 0.01. (3) SNPs with very
low minor allele frequencies (MAF ≥ 0.01). (4) SNPs with frequencies
that deviate significantly from Hardy-Weinberg Equilibrium (P-value
> 10e-6). Removal of low-quality SNPs helped to avoid
false-positives and also enhanced the ability to identify loci
associated with traits and estimate effective genomic diversity.
Population structure analysis The neighbor-joining (NJ)
tree was constructed using the neighbor-joining method in MEGA v7.0
(Kumar et al., 2016) and was visualized in FigTree v1.4.4 (Rambaut,
2018). Population stratification was analyzed by complete linkage
clustering of individuals using genome-wide SNP data in PLINK (Purcell
et al., 2007). A principal component analysis (PCA) (Price et al., 2006)
was conducted using PLINK, and
scatter
plots were generated using R v3.5.3 . Population structure was analyzed
using
ADMIXTURE
v1.3.0 (Alexander et al., 2009),
which
applies a likelihood model-to large whole-genome SNP genotype datasets.
The number of populations (K) varied from K = 2 to 9 to obtain the
maximum likelihood estimates for inference of population structure.
Cross-validation was performed to provide a low cross-validation error,
which made the optimal K value more apparent. The parameter standard
errors were estimated using 1000 bootstrap replicates. The
cross-validation plot was generated using R v3.5.3 .
Genomic diversity assessment within populations Allelic
richness (Ar), proportion of polymorphic markers (Pn), expected
heterozygosity (He), and observed heterozygosity (Ho) were used to
investigate genome-wide genomic diversity within 12 subpopulations. Ar
was calculated using ADZE v.1.0 (Szpiech et al., 2008). Pn, He, and Ho
were calculated using PLINK
v1.9
(Purcell et al., 2007).
Evaluation of inbreeding coefficient (F) Two metrics
were used to estimate levels of inbreeding in the conserved chicken
populations.
The FES inbreeding coefficient is based on the
mating system. The relative change in average inbreeding
(∆F)
was obtained by linear regression of the average annual inbreeding
coefficient over time, Ft = 1- (1-∆F)t, where t
represents the generation.
The
increment of hypothetical inbreeding (∆F) is different for different
conservation retention modes. For random mating, random selection\(F=\frac{1}{8Nm}+\frac{1}{8Nf}\),
and for random mating within
families\(,\ F=\frac{3}{32Nm}+\frac{1}{32Nf}\), where Nf and Nm
represents number of dams and sires, respectively.
The FROH inbreeding coefficient is based on runs
of homozygosity (ROH). The FROH statistic,
introduced by McQuillan et al. (McQuillan et al., 2008), was calculated
as follows: FROH =
LROH/LAUT, where LROH is
the total length of all ROH in the genome of an individual, and
LAUT is the specific length of the autosomal genome
covered by SNPs.
Calculation
of nucleotide diversity The nucleotide diversity (π) for each
population was calculated using VCFtools v0.1.14 (Danecek et al., 2011),
based on whole genome SNPs.
Linkage disequilibrium decay Genome-wide LD was
evaluated between in situ and ex situ groups. The average
LD of a pair of SNPs in a 300 kb sliding window was estimated using
Haploview (Barrett et al., 2004), and the LD decay curves were generated
using R v.3.5.3 and Adobe Illustrator CC 2018.
Estimation of population differentiation using
FSTThe
fixation index (FST ), a measure of population
differentiation and population structure (Weir and Cockerham, 1984), was
estimated using VCFtools
v0.1.14
(Danecek et al., 2011) with a 100 kb window and 10 kb step size.
Effective population size Different methods can be used
to compute the effective population size (Ne). When based on the number
of parents, Ne was calculated with the formula
Ne\(=\frac{4NmNf}{Nm+Nf}\), where
Nf and Nm represent number of dams
and sires, respectively (Groeneveld et al., 2009). Ne can also be based
on the rate of inbreeding using the formula Ne\(=\frac{1}{2*F}\)(Groeneveld et al., 2009). Here, we used NeEstimator v.2.01(Do et al.,
2014a) to implement the linkage disequilibrium (LD) approach of
Waples
and Do (Waples and Do, 2008) to estimate effective population size. Ne
estimates for each subpopulation were calculated as the average of the
estimates for macrochromosomes (gga1-gga5) (Axelsson et al., 2005).
Runs of homozygosity (ROH) To investigate recent
inbreeding and the distribution of homozygosity, we identified ROH based
on the autosomal SNPs using PLINK v1.9 (Purcell et al., 2007). The
analysis was conducted using the default parameter -homozygosity ,
and the following criteria were also selected: (1) a sliding window of
50 SNPs across the genome; (2) one heterozygous and five missing calls
were allowed per window to account for genotyping error; (3) the minimum
number of consecutive SNPs included in a run of homozygosity was set to
50 and the minimum length for a run was set to 100 kb; (4) the required
minimum SNP density to define a run was 1 SNP per 50 kb; (5) the maximum
distance between two consecutive SNPs in a run was 1000 kb (Zhang et
al., 2018).
Differences in genome-wide homozygosity between in situ andex situ populations were tested for statistical significance with
three measures: numbers of runs of homozygosity (NSEG); total length of
runs (KB); average length of runs (KBAVG).
Adaptation analysis To analyze the underlying genetic
mechanisms of adaptation between the in situ and ex situconserved populations, we employed multiple statistical tests to
identify genomic regions harboring footprints of positive selection
between the groups. We used FST (Akey et al.,
2002; Holsinger and Weir, 2009; WRIGHT, 1949), Pi (nucleotide diversity)
(Nei and Li, 1979; Wang et al., 2016), and XP-EHH (cross-population
extended haplotype homozygosity) (Sabeti et al., 2007). A sliding window
approach (100 kb windows sliding in 10 kb steps) was applied to quantify
the polymorphism levels, using pairwise nucleotide variation as a
measure of variability (θπ) and genetic differentiation
(FST ) between populations. Genome signatures with
significantly high FST values corresponding to
the top 5% of values), and θπ ratios in the top 5% of values
(θπ,in situ /θπ, ex situ ) were classified as extensively
diversified. XP-EHH scores were calculated with Selscan (Szpiech and
Hernandez, 2014) with default parameters to compare whole genome SNPs in
all three chicken breeds between in situ and ex situconserved populations.The scores for each SNP were then
frequency-normalized over all chromosomes using the script norm,
provided with Selscan.
Breed characteristics Variants and genes that underlie
phenotypic changes in the domestic chicken likely evolved rapidly after
domestication. Based on the genomic variation data obtained in this
study, we identified regions that may have evolved rapidly in domestic
chickens, and thus might have contributed to trait differences amongst
breeds. Comparing these regions with data from the Chicken QTL database,
we obtained candidate genomic regions, identified as “genomic
conservation units”, and used them as markers for breed-specific
characteristics.
Typically, regions or loci that have evolved rapidly and have
experienced selection exhibit specific signatures of variation,
including high population differentiation, significantly reduced
nucleotide diversity levels, and long-range haplotype homozygosity
(Sabeti et al., 2006). To detect such regions and identify the “genomic
conservation unit” for a given chicken breed, we first calculatedFST values to measure population differentiation
using a non-overlapping window approach with VCFtools (Danecek et al.,
2011). Then, we calculated the
statistic\(di=\sum_{j\neq i}^{\infty}\left(\frac{\text{FST}^{\text{ij}}-E[\text{FST}^{\text{ij}}]}{sd[\text{FST}^{\text{ij}}]}\right)\)(Akey et al., 2010)
for each SNP, where \(\text{FST}^{\text{ij}}\) and\(sd[\text{FST}^{\text{ij}}\)] represent the expected value and
standard deviation of FST between breeds i and j
calculated from 28 autosomes. Finally, Di was averaged over SNVs in
non-overlapping 100 kb windows, and we empirically selected the
significantly high FST values (the most extreme
5% occupying the right tail) as candidate signals. As a further measure
of selection, Pi was calculated for each population with non-overlapping
100 kb windows using VCFTools. Then,
the
Pi ratio for the three breeds was calculated by the
formula:
Pi
ratio = log2(Pi1/ (Pi2 + Pi3)/2).
Genome Annotation and Functional enrichment analysis We
used the Ensembl Gallus gallus BioMart webtools to retrieve genes
associated with selected genomic regions identified using the methods
described above. Retrieved regions were compared to the Animal QTL
Database (Hu et al., 2013)
(http://www.animalgenome.org/QTLdb)
to identify candidate regions or genes associated with interesting
phenotypic or economic traits. Functional enrichment analyses for Gene
Ontology (GO) terms, metabolic pathways, and InterPro domains were
performed using the R “clusterprofiler” package (Yu et al., 2012). All
chicken genes that were annotated in Ensembl were used as a background
set. P values (i.e., EASEscore), indicating significance of the overlap
between various gene sets, were calculated using Benjamini-corrected
modified Fisher’s exact test. Only terms with a P value less than 0.05
were considered as significant.