Phylogenomic and population genomic analyses
Using both the probe and contig SNP datasets, we evaluated the
consistency of technical replicates in a variety of applications along
the phylogeography-phylogenetics continuum (Edwards et al. 2016). We
first explored the utility of replicates in a phylogenetic context,
expecting technical replicates of the same specimen to cluster together
in the phylogeny. We used IQ-TREE v.1.6.12 (Nguyen et al. 2015) to
construct a maximum likelihood phylogeny of concatenated SNPs with 100
ultrafast bootstrap replicates under the TVM+F+R4 model, which was the
best-fit model according to BIC in the IQ-TREE ModelFinder analysis.
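
The clustering expectation above can be illustrated with a minimal Python
sketch that checks whether a set of replicate tips forms one side of a
split in a Newick tree. This is not the procedure used in this study; the
toy tree, the tip names, and the helper names (parse_newick, clades,
replicates_cluster) are hypothetical, and the parser assumes simple
unquoted labels.

```python
import re

def parse_newick(text):
    """Parse a Newick string into nested lists of tip names.
    Branch lengths and internal support labels are discarded."""
    tokens = re.findall(r"[(),;]|[^(),;]+", text.strip())
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                          # consume "("
            children = [node()]
            while tokens[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1                          # consume ")"
            if pos < len(tokens) and tokens[pos] not in ("(", ")", ",", ";"):
                pos += 1                      # skip a label such as "98:0.05"
            return children
        label = tokens[pos].split(":")[0].strip()
        pos += 1
        return label

    return node()

def clades(tree):
    """Return (tips under this node, tip sets of every clade under it)."""
    if isinstance(tree, str):
        tip = frozenset([tree])
        return tip, [tip]
    tips, found = frozenset(), []
    for child in tree:
        child_tips, child_found = clades(child)
        tips |= child_tips
        found.extend(child_found)
    found.append(tips)
    return tips, found

def replicates_cluster(newick, replicate_names):
    """True if the replicates form one side of a split in the tree.
    The tree is treated as unrooted, so the complement is also checked."""
    all_tips, found = clades(parse_newick(newick))
    target = frozenset(replicate_names)
    return target in found or (all_tips - target) in found

# Hypothetical toy tree with two replicates each for specimens S1 and S2.
toy_tree = ("((S1_formalin:0.1,S1_pellet:0.1)98:0.05,"
            "(S2_formalin:0.1,S2_pellet:0.2)90:0.05,S3_pellet:0.3);")
print(replicates_cluster(toy_tree, ["S1_formalin", "S1_pellet"]))
print(replicates_cluster(toy_tree, ["S1_formalin", "S2_formalin"]))
```

Checking the complement as well as the tip set itself keeps the result
independent of where the tree happens to be rooted in the file.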
Using both datasets, we explored model-free population structure
estimation using principal component analysis (PCA) with the dudi.pca
function implemented in the R package ade4 v.1.7.15 (Dray & Dufour
2007) for all replicates and specimens. We also explored the impact of
samples with high levels of missing data on PC space by filtering the
contig and probe datasets to exclude replicates genotyped at <70% of
SNPs (n = 27). Finally, to explore the effect of unequal missingness
between capture-based replicates on PC space, we filtered the contig
dataset to exclude RADseq replicates and pruned SNPs missing from
>10% of samples, leaving 713 SNPs; we also excluded the supernatant
replicate of USNM 525151, which was an outlier in PC space (n = 25).
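
The two missing-data filters described above can be sketched in plain
Python as follows. This is an illustrative sketch rather than the code
used for the analyses: the genotype encoding (alternate-allele counts
0/1/2, with None for a missing call), the function names, and the toy
matrix are all assumptions made for the example.

```python
def missing_fraction(calls):
    """Fraction of missing genotype calls (None) in a list."""
    return sum(c is None for c in calls) / len(calls)

def drop_sparse_samples(matrix, max_missing=0.30):
    """Keep samples genotyped at >= 70% of SNPs (missing at <= 30%)."""
    return {name: calls for name, calls in matrix.items()
            if missing_fraction(calls) <= max_missing}

def drop_sparse_snps(matrix, max_missing=0.10):
    """Keep SNP columns that are missing from <= 10% of samples."""
    names = list(matrix)
    n_snps = len(matrix[names[0]])
    keep = [j for j in range(n_snps)
            if missing_fraction([matrix[s][j] for s in names]) <= max_missing]
    return {s: [matrix[s][j] for j in keep] for s in names}

# Hypothetical toy matrix: rows are replicates, columns are SNPs.
toy = {
    "A_formalin": [0, 1, 2, 0, 1],
    "A_pellet":   [0, 1, 2, None, 1],
    "B_formalin": [None, None, 2, None, 1],
    "B_pellet":   [0, 0, 2, 0, None],
}

by_sample = drop_sparse_samples(toy)      # removes the sparse replicate
by_snp = drop_sparse_snps(by_sample)      # then prunes incomplete SNPs
print(sorted(by_sample), len(by_snp["A_formalin"]))
```

The order of the two filters matters, because removing sparse samples
changes the per-SNP missingness that the second filter evaluates.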
Using both datasets, we compared estimates of nucleotide diversity
between technical replicates (capture-based formalin-fixed, supernatant,
and frozen pellet SNPs, and RADseq SNPs) following O’Connell et al.
(2020) by implementing permutation tests. We randomly subsampled the
data for each replicate type to a number of samples between a set
minimum and maximum using VCFtools and estimated nucleotide diversity
using the populations
module in STACKS v.2.54 (Catchen et al. 2013). We repeated this
procedure 100 times and plotted the distribution in R (R Core Team). We
tested minimum and maximum sample values of 2-10, 5-6, and 5-10 to
explore the sensitivity of our estimates to sample size variation.
Because we observed very little variation between subsampling regimes,
we present results only for a minimum of five samples (the number of
formalin-fixed replicates minus one) and a maximum of 10 samples (the
number of supernatant and pellet replicates). We observed significant
differences in nucleotide diversity estimates between technical
replicates (see Results); thus, to explore the impact of biases in
levels of missing data between replicates, we further filtered our data
to SNPs present in at least 95% of samples (strict filtering; 298
SNPs). Finally, for both datasets, we calculated the proportions of
non-missing SNP genotypes that were heterozygous and homozygous for
each replicate type.
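
The subsampling procedure and the heterozygosity summary can be sketched
in pure Python as below. This is a stand-in for illustration only; the
analyses themselves used vcftools and the STACKS populations module. The
sketch assumes biallelic diploid genotypes coded as alternate-allele
counts (0/1/2, None for missing), assumes that each iteration draws a
sample size uniformly between the minimum and maximum, and uses a
standard unbiased per-site estimator, pi = 2k(n - k) / (n(n - 1)), where
n is the number of sampled chromosomes and k the number of alternate
alleles. All function names and the toy data are hypothetical.

```python
import random

def site_pi(genotypes):
    """Per-site nucleotide diversity for one biallelic SNP."""
    calls = [g for g in genotypes if g is not None]
    n = 2 * len(calls)                # sampled chromosomes (diploid)
    if n < 2:
        return None                   # no genotyped individuals at this site
    k = sum(calls)                    # alternate-allele copies
    return 2 * k * (n - k) / (n * (n - 1))

def mean_pi(matrix, samples):
    """Mean per-site pi across SNPs for the chosen samples."""
    n_snps = len(matrix[samples[0]])
    values = [site_pi([matrix[s][j] for s in samples]) for j in range(n_snps)]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else float("nan")

def subsampled_pi(matrix, min_n, max_n, n_iter=100, seed=0):
    """Distribution of mean pi over repeated random subsamples."""
    rng = random.Random(seed)
    names = sorted(matrix)
    lo, hi = min(min_n, len(names)), min(max_n, len(names))
    return [mean_pi(matrix, rng.sample(names, rng.randint(lo, hi)))
            for _ in range(n_iter)]

def het_fraction(genotypes):
    """Proportion of non-missing genotypes that are heterozygous."""
    calls = [g for g in genotypes if g is not None]
    return sum(g == 1 for g in calls) / len(calls) if calls else float("nan")

# Hypothetical toy data for one replicate type (six replicates, four SNPs).
toy = {
    "r1": [0, 1, 2, 0], "r2": [0, 1, 1, None], "r3": [1, 1, 2, 0],
    "r4": [0, 2, 2, 0], "r5": [0, 1, None, 1], "r6": [1, 0, 2, 0],
}
pis = subsampled_pi(toy, min_n=5, max_n=10, n_iter=100, seed=1)
print(min(pis), max(pis))
```

Fixing the random seed makes the 100 subsamples reproducible, which is
useful when comparing distributions between replicate types.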