Phylogenomic and population genomic analyses
Using both the probe and contig SNP datasets, we evaluated the consistency of technical replicates in a variety of applications along the phylogeography-phylogenetics continuum (Edwards et al. 2016). We first explored the utility of replicates in a phylogenetic context, expecting that technical replicates of the same specimen would cluster together in the phylogeny. We used IQ-TREE v.1.6.12 (Nguyen et al. 2015) to construct a maximum likelihood phylogeny of concatenated SNPs with 100 ultrafast bootstrap replicates under the BIC best-fit model, TVM+F+R4 (selected by the IQ-TREE ModelFinder analysis).
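The clustering expectation described above can be checked programmatically. The following is a minimal sketch, not the procedure used in this study: it assumes the tree has already been read into nested Python tuples, and the function names and sample labels are hypothetical. Because a maximum likelihood tree of this kind is unrooted, a set of replicates is treated as clustering together if either the set or its complement forms a clade.

```python
def leaf_sets(tree, out):
    """Return the set of leaf labels under `tree` and append the leaf set
    of every node (tips included) to the list `out`."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def replicates_cluster(tree, replicates):
    """True if `replicates` form an exclusive group in the tree.

    The tree is treated as unrooted, so the group counts as a cluster
    if either the group or its complement is a clade.
    """
    clades = []
    all_leaves = leaf_sets(tree, clades)
    target = frozenset(replicates)
    return target in clades or (all_leaves - target) in clades


# Hypothetical labels: specimen ID plus replicate type.
example_tree = ((("A_formalin", "A_supernatant"), "A_pellet"),
                (("B_formalin", "B_radseq"), "C_formalin"))

print(replicates_cluster(example_tree, ["A_formalin", "A_supernatant", "A_pellet"]))
print(replicates_cluster(example_tree, ["A_formalin", "C_formalin"]))
```

In this toy tree the three replicates of specimen A form an exclusive group, whereas the second query mixes two specimens and does not.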
Using both datasets, we explored model-free population structure estimation using principal component analyses (PCA) with the dudi.pca function implemented in the R package ‘ade4’ v.1.7.15 (Dray & Dufour, 2007) for all replicates and specimens. We also explored the impact of samples with high levels of missing data on PC space by filtering the contig and probe datasets to exclude replicates with <70% of SNPs (n = 27). Finally, to explore the effect of unequal missingness between capture-based replicates on PC space, we filtered the contig dataset to exclude RADseq replicates and the supernatant replicate of USNM 525151, which was an outlier in PC space, and pruned SNPs missing from >10% of samples, leaving 713 SNPs (n = 25).
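The two missing-data filters and the PCA described above can be illustrated with a short NumPy sketch. This is an analogue for illustration only and does not reproduce the ade4 implementation: the function names are hypothetical, genotypes are assumed to be coded as alternate-allele counts (0, 1, 2) with NaN for missing calls, and missing values are replaced with the per-SNP mean before centering and scaling, which is an assumption because the treatment of missing genotypes is not specified in the text.

```python
import numpy as np


def filter_samples(geno, min_call_rate=0.70):
    """Drop samples (rows) genotyped at fewer than `min_call_rate` of SNPs."""
    call_rate = 1.0 - np.isnan(geno).mean(axis=1)
    keep = call_rate >= min_call_rate
    return geno[keep], keep


def prune_snps(geno, max_missing=0.10):
    """Drop SNPs (columns) missing from more than `max_missing` of samples."""
    missing = np.isnan(geno).mean(axis=0)
    keep = missing <= max_missing
    return geno[:, keep], keep


def pca(geno, n_components=2):
    """Centered, scaled PCA via SVD after mean-imputing missing genotypes.

    Assumes every SNP has at least one called genotype.
    """
    col_means = np.nanmean(geno, axis=0)
    filled = np.where(np.isnan(geno), col_means, geno)
    centered = filled - filled.mean(axis=0)
    sd = centered.std(axis=0)
    variable = sd > 0  # invariant SNPs carry no signal and cannot be scaled
    scaled = centered[:, variable] / sd[variable]
    u, s, _ = np.linalg.svd(scaled, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]


# Toy matrix: 4 samples x 4 SNPs (values are illustrative only).
geno = np.array([[0, 1, 2, np.nan],
                 [2, 1, 0, 0],
                 [np.nan, np.nan, np.nan, 1],
                 [1, 0, 2, 2]], dtype=float)
kept, sample_mask = filter_samples(geno, min_call_rate=0.70)
pruned, snp_mask = prune_snps(kept, max_missing=0.10)
scores, explained = pca(pruned)
```

Applying the sample filter before the SNP filter matters, because removing poorly genotyped samples changes the per-SNP missingness on which the second filter operates.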
Using both datasets, we compared estimates of nucleotide diversity between technical replicates (capture-based formalin-fixed, supernatant, and frozen pellet SNPs, and RADseq SNPs) following O’Connell et al. (2020) by implementing permutation tests. We randomly subsampled the data for each replicate type to between a minimum and a maximum number of samples using vcftools and estimated nucleotide diversity using the populations module in STACKS v.2.54 (Catchen et al. 2013). We repeated this procedure 100 times and plotted the resulting distributions in R (R Core Team). We tested minimum and maximum sample values of 2-10, 5-6, and 5-10 to explore the sensitivity of our estimates to sample size variation. Because we observed very little variation between subsampling regimes, we only present results for a minimum of five samples (the number of formalin-fixed replicates minus 1) and a maximum of 10 samples (the number of supernatant and pellet replicates). We observed significant differences in nucleotide diversity estimates between technical replicates (see Results); thus, to explore the impact of biases in levels of missing data between replicates, we further filtered our data to SNPs present in at least 95% of samples (strict filtering; 298 SNPs). Further, for both datasets, we calculated the proportion of non-missing SNPs that were heterozygous or homozygous for each replicate type.
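The subsampling procedure and the heterozygosity summary described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the vcftools and STACKS workflow itself: the function names are hypothetical, genotypes are assumed to be biallelic alternate-allele counts (0, 1, 2) with None for missing calls, the subsample size is assumed to be drawn uniformly between the minimum and maximum in each iteration, and per-site nucleotide diversity uses the standard unbiased estimator based on the probability that two randomly drawn allele copies differ.

```python
import random
from math import comb
from statistics import mean


def site_pi(genotypes):
    """Unbiased per-site nucleotide diversity for one biallelic SNP."""
    called = [g for g in genotypes if g is not None]
    n = 2 * len(called)  # number of sampled allele copies (diploid)
    if n < 2:
        return None
    alt = sum(called)
    ref = n - alt
    # 1 minus the probability that two sampled copies are identical
    return 1.0 - (comb(ref, 2) + comb(alt, 2)) / comb(n, 2)


def mean_pi(samples):
    """Mean per-site diversity across SNPs; `samples` is a list of
    equal-length per-sample genotype lists."""
    values = [site_pi(site) for site in zip(*samples)]
    values = [v for v in values if v is not None]
    return mean(values) if values else float("nan")


def subsample_pi(samples, min_n, max_n, n_iter=100, seed=0):
    """Repeatedly subsample between `min_n` and `max_n` samples without
    replacement and return the diversity estimate from each iteration."""
    rng = random.Random(seed)
    upper = min(max_n, len(samples))
    lower = min(min_n, upper)
    estimates = []
    for _ in range(n_iter):
        k = rng.randint(lower, upper)
        estimates.append(mean_pi(rng.sample(samples, k)))
    return estimates


def het_proportion(genotypes):
    """Proportion of one sample's non-missing SNPs that are heterozygous."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return float("nan")
    return sum(g == 1 for g in called) / len(called)
```

Running `subsample_pi` once per replicate type with the same bounds (for example, 5 and 10) yields one distribution per type, and these distributions can then be compared. The homozygous proportion for a sample is one minus the value returned by `het_proportion`.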