FIGURE LEGENDS
Figure 1: Bioinformatic workflow used to generate SNP datasets in this study. The probe dataset included SNPs generated by mapping directly to RADseq-based target loci, while the contig dataset was generated by creating a de novo reference from assembled contigs and pruning SNPs to only those present on the original RADseq loci.
Figure 2: Summary of capture success by sample replicate for the contig dataset of 3997 unlinked SNPs. A) Relationship between extracted DNA (log scale) and number of SNPs in final matrix. B) Relationship between extracted DNA and percent exogenous sequence for formalin-fixed samples, showing that samples below ~200 ng had high levels of exogenous sequence. C) Mean coverage by replicate. Formalin-fixed replicates had significantly lower coverage. D) Percentages of reads that mapped to target sequences by replicate. Formalin-fixed and RADseq replicates were significantly different from supernatant and pellet replicates, but not from one another.
Figure 3: Results of allelic mismatch analyses for probe (A,B) versus contig (C,D) datasets. A) Percent of SNPs with heterozygous/homozygous differences between technical replicates and B) Percent of SNPs with homozygous differences in the probe dataset with 2337 SNPs. C) Percent of SNPs with heterozygous/homozygous differences and D) Percent of SNPs with homozygous differences in the contig dataset with 3997 SNPs. Note that the probe dataset (A, B) shows consistently higher allelic differences in comparisons of RADseq and capture-based replicates whereas the contig dataset (C, D) shows similar allelic differences across all comparisons. Data for individual comparisons are given in full in Table 2. Abbreviations are S = supernatant, P = pellet, F = formalin-fixed, R = RADseq.
Figure 4: Maximum likelihood phylogeny with the highest log-likelihood for 3997 unlinked concatenated SNPs from the contig dataset estimated in IQ-TREE with 100 rapid bootstrap replicates. Gray circles represent nodes with >70% bootstrap support. Tip shapes represent replicate-type and are colored by geography (blue = Virginia, red = Ohio). Bars to the right of the phylogeny show the proportion of SNPs for each sample. All replicates cluster by specimen and by geography despite high levels of missing data in some replicates.
Figure 5: Principal component analysis (PCA) for the contig dataset. Data points are colored by geography (blue = Virginia, red = Ohio), shapes correspond to replicate-type, and size corresponds to data missingness, with larger shapes missing more data. Gray arrows highlight RADseq technical replicates. A) PCA of 33 samples and 3997 SNPs. Samples cluster by geography along PC1 and by missing data and replicate type along PC2. B) PCA of 27 samples with >70% of the 3997 SNPs. Even with low levels of missing data, differences in clustering between RADseq and capture-based replicates are apparent.
Figure 6: Estimates of nucleotide diversity by replicate type for 3997 SNPs from the contig dataset based on 100 estimates with sample sizes ranging from five to ten individuals per replicate. Mean estimates of nucleotide diversity for formalin-fixed and RADseq replicates were significantly lower than those for supernatant and pellet replicates.
Supporting Information Figures:
Figure S1: Regression of extracted DNA against proportion of SNPs in the contig dataset for formalin-fixed samples.
Figure S2: Percentage of heterozygous differences between replicate pairs for the probe (A) and contig (B) datasets. The percentage reflects the percentage of heterozygous calls in the first named replicate type for a given comparison (e.g., percentage for S-P comparison is the percentage of discordant SNP calls for which the supernatant was heterozygous and the pellet was homozygous). The neutral expectation would be close to an even 50/50 split. Abbreviations are S = supernatant, P = pellet, F = formalin-fixed, R = RADseq.
Figure S3: A) Nucleotide diversity estimates from 298 SNPs present in >95% of individuals from the contig dataset. B) Estimates of nucleotide diversity by replicate type for the probe dataset based on 2337 SNPs. Mean estimates of nucleotide diversity were significantly different for all comparisons.
Figure S4: Percent of non-missing sites with heterozygous SNP calls for each replicate type for probe (A) and contig (B) datasets. Heterozygosity is reduced in the formalin-fixed replicates (but not significantly) compared with supernatant and pellet replicates. RADseq samples are significantly more homozygous. Note that the probe dataset has higher levels of heterozygosity in the capture-based replicates.
Figure S5: Summary of capture success by sample replicate for the probe dataset. A) Regression of extracted DNA (log scale) and number of SNPs in final matrix. B) Regression of extracted DNA and percent exogenous sequence for formalin-fixed samples, showing that samples above ~200 ng had high levels of endogenous sequence. C) Mean coverage by replicate. RADseq replicates had significantly higher coverage than the capture-based replicates. D) Percentages of reads that mapped to target sequences by replicate. Formalin-fixed replicates were significantly different from all other replicates.
Figure S6: Maximum likelihood phylogeny with the highest log-likelihood for 2337 unlinked concatenated SNPs from the probe dataset estimated in IQ-TREE with 100 rapid bootstrap replicates. Gray circles represent nodes with >70% bootstrap support. Tip shapes represent replicate-type and are colored by geography (blue = Virginia, red = Ohio). Bars to the right of the phylogeny show the proportion of SNPs for each sample. All capture samples cluster by replicate, except USNM 525251, which had high levels of missing data. Within each geographic clade, all RADseq samples cluster together, except USNM 525139. This differs from the contig dataset shown in Fig. 4. Replicates that do not cluster where expected are shown in bold.
Figure S7: Principal component analysis (PCA) for the probe dataset. Data points are colored by geography (blue = Virginia, red = Ohio), shapes correspond to replicate-type, and size corresponds to data missingness, with larger shapes missing more data. A) PCA of 33 samples and 2337 SNPs. Samples cluster by missing data and replicate type along PC1, and by geography along PC2. B) PCA of 27 samples with >70% of SNPs. Even with low levels of missing data, differences in clustering between RADseq and capture-based replicates persist.
Figure S8: Principal component analysis (PCA) of 713 SNPs and 25 samples from the contig dataset, excluding RADseq replicates and pruned of SNPs missing from >10% of individuals. Samples cluster by geography on PC1 and by amounts of missing data along PC2, with formalin-fixed samples exhibiting the highest levels of missing data separated from the rest.