FIGURE LEGENDS
Figure 1: Bioinformatic workflow used to generate SNP datasets in this
study. The probe dataset included SNPs generated by mapping directly to
RADseq-based target loci, while the contig dataset was generated by
creating a de novo reference from assembled contigs and pruning SNPs to
only those present on the original RADseq loci.
Figure 2: Summary of capture success by sample replicate for the contig
dataset of 3997 unlinked SNPs. A) Relationship between extracted DNA
(log scale) and number of SNPs in final matrix. B) Relationship between
extracted DNA and percent exogenous sequence for formalin-fixed samples,
showing that samples below ~200 ng had high levels of
exogenous sequence. C) Mean coverage by replicate. Formalin-fixed
replicates had significantly lower coverage. D) Percentages of reads
that mapped to target sequences by replicate. Formalin-fixed and RADseq
replicates were significantly different from supernatant and pellet
replicates, but not from one another.
Figure 3: Results of allelic mismatch analyses for probe (A, B) versus
contig (C, D) datasets. A) Percent of SNPs with heterozygous/homozygous
differences between technical replicates and B) Percent of SNPs with
homozygous differences in the probe dataset with 2337 SNPs. C) Percent
of SNPs with heterozygous/homozygous differences and D) Percent of SNPs
with homozygous differences in the contig dataset with 3997 SNPs. Note
that the probe dataset (A, B) shows consistently higher allelic
differences in comparisons of RADseq and capture-based replicates
whereas the contig dataset (C, D) shows similar allelic differences
across all comparisons. Data for individual comparisons are given in full
in Table 2. Abbreviations are S = supernatant, P = pellet, F =
formalin-fixed, R = RADseq.
Figure 4: Maximum likelihood phylogeny with the highest log-likelihood
for 3997 unlinked concatenated SNPs from the contig dataset estimated in
IQ-TREE with 100 rapid bootstrap replicates. Gray circles represent
nodes with >70% bootstrap support. Tip shapes represent
replicate-type and are colored by geography (blue = Virginia, red =
Ohio). Bars to the right of the phylogeny show the proportion of SNPs
for each sample. All replicates cluster by specimen and by geography
despite high levels of missing data in some replicates.
Figure 5: Principal component analysis (PCA) for the contig dataset.
Data points are colored by geography (blue = Virginia, red = Ohio),
shapes correspond to replicate-type, and size corresponds to data
missingness, with larger shapes missing more data. Gray arrows highlight
RADseq technical replicates. A) PCA of 33 samples and 3997 SNPs. Samples
cluster by geography along PC1 and by missing data and replicate type along
PC2. B) PCA of 27 samples with >70% of the 3997 SNPs. Even
with low levels of missing data, differences in clustering between
RADseq and capture-based replicates are apparent.
Figure 6: Estimates of nucleotide diversity by replicate type for 3997
SNPs from the contig dataset based on 100 estimates with sample sizes
ranging from five to ten individuals per replicate. Mean estimates of
nucleotide diversity for formalin-fixed and RADseq replicates were
significantly lower than those for supernatant and pellet replicates.
Supporting Information Figures:
Figure S1: Regression of extracted DNA against proportion of SNPs in the
contig dataset for formalin-fixed samples.
Figure S2: Percentage of heterozygous differences between replicate
pairs for the probe (A) and contig (B) datasets. The percentage reflects
the percentage of heterozygous calls in the first named replicate type
for a given comparison (e.g., percentage for S-P comparison is the
percentage of discordant SNP calls for which the supernatant was
heterozygous and the pellet was homozygous). The neutral expectation
would be close to an even 50/50 split. Abbreviations are S =
supernatant, P = pellet, F = formalin-fixed, R = RADseq.
Figure S3: A) Nucleotide diversity estimates from 298 SNPs present in >95%
of individuals from the contig dataset. B) Estimates of nucleotide
diversity by replicate type for the probe dataset based on 2337 SNPs.
Mean estimates of nucleotide diversity were significantly different for
all comparisons.
Figure S4: Percent of non-missing sites with heterozygous SNP calls for
each replicate type for probe (A) and contig (B) datasets.
Heterozygosity is reduced in the formalin-fixed replicates, though not
significantly, compared with supernatant and pellet replicates. RADseq
samples are significantly more homozygous. Note that the probe dataset
has higher levels of heterozygosity in the capture-based replicates.
Figure S5: Summary of capture success by sample replicate for the probe
dataset. A) Regression of extracted DNA (log scale) and number of SNPs
in final matrix. B) Regression of extracted DNA and percent exogenous
sequence for formalin-fixed samples, showing that samples above
~200 ng had high levels of endogenous sequence. C) Mean
coverage by replicate. RADseq replicates had significantly higher
coverage than the capture-based replicates. D) Percentages of reads that
mapped to target sequences by replicate. Formalin-fixed replicates were
significantly different from all other replicates.
Figure S6: Maximum likelihood phylogeny with the highest log-likelihood
for 2337 unlinked concatenated SNPs from the probe dataset estimated in
IQ-TREE with 100 rapid bootstrap replicates. Gray circles represent
nodes with >70% bootstrap support. Tip shapes represent
replicate-type and are colored by geography (blue = Virginia, red =
Ohio). Bars to the right of the phylogeny show the proportion of SNPs
for each sample. All capture samples cluster by replicate, except USNM
525251, which had high levels of missing data. Within each geographic clade,
all RADseq samples cluster together, except USNM 525139. This differs
from the contig dataset shown in Fig. 4. Replicates that do not cluster
where expected are shown in bold.
Figure S7: Principal component analysis (PCA) for the probe dataset.
Data points are colored by geography (blue = Virginia, red = Ohio),
shapes correspond to replicate-type, and size corresponds to data
missingness, with larger shapes missing more data. A) PCA of 33 samples
and 2337 SNPs. Samples cluster by missing data and replicate type along
PC1, and by geography along PC2. B) PCA of 27 samples with
>70% of SNPs. Even with low levels of missing data,
differences in clustering between RADseq and capture-based replicates
persist.
Figure S8: Principal component analysis (PCA) of 713 SNPs and 25 samples
from the contig dataset excluding RADseq replicates and pruned of SNPs
missing from >10% of individuals. Samples cluster by
geography on PC1 and by amounts of missing data along PC2, with
formalin-fixed samples exhibiting the highest levels of missing data
separated from the rest.