Analyzing RADseq and target capture samples together
We compared two assembly strategies for generating SNP data, mapping
directly to target sequence (probe dataset) and assembling reads into
contigs and generating a pseudo-reference alignment (contigs dataset).
Although many target-capture studies utilize a variation of the contig
approach (Faircloth et al. 2016; Andermann et al. 2018), studies using
variations of the RAD-capture approach often map reads directly to the
target RADseq loci (Hoffberg et al. 2016; Lang et al. 2020), or a
reference genome (Ali et al. 2016), although Suchan et al. (2016) tested
variations of both approaches. We found that the contig method generated
almost twice as many SNPs (3997 vs. 2337), even when pruning to only the
coordinates of the original RADseq loci. However, the contig method
produced higher rates of missing data (21% vs. 9.1%) which may be
attributable to spurious mapping in the probe dataset.
When only comparing the target capture replicates, both the probe and
contig datasets performed similarly in phylogenomic and population
genomic analyses. However, when combining capture and RADseq replicates,
these two assembly strategies diverged significantly. Notably,
phylogenomic analysis with the probe dataset separated the RADseq
replicates from the capture-based replicates (Fig. S6), whereas the
contig dataset placed replicate samples from the same specimen together
correctly in the phylogeny (Fig. 4). Similarly, the probe dataset
contained elevated heterozygous and homozygous SNP call differences
between capture and RADseq replicates (Fig. 3A, B), but this pattern was
not observed in the contig dataset (Fig. 3C, D). We propose that this
difference between probe and contig datasets is best explained by better
read mapping to the longer reference sequences (454 vs. 132 bp), which
may have excluded paralogous sequences more efficiently and increased
accuracy of SNP calls. With both approaches we infer strong heterozygote
bias in the capture-based replicates compared to the RADseq replicates,
which may be explained by differences in genomic library preparation
method. Unlike the standard library preparation for RADseq, DNA for
target capture was not digested with restriction enzymes or size
selected, thus all genomic DNA was available for capture. This could
permit more paralogous sequences in the capture-based replicates despite
our data assembly method, thus introducing spurious heterozygous calls
at homozygous sites. This explanation would indicate that the RADseq
data have the “correct” calls for heterozygous sites, and it may
explain the difference in heterozygosity for the capture-based
replicates between the probe and contig datasets. Alternatively,
polymorphism at restriction cut sites (allelic dropout) in the RADseq
data can produce spurious homozygous calls, whereas the capture-based
sequence would have a “correct” heterozygous call (Luca, Hudson,
Witonsky & Di Rienzo, 2011; Gautier et al. 2013; Puritz et al. 2014).
We propose that both these processes may be contributing to the
heterozygous/homozygous imbalance we observed with a homozygous bias in
RADseq replicates causing underestimates of genetic diversity (Luca et
al. 2011; Cariou, M., Duret, L., & Charlat, 2016; Heller et al., 2021).
Although the contig dataset ameliorated some of the biases caused by
mapping directly to the target loci, analyses with the contig and probe
datasets still separated the RADseq and capture-based replicates in PC
space and generated significantly different estimates of nucleotide
diversity, regardless of SNP and individual filtering regimes. Lang et
al. (2020) pooled modern RADseq and historical target-capture samples
(using a hyRAD approach) and also inferred clustering differences
between library types. By removing variants potentially derived from
hDNA deamination (CT/TC and AG/GA), Lang et al. (2020) found that DNA
damage did not explain differential clustering of sample type. Taken
together with our study, this suggests that certain biases in SNP
calling may persist between RADseq and RAD-capture data, but some of
these biases can be reduced by mapping to either a contig-based
reference (this study; Suchan et al. 2015) or a reference genome (Ali et
al, 2016). Another solution for studies that include both hDNA and
modern samples would be to use a capture approach on all samples rather
than combining RADseq for modern samples and capture for hDNA samples,
although this would increase project costs substantially (or a RADseq
only approach as in Ewart et al., 2019).