Analyzing RADseq and target capture samples together
We compared two assembly strategies for generating SNP data, mapping directly to target sequence (probe dataset) and assembling reads into contigs and generating a pseudo-reference alignment (contigs dataset). Although many target-capture studies utilize a variation of the contig approach (Faircloth et al. 2016; Andermann et al. 2018), studies using variations of the RAD-capture approach often map reads directly to the target RADseq loci (Hoffberg et al. 2016; Lang et al. 2020), or a reference genome (Ali et al. 2016), although Suchan et al. (2016) tested variations of both approaches. We found that the contig method generated almost twice as many SNPs (3997 vs. 2337), even when pruning to only the coordinates of the original RADseq loci. However, the contig method produced higher rates of missing data (21% vs. 9.1%) which may be attributable to spurious mapping in the probe dataset.
When only comparing the target capture replicates, both the probe and contig datasets performed similarly in phylogenomic and population genomic analyses. However, when combining capture and RADseq replicates, these two assembly strategies diverged significantly. Notably, phylogenomic analysis with the probe dataset separated the RADseq replicates from the capture-based replicates (Fig. S6), whereas the contig dataset placed replicate samples from the same specimen together correctly in the phylogeny (Fig. 4). Similarly, the probe dataset contained elevated heterozygous and homozygous SNP call differences between capture and RADseq replicates (Fig. 3A, B), but this pattern was not observed in the contig dataset (Fig. 3C, D). We propose that this difference between probe and contig datasets is best explained by better read mapping to the longer reference sequences (454 vs. 132 bp), which may have excluded paralogous sequences more efficiently and increased accuracy of SNP calls. With both approaches we infer strong heterozygote bias in the capture-based replicates compared to the RADseq replicates, which may be explained by differences in genomic library preparation method. Unlike the standard library preparation for RADseq, DNA for target capture was not digested with restriction enzymes or size selected, thus all genomic DNA was available for capture. This could permit more paralogous sequences in the capture-based replicates despite our data assembly method, thus introducing spurious heterozygous calls at homozygous sites. This explanation would indicate that the RADseq data have the “correct” calls for heterozygous sites, and it may explain the difference in heterozygosity for the capture-based replicates between the probe and contig datasets. Alternatively, polymorphism at restriction cut sites (allelic dropout) in the RADseq data can produce spurious homozygous calls, whereas the capture-based sequence would have a “correct” heterozygous call (Luca, Hudson, Witonsky & Di Rienzo, 2011; Gautier et al. 2013; Puritz et al. 2014). We propose that both these processes may be contributing to the heterozygous/homozygous imbalance we observed with a homozygous bias in RADseq replicates causing underestimates of genetic diversity (Luca et al. 2011; Cariou, M., Duret, L., & Charlat, 2016; Heller et al., 2021).
Although the contig dataset ameliorated some of the biases caused by mapping directly to the target loci, analyses with the contig and probe datasets still separated the RADseq and capture-based replicates in PC space and generated significantly different estimates of nucleotide diversity, regardless of SNP and individual filtering regimes. Lang et al. (2020) pooled modern RADseq and historical target-capture samples (using a hyRAD approach) and also inferred clustering differences between library types. By removing variants potentially derived from hDNA deamination (CT/TC and AG/GA), Lang et al. (2020) found that DNA damage did not explain differential clustering of sample type. Taken together with our study, this suggests that certain biases in SNP calling may persist between RADseq and RAD-capture data, but some of these biases can be reduced by mapping to either a contig-based reference (this study; Suchan et al. 2015) or a reference genome (Ali et al, 2016). Another solution for studies that include both hDNA and modern samples would be to use a capture approach on all samples rather than combining RADseq for modern samples and capture for hDNA samples, although this would increase project costs substantially (or a RADseq only approach as in Ewart et al., 2019).