Bioinformatic pipeline
We first cleaned and filtered raw reads using Trim Galore! v.0.6.4
(Krueger, 2015) under default settings. We screened our cleaned reads
for contamination using FastQ Screen v.0.14.1 (Wingett & Andrews, 2018)
by calculating the proportion of reads that mapped to either the human
genome (GRCh38) or a database of representative bacterial genomes (11k+
genomes) from the NCBI RefSeq database using bowtie2 v.2.3.5 (Langmead
& Salzberg, 2012). We then used FastQ Screen (no hits command) to
remove reads that mapped to either the human or bacterial genomes. We
conducted all further informatics using the Python pipeline SECAPR
v.1.1.15 (Andermann et al. 2018), which is an extension of the Phyluce
pipeline (Faircloth, 2015), as well as custom Python scripts
(github.com/kyleaoconnell22/gyro_hDNA. We generated data assemblies
using two different methods (Fig. 1). First, we used SECAPR
reference_assembly with the cleaned reads of all samples, mapping to
our target RADseq loci (probe dataset) as our reference (min coverage =
5; mean length = 135 bp). After finding significant differences in SNP
calls between the RADseq and capture-based technical replicates (see
Results), we took a de novo assembly-based approach to generate a SNP
matrix. We hypothesized that including flanking regions (sequence
adjacent to segments targeted by probes) would help filter out potential
paralogs and result in more consistent SNP calls between RADseq and
capture-based replicates. We assembled the 10 supernatant samples into
contigs using Abyss v.1.3.7 (kmer = 90; Simpson et al. 2009). We matched
assembled contigs (excluding USNM 525251 which assembled poorly) to
target sequences using the ‘secapr find_target_contigs’ command with
min-coverage = 35 (contigs were much longer than target loci),
min-identity = 80 and the keep-paralogs flag. We aligned recovered
markers using MAFFT v.7.130b (no-trim and ambiguous flags; Katoh et al.
2002), then created a consensus of each alignment using the cons command
(plurality 0.1, setcase 0.1) from the EMBOSS program (Madeira et al.
2019) run within the secapr reference_assembly script. We mapped the
cleaned reads of all samples to this new reference (contig dataset)
using the ‘reference_assembly’ script (min coverage = 5; mean length =
454 bp).
For both the probe and contig datasets, we corrected for
formalin-induced deamination using mapDamage2 v.2.0.6 (Jónsson et al.
2013), a computational framework that tracks and quantifies DNA damage
patterns in ancient and historical DNA sequence. To quantify nucleotide
damage we calculated the average rate of C > T and G
> A misincorporations for each datatype across the first 25
bp of each read calculated by mapDamage2.
For the probe dataset, we called SNPs from our corrected, unphased bam
files using BCFtools v.1.9 (Danecek & McCarthy, 2017), using the
mpileup command with maximum coverage of 1000x (-A flag). We further
pruned putative SNPs within 3 bp of an indel and removed clusters of
indels with five or fewer bp between them. We used VCFtools v.0.1.16
(Danecek et al. 2011) to remove SNPs with quality scores < 30,
non-biallelic SNPs, minor allele frequency > 0.015,
minor allele count = 2, and minimum coverage < 5x. We further
filtered for allele balance (AB > 0.25 & AB <
0.75) and required at least 70% of samples present at a site. This
approach yielded 19,601 high-quality SNPs for the probe dataset. We
chose one SNP per locus with the least missing data, leaving 2337
unlinked SNPs from 37 samples. For most analyses we removed four
formalin-fixed samples (two from each geographic region) with
<10% of SNPs, resulting in a matrix of 33 samples and 2337
unlinked SNPs with 9.1% missing data (calculated following de Medeiros
& Farrell (2018)).
Initial SNP calling in the contig dataset followed the same pipeline as
above but to ensure comparability with the probe dataset, we used
BEDTOOLS (Quinlan and Hall, 2010) and VCFtools to prune SNPs to those
called from within the original coordinates of the RADseq loci, leaving
24,005 SNPs. We chose one SNP per locus, retaining the SNP with the
least missing data, yielding a final matrix of 3997 unlinked SNPs for 33
samples with 21% missing data.