Bioinformatic pipeline
We first cleaned and filtered raw reads using Trim Galore! v.0.6.4 (Krueger, 2015) under default settings. We screened our cleaned reads for contamination using FastQ Screen v.0.14.1 (Wingett & Andrews, 2018) by calculating the proportion of reads that mapped to either the human genome (GRCh38) or a database of representative bacterial genomes (11k+ genomes) from the NCBI RefSeq database using bowtie2 v.2.3.5 (Langmead & Salzberg, 2012). We then used FastQ Screen (no hits command) to remove reads that mapped to either the human or bacterial genomes. We conducted all further informatics using the Python pipeline SECAPR v.1.1.15 (Andermann et al. 2018), which is an extension of the Phyluce pipeline (Faircloth, 2015), as well as custom Python scripts (github.com/kyleaoconnell22/gyro_hDNA. We generated data assemblies using two different methods (Fig. 1). First, we used SECAPR reference_assembly with the cleaned reads of all samples, mapping to our target RADseq loci (probe dataset) as our reference (min coverage = 5; mean length = 135 bp). After finding significant differences in SNP calls between the RADseq and capture-based technical replicates (see Results), we took a de novo assembly-based approach to generate a SNP matrix. We hypothesized that including flanking regions (sequence adjacent to segments targeted by probes) would help filter out potential paralogs and result in more consistent SNP calls between RADseq and capture-based replicates. We assembled the 10 supernatant samples into contigs using Abyss v.1.3.7 (kmer = 90; Simpson et al. 2009). We matched assembled contigs (excluding USNM 525251 which assembled poorly) to target sequences using the ‘secapr find_target_contigs’ command with min-coverage = 35 (contigs were much longer than target loci), min-identity = 80 and the keep-paralogs flag. We aligned recovered markers using MAFFT v.7.130b (no-trim and ambiguous flags; Katoh et al. 2002), then created a consensus of each alignment using the cons command (plurality 0.1, setcase 0.1) from the EMBOSS program (Madeira et al. 2019) run within the secapr reference_assembly script. We mapped the cleaned reads of all samples to this new reference (contig dataset) using the ‘reference_assembly’ script (min coverage = 5; mean length = 454 bp).
For both the probe and contig datasets, we corrected for formalin-induced deamination using mapDamage2 v.2.0.6 (Jónsson et al. 2013), a computational framework that tracks and quantifies DNA damage patterns in ancient and historical DNA sequence. To quantify nucleotide damage we calculated the average rate of C > T and G > A misincorporations for each datatype across the first 25 bp of each read calculated by mapDamage2.
For the probe dataset, we called SNPs from our corrected, unphased bam files using BCFtools v.1.9 (Danecek & McCarthy, 2017), using the mpileup command with maximum coverage of 1000x (-A flag). We further pruned putative SNPs within 3 bp of an indel and removed clusters of indels with five or fewer bp between them. We used VCFtools v.0.1.16 (Danecek et al. 2011) to remove SNPs with quality scores < 30, non-biallelic SNPs, minor allele frequency > 0.015, minor allele count = 2, and minimum coverage < 5x. We further filtered for allele balance (AB > 0.25 & AB < 0.75) and required at least 70% of samples present at a site. This approach yielded 19,601 high-quality SNPs for the probe dataset. We chose one SNP per locus with the least missing data, leaving 2337 unlinked SNPs from 37 samples. For most analyses we removed four formalin-fixed samples (two from each geographic region) with <10% of SNPs, resulting in a matrix of 33 samples and 2337 unlinked SNPs with 9.1% missing data (calculated following de Medeiros & Farrell (2018)).
Initial SNP calling in the contig dataset followed the same pipeline as above but to ensure comparability with the probe dataset, we used BEDTOOLS (Quinlan and Hall, 2010) and VCFtools to prune SNPs to those called from within the original coordinates of the RADseq loci, leaving 24,005 SNPs. We chose one SNP per locus, retaining the SNP with the least missing data, yielding a final matrix of 3997 unlinked SNPs for 33 samples with 21% missing data.