Predictors of formalin-fixed sample success
The amount of template DNA extracted from formalin-fixed samples weakly predicted the number of high-quality SNPs (>5x coverage; adjusted R² = 0.28), and the fitted relationship suggests that ~200 ng of extracted template was needed to recover >25% of SNPs at >5x coverage (Fig. 2A; S1; S5). For samples that give low DNA yield, we therefore recommend dividing extractions into multiple replicates (from several pieces of tissue) to increase the total DNA recovered. However, a greater amount of extracted DNA does not necessarily ensure downstream success, because a variety of factors can degrade DNA quality in formalin-fixed samples, including specimen age, exposure to UV, temperature, and length of formalin exposure (Hykin et al. 2015; Sawyer et al. 2012). Historical samples typically contain highly fragmented DNA (Pääbo 1989; Ewart et al. 2019), and this could affect library preparation if most fragments are too short for target probes to bind efficiently, even if relatively high amounts of DNA were extracted. The large genomes of amphibians may also require higher extraction yields (~200 ng in this study) to successfully capture genome-wide targets (McCartney-Melstad et al. 2016), whereas studies of formalin-fixed reptiles have reported successful sequence capture with as little as 1–3 ng/μl (Hykin et al. 2015; Ruane & Austin 2017).
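For readers who wish to compute a comparable statistic on their own extractions, the adjusted R² reported above can be obtained as in the minimal sketch below. It assumes a simple linear model with a single predictor (DNA yield) fitted by ordinary least squares; the exact model specification used in this study is not restated here, and the function name and the toy values are hypothetical illustrations, not data from this study.

```python
def adjusted_r_squared(x, y, n_predictors=1):
    """Fit y = a + b*x by ordinary least squares and return adjusted R^2.

    x: predictor values (e.g. ng of extracted template DNA)
    y: response values (e.g. number of SNPs at >5x coverage)
    """
    n = len(x)
    if n != len(y) or n <= n_predictors + 1:
        raise ValueError("need equal-length inputs with n > n_predictors + 1")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Least-squares slope and intercept for a single predictor.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Ordinary R^2 from residual and total sums of squares.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot

    # Penalize for the number of predictors relative to sample size.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


# Toy values only (NOT data from this study).
toy_yield = [1, 2, 3, 4]
toy_snps = [1, 3, 2, 4]
toy_adj_r2 = adjusted_r_squared(toy_yield, toy_snps)
```

With small sample sizes such as the ten formalin-fixed samples discussed here, the adjustment can lower R² appreciably, which is one reason to report the adjusted value.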
In addition, formalin-fixed sample extractions may contain high levels of exogenous DNA, particularly when endogenous DNA yield is low. In the four formalin-fixed samples that recovered <10% of SNPs, levels of exogenous sequence were all >30%, and as high as 81%. The other six samples yielded >94% endogenous sequence, suggesting that the level of exogenous sequence is a strong predictor of sample success. Rates of exogenous DNA from fluid-preserved specimens have not been quantified in many studies, but Hykin et al. (2015) found low rates of exogenous sequence in a formalin-fixed lizard (only 0.27% of reads). By contrast, Lyra et al. (2020) extracted DNA from ethanol-preserved frogs and identified a high proportion of bacterial reads (based on BLAST search) and a low fraction of endogenous sequence (<0.5% of reads mapped to a closely related reference transcriptome). Thus, it remains an open question how much endogenous DNA should be expected from formalin-fixed extractions. Two of the samples in this study with high rates of contamination were larval samples that had been stored in formalin for several years. The other two samples were adult specimens, and we are uncertain whether the contamination occurred prior to or during tissue subsampling, or whether the tissue subsamples had such low usable DNA that any exogenous DNA present was preferentially amplified (Pääbo 1989).
Another factor that may impact sample outcomes is the tissue type used for extractions. Studies seeking to extract DNA from formalin-fixed samples typically sample liver or muscle tissue (Hykin et al. 2015; Ruane & Austin 2017; Pierson et al. 2020). Hykin et al. (2015) compared extraction success between these two tissue types and extracted higher yields from the liver replicates of Anolis lizard samples. Ruane and Austin (2017) successfully extracted DNA from snake liver tissues, while Pierson et al. (2020) were unable to extract suitable DNA for PCR or library preparation from salamander tail muscle. Here we compared success between muscle and liver replicates of specimen USNM 525133. We inferred roughly twice the rate of human contamination in the liver replicate (6.3%) as in the muscle replicate (2.9%), but by all other measures the liver replicate outperformed the muscle replicate, including total DNA extracted, fragment length, total loci, total SNPs, and average coverage. Taken together, these results suggest that DNA in formalin-fixed specimens may remain better preserved in liver than in muscle tissue, but future studies could test this hypothesis with larger sample sizes and with samples of various ages.