Introduction
As next-generation (i.e., massively parallel, short-read) sequencing has
become the ubiquitous approach for population genetic and phylogenetic
investigations, SNP (single-nucleotide polymorphism) datasets have
quickly become the standard input data for a wide array of phylogenetic
and population genetic analyses (Toews et al., 2015). SNP datasets
typically contain thousands to millions of base-calls for each
individual sample (i.e., genotypes), located at thousands to millions of
variable sites (i.e., SNPs) throughout the genomes of the focal taxa. For
storing these genotype calls along with associated quality information,
the standardized and efficient vcf (variant call format) file has become
the widely accepted standard (Danecek et al., 2011). By retaining only
variant sites (i.e., called SNPs), large population genomic datasets can
be stored in vcf files that remain manageable in terms of both file size
and the computational time required for downstream analyses.
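For illustration, a minimal vcf record might look like the following (a contrived two-sample example; real files contain extensive additional header metadata):

    ##fileformat=VCFv4.2
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
    #CHROM  POS   ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  sample1  sample2
    chr1    1205  .   A    G    60    PASS    .     GT:DP   0/1:22   1/1:15

Each data row records a single variant site, and each sample column stores the genotype call (GT field) alongside associated quality information (here, the per-genotype read depth, DP).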
As SNP datasets stored as vcf files have become standard input for
population genetic and phylogenetic analyses, programs to call SNPs from
next-generation sequencing data have proliferated rapidly (e.g., GATK
(McKenna et al., 2010), SAMtools (Danecek et al., 2021), Stacks
(Rochette et al., 2019), dDocent (Puritz et al., 2014), ipyrad (Eaton &
Overcast, 2020)). While these
programs are optimized to call SNPs rapidly and accurately, called SNPs
will inevitably suffer from a variety of technical issues such as
sequencing error, paralogous assembly, and missing data, all of which
need to be addressed before performing downstream analyses (O’Leary et
al., 2018). Furthermore, individual samples may suffer from low
sequencing coverage or contamination, preventing their confident use in
downstream analyses and necessitating their removal from the dataset (Cerca
et al., 2021). To address these issues, some SNP calling programs
contain built-in functionality for filtering output SNP datasets (e.g., GATK and Stacks), but these filtering options often leave
much to be desired for investigators hoping to perform thorough
explorations of parameter space and deeply understand the nuances of
their particular SNP dataset. This limited functionality results in a
gap in many bioinformatic pipelines, especially for SNP datasets
generated via reduced-representation genomic sequencing where de novo
assembly and rampant missing data often necessitate careful filtering in
order to maximize retained signal while minimizing systematic error
(O’Leary et al., 2018).
Currently, this bioinformatic gap is often addressed via informal data
visualizations implemented across multiple programs and programming
languages, as investigators attempt to confirm that technical issues and
missing data are not influencing downstream inferences, before choosing
a final set of SNP filtering parameters. As reproducibility is becoming
more widely acknowledged as critical for the future of science,
effectively documenting this often iterative process of investigating,
visualizing, and cleaning large datasets continues to be a major
challenge. Current state-of-the-art programs for filtering SNP datasets,
such as GATK and VCFtools (Danecek et al., 2011), are
highly efficient and parallelizable, making them especially useful for
massive datasets, but their command-line interfaces do not lend
themselves easily to graphical visualization, leading many investigators
to rely on homebrewed scripts for preliminary data investigation. A
common homebrewed approach involves generating summary statistic files
using a command-line program or Unix scripting, then investigating and
visualizing those files with the renowned data visualization tools of
the R computing language (R Core Team, 2019), a pattern sketched at the
end of this paragraph. This approach requires moving
between scripting languages and writing custom code to perform
visualizations, creating a steep learning curve for inexperienced users
and resulting in pipelines that may be error-prone and difficult to
document. Based on the ubiquity of this approach, there is clearly an
outstanding need within the field of evolutionary genomics for
user-friendly software that automates and streamlines the process of
investigating, visualizing, and filtering SNP datasets.
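As a concrete illustration of this fragmented approach, a minimal sketch of the two-step pattern follows. The --missing-indv option and its .imiss output (including the F_MISS column) are standard VCFtools behavior, but the file names used here are hypothetical:

    # Step 1 (at the shell, outside of R): summarize per-sample missingness
    #   vcftools --vcf input.vcf --missing-indv --out stats
    # This writes 'stats.imiss', a tab-delimited table with one row per
    # sample, including the proportion of missing genotypes (F_MISS).

    # Step 2 (in R): read the summary file and visualize it
    imiss <- read.table("stats.imiss", header = TRUE)
    hist(imiss$F_MISS,
         main = "Missing genotypes per sample",
         xlab = "Proportion missing (F_MISS)")

Even this simple diagnostic requires two languages and an intermediate file, and each additional summary statistic adds another round trip between the shell and R.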
The R package SNPfiltR is designed to fill this bioinformatic gap
with a suite of custom functions for investigating, visualizing, and
filtering reduced-representation SNP datasets within a
coherent R-based framework. As input, these functions take SNP datasets
stored as standard vcf files and read into an R working environment as
'vcfR' objects, a step that can be performed in a single command using
the function read.vcfR() from the R package vcfR (Knaus & Grünwald,
2017). SNPfiltR functions can then generate automated visualizations of
key parameters such as depth, quality, and missing data, allowing users
to specify appropriate filtering thresholds based on the nuances of
their particular dataset and to perform rapid filtering directly in R.
Functions from SNPfiltR can be used in
concert with import and export functions from vcfR to generate an
interactive, customizable SNP filtering pipeline, all within a single R
script. The package is publicly available on CRAN via the syntax
[install.packages("SNPfiltR")]; the development version is available
from GitHub (github.com/DevonDeRaad/SNPfiltR); and the entire package is
thoroughly documented, including descriptions and examples for each
function and multiple comprehensive vignettes, at the website
devonderaad.github.io/SNPfiltR/.
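As a minimal sketch of such a pipeline (the file paths and filtering thresholds shown here are hypothetical and purely illustrative; the filtering functions follow the usage documented in the package vignettes):

    library(vcfR)
    library(SNPfiltR)

    # Read a vcf file into R as a vcfR object in a single step
    vcf <- read.vcfR("populations.snps.vcf")

    # Visualize per-genotype depth and quality, then apply hard filters
    # (thresholds are illustrative, not recommendations)
    vcf <- hard_filter(vcfR = vcf, depth = 5, gq = 30)

    # Visualize per-SNP missing data and retain only SNPs genotyped in
    # at least 80% of samples
    vcf <- missing_by_snp(vcfR = vcf, cutoff = 0.8)

    # Export the filtered dataset back to a standard vcf file using vcfR
    write.vcf(vcf, file = "filtered.vcf.gz")

Because each step operates on and returns a vcfR object, intermediate datasets can be inspected, re-filtered under alternative thresholds, and documented within a single reproducible R script.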