Discussion
Historically, programs designed for performing computationally intensive
bioinformatic processes have rarely been implemented in the R language
because the requirement that datasets be read into local memory can
cause computational bottlenecks with large input file sizes. Here I
showed that the R package SNPfiltR can be used to filter moderately
sized reduced-representation SNP datasets with runtimes comparable to
state-of-the-art programs implemented in highly efficient languages such
as Perl and C++. While benchmarking confirmed that reading large files
into the local memory of an R working environment scales poorly with
increasing input file size, the vcfR and SNPfiltR packages
can be used in tandem to read and quality filter a SNP dataset
containing 50M genotypes and associated quality information in less than
two minutes on a personal laptop. A SNP dataset of this size (50M
genotypes, i.e., 500K SNPs genotyped in 100 samples) is realistic for a
set of unfiltered SNP calls resulting from a moderately sized to large
reduced-representation genomic sequencing project, indicating that the
computational power of the R language has been generally overlooked for
the purposes of processing and filtering reduced-representation genomic
SNP datasets. SNPfiltR takes advantage of this previously
overlooked computational power, and unlike existing programs designed
for SNP filtering, harnesses the widely commended data visualization
capabilities of R, allowing users to design interactive and
customizable SNP filtering pipelines within a single R script.
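As a rough illustration of how such a runtime could be measured, the sketch below wraps the reading and filtering steps in base R's system.time(). The file name and the filtering thresholds are hypothetical placeholders, not the inputs or settings used in the benchmarking reported here, and the vcfR and SNPfiltR packages are assumed to be installed.

```r
# Minimal timing sketch; not the benchmarking code used in this study.
library(vcfR)
library(SNPfiltR)

# Hypothetical input file: substitute the path to an unfiltered vcf file
vcf_path <- "unfiltered_snps.vcf.gz"

elapsed <- system.time({
  # Read the whole file into memory as a vcfR object
  vcf <- read.vcfR(vcf_path, verbose = FALSE)
  # Set genotypes below illustrative depth and quality thresholds to missing
  vcf <- hard_filter(vcfR = vcf, depth = 5, gq = 30)
  # Keep SNPs meeting an illustrative per-SNP completeness cutoff
  vcf <- missing_by_snp(vcfR = vcf, cutoff = 0.8)
})

# The gt slot holds one row per SNP; its first column is FORMAT,
# followed by one column per sample
n_snps <- nrow(vcf@gt)
n_samples <- ncol(vcf@gt) - 1
cat("Genotypes retained:", n_snps * n_samples, "\n")
print(elapsed)
```

Because system.time() evaluates its expression in the calling environment, the filtered vcf object remains available for further steps after the timing call returns.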
While many existing R packages are capable of working with SNP data, no
existing R package contains functions for automated visualization and
filtering of SNP data comparable to those offered by SNPfiltR. A
few packages focus on directly reading and manipulating SNP data (e.g.,
vcfR (Knaus & Grünwald, 2017) and dartR (Gruber et al.,
2018)), but largely require custom scripting using R syntax if users
wish to filter and visualize their SNP datasets, leaving a need for
automated SNP visualization and filtering functions. SNPfiltR is
complementary to these packages, extending their functionalities with
modular functions that automate key visualization and filtering steps,
allowing the rapid generation of full SNP filtering pipelines in R.
Notably, functions from the SNPfiltR package rely on vcfR objects
as input, which can be directly read in from vcf files using the
function read.vcfR() from the vcfR package. For this
reason, I strongly recommend that users of the SNPfiltR package
also cite the vcfR package as part of their integrative SNP
filtering pipelines. A suite of additional R packages exists for
performing downstream phylogenetic and population genetic analyses on
high-quality SNP datasets (e.g., ape (Paradis & Schliep, 2019),
StAMPP (Pembleton et al., 2013), SNPRelate (Zheng et al.,
2012), adegenet (Jombart, 2008), SambaR (de Jong et al.,
2021), and introgress (Gompert & Buerkle, 2010)). SNPfiltR is
complementary to these packages as well, as each SNPfiltR function
returns a filtered vcfR object, which can be
easily converted into a myriad of object classes within R for further
analysis using any of these dedicated population genetic programs.
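The workflow described above, reading a vcf file, filtering it, and handing the result to downstream packages, might look something like the following sketch. The file names and cutoff values are hypothetical and chosen only for illustration; the conversion functions shown are from the vcfR package and assume that adegenet is also installed.

```r
# Illustrative sketch of a read -> filter -> convert workflow.
library(vcfR)
library(SNPfiltR)

# Hypothetical input file name
vcf <- read.vcfR("unfiltered_snps.vcf.gz", verbose = FALSE)

# Genotype-level filter with illustrative depth and quality thresholds
vcf <- hard_filter(vcfR = vcf, depth = 5, gq = 30)

# Without a cutoff, visualize missing data per sample;
# with a cutoff, drop samples exceeding that proportion of missing data
missing_by_sample(vcfR = vcf)
vcf <- missing_by_sample(vcfR = vcf, cutoff = 0.9)

# Convert the filtered vcfR object for downstream analysis
# (vcfR2genlight() omits non-biallelic loci with a warning)
gl <- vcfR2genlight(vcf)  # genlight object, e.g., for adegenet
gi <- vcfR2genind(vcf)    # genind object, e.g., for adegenet

# Optionally write the filtered data back to a gzipped vcf file
write.vcf(vcf, file = "filtered_snps.vcf.gz")
```

Running the visualization call before choosing a cutoff is what makes the pipeline interactive: the threshold can be adjusted after inspecting the plots and the filtering step rerun within the same script.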
It is widely accepted that the universe of elegant, open-source R-based
tools such as RStudio and R Markdown allows for exceptional interactivity
and reproducibility (Gandrud, 2018). Additionally, the performance
benchmarking results presented here indicate that the computational
power of the R programming language is sufficient for analyzing most
reduced-representation SNP datasets, even though this practice seems
relatively rare. The SNPfiltR package takes advantage of this
previously unrecognized opportunity and provides custom functions
designed to fully integrate the investigation, visualization, and
filtering of a SNP dataset into a single coherent R framework. The
filtering functions offered by SNPfiltR perform competitively
with current state-of-the-art SNP filtering programs on moderately sized
datasets, indicating that bioinformaticians ought to consider
implementing fully R-based pipelines for streamlining the often
complicated and iterative process of optimizing filtering parameters for
next-generation sequencing datasets. By extending the current
bioinformatic tools available in R for filtering SNP datasets, the
SNPfiltR package will allow users to spend less time
investigating and testing filtering parameters, and more time resolving
evolutionary mysteries with genomic data.