Performance benchmarking
We compared runtimes between using VCFtools to filter a series of
vcf files according to simple genotype quality thresholds (minimum depth
= 5, minimum genotype quality = 30), and using the SNPfiltR function hard_filter() to perform the same filtering protocol on the
same input files. Each vcf file contained between 10K and 500K SNPs for
100 samples, and we benchmarked SNPfiltR separately under a
scenario where the vcf file had already been read into the local memory
of the R working environment, and a scenario where the vcf file was
required to be read from disk before filtering. Across three replicates
of each iteration, we found that when the vcf file had already been
stored as a vcfR object in the R working environment, the SNPfiltR function hard_filter() performed filtering and returned
a filtered object, on average, more rapidly than VCFtools
performed the identical filtering (Fig. 1). Conversely, if the
amount of time taken to read the vcf file into local memory as a vcfR
object before filtering is counted against SNPfiltR, then this
approach takes consistently longer than performing the identical
filtering operation using VCFtools. This additional step of
reading the vcf file into R as a vcfR object appears to increase the
slope, rather than the intercept, of the runtime line (Fig. 1), indicating that
this step scales poorly as the number of SNPs in the input vcf file
increases, compared to the filtering process itself, whether executed
using SNPfiltR or VCFtools.
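
The slope-versus-intercept argument can be made concrete with a small sketch. If runtime is modelled as a straight line in the number of SNPs, a fixed start-up overhead would raise only the intercept, whereas a cost paid per SNP (such as parsing each record while reading the file) raises the slope. The Python below is an illustration under that linear assumption, not the analysis behind Fig. 1; the helper name fit_line and all numeric values are hypothetical placeholders, not measured runtimes.

```python
# Sketch: separate fixed overhead (intercept) from per-SNP cost (slope)
# by fitting runtime = intercept + slope * n_snps with ordinary least squares.

def fit_line(n_snps, seconds):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(n_snps)
    if n < 2 or n != len(seconds):
        raise ValueError("need at least two paired observations")
    mean_x = sum(n_snps) / n
    mean_y = sum(seconds) / n
    sxx = sum((x - mean_x) ** 2 for x in n_snps)
    if sxx == 0:
        raise ValueError("n_snps values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(n_snps, seconds))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Placeholder inputs, NOT measurements from this study: runtimes are generated
# from made-up linear costs purely to show how the two scenarios would differ.
sizes = [10_000, 50_000, 100_000, 250_000, 500_000]
filter_only = [0.5 + 2e-5 * n for n in sizes]                # hypothetical filtering cost
read_then_filter = [0.5 + (2e-5 + 6e-5) * n for n in sizes]  # adds a hypothetical per-SNP read cost

for label, times in [("filter only", filter_only), ("read + filter", read_then_filter)]:
    intercept, slope = fit_line(sizes, times)
    print(f"{label}: intercept={intercept:.2f} s, slope={slope * 1e6:.1f} s per million SNPs")
```

With real benchmark data, the same fit applied to the mean runtime per input size would show whether the read step mainly shifts the intercept or steepens the slope.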