Performance benchmarking
I compared runtimes for filtering a series of vcf files by simple genotype quality thresholds (minimum depth = 5, minimum genotype quality = 30) using VCFtools against runtimes for the same filtering protocol on the same input files using the SNPfiltR function hard_filter(). Each vcf file contained between 10K and 500K SNPs for 100 samples. I benchmarked SNPfiltR under two scenarios: one in which the vcf file had already been read into the local memory of the R working environment, and one in which the vcf file had to be read from disk before filtering. Across three replicates of each iteration, when the vcf file was already stored as a vcfR object in the R working environment, hard_filter() performed filtering and returned a filtered object more rapidly, on average, than VCFtools performing the identical filtering (Fig. 1). Conversely, if the time taken to read the vcf file into local memory as a vcfR object is counted against SNPfiltR, this approach takes consistently longer than the identical filtering operation in VCFtools. The additional step of reading the vcf file into R as a vcfR object appears to increase the slope, rather than the intercept, of the line (Fig. 1), indicating that this step scales poorly with the number of SNPs in the input vcf file compared to the filtering process itself, whether executed using SNPfiltR or VCFtools.
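The distinction between a change in slope and a change in intercept can be illustrated with a minimal sketch in Python. All timings here are synthetic and are not the measurements shown in Fig. 1; the cost constants (FILTER_PER_SNP, READ_PER_SNP, FIXED_OVERHEAD) and the helper fit_line() are hypothetical names chosen only for illustration. The sketch assumes runtime is roughly linear in the number of SNPs, so that a constant start-up cost would raise only the fitted intercept, whereas a cost paid per SNP (such as reading every record from disk) raises the fitted slope.

```python
# Synthetic illustration only: every timing below is an assumed value,
# not a measurement from the benchmark summarized in Fig. 1.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Input sizes spanning the 10K to 500K SNP range described in the text.
snp_counts = [10_000, 50_000, 100_000, 250_000, 500_000]

# Hypothetical cost model (seconds); values are assumptions.
FILTER_PER_SNP = 2e-5   # per-SNP cost of the filtering step
READ_PER_SNP = 8e-5     # per-SNP cost of reading the file into memory
FIXED_OVERHEAD = 3.0    # a constant cost that does not depend on input size

filter_only = [FILTER_PER_SNP * n for n in snp_counts]
with_fixed_cost = [FIXED_OVERHEAD + t for t in filter_only]
with_read_cost = [READ_PER_SNP * n + t for n, t in zip(snp_counts, filter_only)]

base = fit_line(snp_counts, filter_only)
fixed = fit_line(snp_counts, with_fixed_cost)
read = fit_line(snp_counts, with_read_cost)

# A constant overhead shifts the intercept and leaves the slope unchanged;
# a per-SNP overhead steepens the slope and leaves the intercept unchanged.
for label, (intercept, slope) in [
    ("filter only", base),
    ("plus fixed cost", fixed),
    ("plus per-SNP read cost", read),
]:
    print(f"{label}: intercept={intercept:.3f} s, slope={slope:.2e} s/SNP")
```

Under this assumed model, a steeper line with an unchanged intercept is the signature of a step whose cost grows with input size, which is the pattern attributed above to reading the vcf file into R.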