Novel functions for visualizing and filtering SNP datasets in R
The SNPfiltR package relies on the import and export functions of the vcfR package to efficiently read vcf files into the local memory of an R working environment as vcfR objects, and to write vcfR objects to disk as gzipped vcf files. Once a vcf file has been read into the R working environment as a vcfR object, it is immediately available in the proper input format for all SNPfiltR functions. Each SNPfiltR function can be run without specified thresholds or cutoffs (e.g., hard_filter(vcfR=vcfR.object)) to visualize the parameter space that would be filtered, without performing any filtering. This allows users to quickly make informed decisions based on patterns specific to their datasets before implementing their chosen filtering thresholds (e.g., hard_filter(vcfR=vcfR.object, depth=5, gq=30)). SNPfiltR contains a suite of commonly implemented filters for genomic datasets, including filtering based on genotype quality, minimum and maximum read depth, allele balance, number of alleles present, missing data per sample, missing data per SNP, minor allele count, and physical linkage. While most of these filters can be implemented in other programs (e.g., VCFtools and GATK), SNPfiltR is the first program offering dedicated functions for a comprehensive suite of SNP visualization and filtering options. Each SNP filtering function can be implemented or skipped at the discretion of the user, making it possible to build an interactive SNP filtering pipeline customized to the specific needs of a given genomic dataset.
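The two-step pattern described above (inspect first, then filter) can be illustrated with a minimal base-R sketch. It is not the SNPfiltR implementation: toy_hard_filter() and its gq_min argument are hypothetical names, the genotype, depth, and quality values are invented, and the sketch assumes that a genotype failing a threshold is recoded as missing. Where the package functions draw plots, this toy version returns numeric summaries.

```r
# Toy genotype-level data: rows are SNPs, columns are samples (all values invented).
dn <- list(c("snp1", "snp2"), c("A", "B", "C"))
gt <- matrix(c("0/0", "0/1", "1/1",
               "0/1", "0/0", "0/1"), nrow = 2, byrow = TRUE, dimnames = dn)
dp <- matrix(c(12,  3,  8,
                6, 20,  4), nrow = 2, byrow = TRUE, dimnames = dn)  # read depth
gq <- matrix(c(40, 35, 20,
               50, 60, 99), nrow = 2, byrow = TRUE, dimnames = dn)  # genotype quality

# Hypothetical helper mimicking the "no thresholds = inspect only" behaviour.
toy_hard_filter <- function(gt, dp, gq, depth = NULL, gq_min = NULL) {
  if (is.null(depth) && is.null(gq_min)) {
    # No thresholds supplied: summarise the parameter space, filter nothing.
    return(list(depth_summary = summary(as.vector(dp)),
                gq_summary    = summary(as.vector(gq))))
  }
  fail <- matrix(FALSE, nrow(gt), ncol(gt))
  if (!is.null(depth))  fail <- fail | dp < depth
  if (!is.null(gq_min)) fail <- fail | gq < gq_min
  gt[fail] <- NA  # assumption: failing genotypes are recoded as missing
  gt
}

# Step 1: inspect the distributions before choosing thresholds.
overview <- toy_hard_filter(gt, dp, gq)

# Step 2: apply the chosen thresholds.
# snp1/B (depth 3), snp1/C (GQ 20), and snp2/C (depth 4) become NA.
filtered <- toy_hard_filter(gt, dp, gq, depth = 5, gq_min = 30)
```

With real data, the matrices would be replaced by a vcfR object and the helper by the corresponding SNPfiltR call.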
Beyond simply filtering, I also developed functions to automate the process of investigating the effects of missing data on a SNP dataset. The SNPfiltR functions assess_missing_data_pca() and assess_missing_data_tsne() are designed to perform dimensionality reduction on high-dimensional SNP datasets, using principal components analysis (PCA) via the R package adegenet (Jombart, 2008) and t-distributed stochastic neighbor embedding (t-SNE) via the R package Rtsne (Krijthe & van der Maaten, 2015), respectively. Each of these functions then visualizes the similarity between input samples in two-dimensional space, across user-specified thresholds for missing data per SNP. By setting clustering = TRUE, users who wish to assess the effect of missing data on objective sample clustering assignments can also perform unsupervised clustering, which assigns samples to groups without a priori information using Partitioning Around Medoids (PAM), implemented internally via the R package cluster (Maechler et al., 2018). Finally, each of these functions generates an additional visualization of sample similarity in two-dimensional space with samples color-coded by missing data proportion, allowing the user to visually assess whether missing data is driving patterns of sample clustering. These investigative functions can be used in tandem with the functions missing_by_snp() and missing_by_sample() to ensure that user-specified missing data thresholds, both per sample and per SNP, are sufficient to prevent missing data from driving patterns of sample clustering in a given dataset before performing downstream population genetic or phylogenetic analyses.
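The logic of comparing sample similarity across per-SNP missing data thresholds can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: it uses stats::prcomp() in place of adegenet, simulated genotypes, a hypothetical helper pca_at_threshold(), a threshold defined here as the minimum proportion of samples with a called genotype at a SNP, and mean imputation of remaining missing genotypes as a simplification. PAM clustering is omitted.

```r
set.seed(42)
n_samples <- 12
n_snps    <- 200

# Simulated genotypes coded as alternate-allele counts (0, 1, 2).
geno <- matrix(rbinom(n_samples * n_snps, size = 2, prob = 0.3),
               nrow = n_samples,
               dimnames = list(paste0("sample", seq_len(n_samples)), NULL))

# Inject missing genotypes into SNPs 21-200 only, so 20 SNPs stay fully complete.
candidates <- which(col(geno) > 20)
geno[sample(candidates, size = 500)] <- NA

# Proportion of missing genotypes per sample, used to colour the plots.
sample_missing <- rowMeans(is.na(geno))

# Hypothetical helper: keep SNPs meeting a completeness threshold, then run PCA.
pca_at_threshold <- function(geno, min_complete) {
  keep <- colMeans(!is.na(geno)) >= min_complete
  sub  <- geno[, keep, drop = FALSE]
  storage.mode(sub) <- "double"
  # Simplification: replace remaining missing genotypes with the SNP mean.
  for (j in seq_len(ncol(sub))) {
    miss <- is.na(sub[, j])
    sub[miss, j] <- mean(sub[, j], na.rm = TRUE)
  }
  list(n_snps = sum(keep),
       pcs    = prcomp(sub, center = TRUE, scale. = FALSE)$x[, 1:2])
}

thresholds <- c(0.5, 0.7, 0.9)
results <- lapply(thresholds, function(t) pca_at_threshold(geno, t))
names(results) <- paste0("complete_", thresholds)

# Darker points have more missing data; plotted only in interactive sessions.
if (interactive()) {
  shade <- gray(0.9 * (1 - sample_missing / max(sample_missing)))
  for (nm in names(results)) {
    plot(results[[nm]]$pcs, pch = 19, col = shade,
         main = paste0(nm, " (", results[[nm]]$n_snps, " SNPs)"))
  }
}
```

If samples with the most missing data separate from the rest at lenient thresholds but not at strict ones, that would suggest missing data is contributing to the apparent structure.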