Quality filtering for a scrub-jay RADseq SNP dataset
To provide an example of using SNPfiltR and vcfR to build a
comprehensive, R-based SNP filtering pipeline, I began by reading an
unfiltered vcf file into my R working environment using the functionread.vcfR() from the vcfR package, resulting in a vcfR
object containing 210,336 SNPs for 115 samples,derived from RAD
sequencing of samples from throughout the range of the scrub-jays. Using
the SNPfiltR function hard_filter() to filter genotypes
to minimum thresholds of a depth of 5 reads, and a genotype quality of
30 resulted in the removal of 32.92% and 2.01% of called genotypes,
respectively (Fig. 2). Filtering to retain only bi-allelic loci using
the function filter_biallelic() , removed no SNPs, as there were
no SNPs with more than two alleles. An additional quality filter offered
by SNPfiltR is the function filter_allele_balance() ,
which implements a quality filter based on allele balance, or the ratio
of reads for each allele in called heterozygous genotypes. This function
follows the recommendations of Ddocent (Puritz et al., 2014), by
removing called heterozygous genotypes where the allele balance falls
outside of the range .25-.75, and resulted in the removal of 7.56% of
heterozygous genotypes, or .39% of all called genotypes, in our
scrub-jay SNP dataset. Next, I implemented a maximum depth filter, which
is crucial for RAD datasets, where paralogously assembled loci (which
will display outlier sequencing depth) can result in problematic and
misleading downstream inferences (O’Leary et al., 2018). Because depth
of coverage can vary widely based on project design, understanding the
distribution of read depths across all called SNPs is essential for
making an informed decision about a maximum depth cutoff. I used theSNPfiltR function max_depth() to visualize the
distribution of mean read depth per sample for all called SNPs, and then
set a maximum mean depth cutoff of 100 reads, resulting in 12.85% of
all SNPs being removed from the dataset. As a final quality filtering
step, I implemented a minor allele count filter using theSNPfiltR function min_mac() in order to remove invariant
SNPs resulting from genotype-based quality filtering steps, which may
have resulted in all minor allele genotype calls being converted to ‘NA’
for individual SNPs. This filter requiring one minor allele in each SNP
removed of 55.87% of SNPs, resulting in 80,885 quality filtered SNPs
remaining in the filtered vcfR object for further investigation (Fig.
2).