Missing data filtering for a scrub-jay RADseq SNP dataset
I then used the dedicated visualization tools offered by SNPfiltR to investigate patterns of missing data by individual sample and by SNP for this quality-filtered scrub-jay SNP dataset (Fig. 3). The function missing_by_sample() reveals that missing data is distributed relatively equally across a priori identified species groups, and that with all 115 samples included, hardly any SNPs reach a 90% completeness threshold. A visualization of the proportion of missing genotype calls in each sample shows that samples vary along a relatively continuous distribution, from missing less than 20% of genotype calls to missing nearly 100% of genotype calls. Using the missing_by_sample() function, I set a cutoff of 81% missing genotypes per sample, which dropped 20 samples from the dataset (Fig. 3). Because SNPs may have become invariant if all minor allele genotypes were removed when these samples were dropped, I again implemented a minor allele count filter, requiring at least one minor allele genotype per SNP, to remove invariant sites, which dropped 0.61% of the remaining SNPs.
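The logic of the per-sample cutoff and the follow-up minor allele count filter described above can be sketched in a few lines. This is a minimal Python illustration only, not SNPfiltR's implementation (SNPfiltR is an R package that operates on vcfR objects). The function names, the toy genotype matrix, the genotype encoding (0, 1, or 2 copies of the alternate allele, None for a missing call), and the treatment of samples exactly at the cutoff are all assumptions made for the example.

```python
from typing import Optional

# Assumed encoding: 0, 1, or 2 alternate-allele copies; None = missing call.
Genotype = Optional[int]
Matrix = list[list[Genotype]]  # rows = SNPs, columns = samples


def missing_per_sample(matrix: Matrix) -> list[float]:
    """Proportion of missing genotype calls in each sample (column)."""
    n_snps = len(matrix)
    n_samples = len(matrix[0])
    return [
        sum(row[j] is None for row in matrix) / n_snps
        for j in range(n_samples)
    ]


def drop_samples(matrix: Matrix, cutoff: float) -> tuple[Matrix, list[int]]:
    """Keep samples whose missing proportion is at or below `cutoff`.

    Keeping samples exactly at the cutoff is an assumption of this sketch.
    Returns the filtered matrix and the indices of the retained samples.
    """
    missing = missing_per_sample(matrix)
    keep = [j for j, m in enumerate(missing) if m <= cutoff]
    return [[row[j] for j in keep] for row in matrix], keep


def minor_allele_count(row: list[Genotype]) -> int:
    """Count copies of the rarer allele among called diploid genotypes."""
    called = [g for g in row if g is not None]
    alt = sum(called)
    ref = 2 * len(called) - alt
    return min(alt, ref)


def drop_invariant(matrix: Matrix, min_mac: int = 1) -> Matrix:
    """Remove SNPs with fewer than `min_mac` minor allele copies."""
    return [row for row in matrix if minor_allele_count(row) >= min_mac]


# Toy data: 4 SNPs x 3 samples; the third sample is missing most calls.
toy = [
    [0, 1, None],
    [0, 0, None],
    [2, 2, None],
    [0, None, 1],
]
filtered, kept_samples = drop_samples(toy, cutoff=0.5)
variable_only = drop_invariant(filtered, min_mac=1)
```

In the toy data, the last SNP carries its only minor allele in the sample that gets dropped, so it becomes invariant after sample filtering and is removed by the minor allele count step. This is the situation that motivates re-running that filter.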
I then used the SNPfiltR function missing_by_snp() to visualize the proportion of missing data in each sample across a reasonable set of potential per-SNP completeness thresholds (Fig. 3). This visualization shows a continuous distribution of missing data within retained samples and no visible outlier samples, indicating that problematic samples were successfully dropped from the dataset. Dotplots show a strong negative correlation between total proportion missing data and the total number of SNPs retained in the dataset, across potential per-SNP filtering thresholds. I chose to implement a per-SNP completeness cutoff of 85% using the function missing_by_snp(), resulting in a final quality- and missing-data-filtered SNP dataset containing 95 samples, 16,307 SNPs, and 5.7% total missing genotypes (Fig. 3).
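The trade-off explored above, where each candidate per-SNP completeness threshold yields a different number of retained SNPs and a different overall proportion of missing genotypes, can be illustrated with a short sketch. This is a Python illustration of the arithmetic only, not SNPfiltR's R code. The function names, the toy matrix, and the choice to keep SNPs exactly at the cutoff are assumptions.

```python
from typing import Optional

# Assumed encoding: 0, 1, or 2 alternate-allele copies; None = missing call.
Genotype = Optional[int]
Matrix = list[list[Genotype]]  # rows = SNPs, columns = samples


def snp_completeness(row: list[Genotype]) -> float:
    """Proportion of samples with a called genotype at one SNP."""
    return sum(g is not None for g in row) / len(row)


def filter_by_completeness(matrix: Matrix, cutoff: float) -> Matrix:
    """Keep SNPs whose completeness is at or above `cutoff` (assumed inclusive)."""
    return [row for row in matrix if snp_completeness(row) >= cutoff]


def total_missing(matrix: Matrix) -> float:
    """Proportion of all genotype calls in the matrix that are missing."""
    calls = [g for row in matrix for g in row]
    if not calls:
        return 0.0  # no SNPs retained, so nothing is missing
    return sum(g is None for g in calls) / len(calls)


def summarize_thresholds(
    matrix: Matrix, cutoffs: list[float]
) -> list[tuple[float, int, float]]:
    """For each cutoff, report (cutoff, SNPs retained, total missing proportion)."""
    summary = []
    for cutoff in cutoffs:
        kept = filter_by_completeness(matrix, cutoff)
        summary.append((cutoff, len(kept), total_missing(kept)))
    return summary


# Toy data: 4 SNPs x 4 samples with completeness 1.0, 0.75, 0.5, and 0.75.
toy = [
    [0, 1, 0, 2],
    [0, None, 1, 0],
    [None, None, 1, 0],
    [1, 0, 0, None],
]
for cutoff, n_snps, missing in summarize_thresholds(toy, [0.5, 0.75, 1.0]):
    print(f"cutoff={cutoff:.2f}  SNPs retained={n_snps}  total missing={missing:.3f}")
```

Tabulating both quantities at every candidate cutoff, as the sketch does, makes the cost of each threshold explicit before one is committed to.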
To ensure that the implemented 85% per-SNP completeness threshold effectively prevents patterns of missing data within individuals from driving overall clustering patterns, I then used the function assess_missing_data_pca() to visualize sample clustering at the 75% and 85% per-SNP completeness thresholds (Fig. 4). At both thresholds, all samples visually cluster according to a priori assignment to species groups. When samples are colored according to proportion of missing data, it becomes evident that within species groups, samples with the most missing data are clustered the least tightly, indicating increased uncertainty in assignment. Of the two thresholds, the more restrictive 85% threshold slightly reduces the effect of missing data in these most loosely assigned samples (Fig. 4). Sample clustering using t-SNE reveals additional population substructure within species groups and shows no indication that missing data is driving patterns of clustering either between or within groups (Fig. 4). A final filter for physical linkage, using the SNPfiltR function distance_thin() to remove all SNPs separated by less than 500 base pairs, resulted in a quality- and missing-data-filtered, unlinked SNP dataset of 2,803 SNPs ready for input into downstream analyses.
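The physical-linkage filter described above amounts to thinning SNPs by distance along each chromosome. One simple way to do this is a greedy pass that keeps the first SNP on each chromosome and then keeps each subsequent SNP only if it lies at least the minimum distance from the last retained one. The Python sketch below shows that greedy approach under stated assumptions: the function name thin_by_distance, the (chromosome, position) tuple representation, and the toy coordinates are hypothetical, and the sketch is not a reproduction of how distance_thin() is implemented in SNPfiltR.

```python
def thin_by_distance(
    snps: list[tuple[str, int]], min_distance: int
) -> list[tuple[str, int]]:
    """Greedily thin SNPs so retained neighbors on a chromosome are >= min_distance apart.

    `snps` holds (chromosome, position) pairs; input order does not matter
    because the pairs are sorted by chromosome and then position.
    """
    kept: list[tuple[str, int]] = []
    last_kept: dict[str, int] = {}  # most recent retained position per chromosome
    for chrom, pos in sorted(snps):
        prev = last_kept.get(chrom)
        if prev is None or pos - prev >= min_distance:
            kept.append((chrom, pos))
            last_kept[chrom] = pos
    return kept


# Toy coordinates (hypothetical): two chromosomes with tightly spaced SNPs.
toy_snps = [
    ("chr1", 100), ("chr1", 350), ("chr1", 600), ("chr1", 1200),
    ("chr2", 50), ("chr2", 549), ("chr2", 550),
]
thinned = thin_by_distance(toy_snps, min_distance=500)
```

Distances are measured from the last retained SNP rather than from the last SNP seen, so a run of closely spaced sites collapses to a single representative instead of being removed entirely.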