Demonstrating the use of SCNIC
We demonstrate the use of SCNIC with two example datasets: 1) a study that used 16S rRNA sequencing of fecal material to compare microbiome composition in individuals with and without HIV and in men who have sex with men (MSM) who were at a high risk of contracting HIV [43], and 2) a dataset analyzing the microbiome of water samples at various depths in two of the Great Lakes. We chose these two datasets so that we could evaluate performance on both host-associated and free-living microbiomes. We also used the Great Lakes dataset to compare module size and modularity between SMD- and LMM-selected modules.
HIV dataset
The HIV dataset was retrieved from NCBI SRA accession number SRP068240, and samples from the BCN0 cohort were used for these analyses. Reads were error corrected, quality trimmed, and primers were removed using default parameters in BBTools [44]. DADA2 [45] was used to define amplicon sequence variants (ASVs) with reads trimmed from the left by 30 base pairs and truncated at 269 base pairs. ASVs were binned into operational taxonomic units (OTUs) at 99% identity using USEARCH [46] in QIIME 1 [47]. A phylogenetic tree was built in QIIME 2 [34] from a single representative sequence per OTU using the SEPP protocol [48, 49]. We evaluated the average phylogenetic distance between OTUs in the same module using the distance method of Biopython [50, 51]. Taxonomy was assigned using the Naive Bayes QIIME 2 feature classifier, version gg-13-8-99-515-806-nb-classifier.qza.
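As an illustration of the within-module phylogenetic distance described above, the following is a minimal sketch that averages the pairwise branch-length (patristic) distance between members of a module. It is a standard-library stand-in for the tree distance computation, not the code used in this study; the toy tree, the tip names, and the function names are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy tree stored as child -> (parent, branch length to parent).
# Equivalent Newick string: ((A:1,B:2)X:1,C:3)root;
PARENT = {
    "A": ("X", 1.0),
    "B": ("X", 2.0),
    "X": ("root", 1.0),
    "C": ("root", 3.0),
}


def path_to_root(node):
    """Map each ancestor of `node` (and the node itself) to its distance from `node`."""
    distance = 0.0
    path = {node: 0.0}
    while node in PARENT:
        parent, length = PARENT[node]
        distance += length
        path[parent] = distance
        node = parent
    return path


def patristic_distance(a, b):
    """Sum of branch lengths on the path between tips `a` and `b`."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    # With non-negative branch lengths, the lowest common ancestor is the
    # shared ancestor that minimises the summed distance to both tips.
    return min(path_a[n] + path_b[n] for n in path_a if n in path_b)


def mean_module_distance(members):
    """Average pairwise patristic distance among the members of one module."""
    pairs = list(combinations(members, 2))
    return sum(patristic_distance(a, b) for a, b in pairs) / len(pairs)


ab_distance = patristic_distance("A", "B")
module_mean = mean_module_distance(["A", "B", "C"])
```

A smaller mean indicates that the members of a module are more closely related to one another on the tree.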
The original study describing these data showed a strong divergence in gut microbiome composition in MSM compared to non-MSM, independent of HIV infection status, and more subtle differences associated with HIV infection when controlling for MSM behavior. The goal of our analysis was to evaluate whether comparing gut microbiome composition between HIV-negative MSM and non-MSM using SCNIC modules identifies additional significantly different taxa compared with the same analysis without modules, and whether it provides additional insight into which of the taxa that differ with MSM are also correlated with one another. Co-correlation of microbes may indicate that they are part of a broader community type, interact with each other, or share environmental drivers of their prevalence. A further goal of this analysis was to examine the effect of different R-value thresholds on the results. Specifically, the SMD method was run with SparCC R-value thresholds from 0.20 to 1.0 in increments of 0.05.
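The threshold sweep described above can be sketched as follows. This is only an illustration of the grid of R-value thresholds and of how a stricter threshold retains fewer edges; the correlation values are invented, and keeping an edge when r is greater than or equal to the threshold is an assumption rather than a statement about SCNIC's internal comparison.

```python
# Hypothetical pairwise correlations between features (not real SparCC output).
correlations = {
    ("otu1", "otu2"): 0.82,
    ("otu1", "otu3"): 0.41,
    ("otu2", "otu3"): 0.36,
    ("otu3", "otu4"): 0.22,
    ("otu1", "otu4"): -0.10,
}

# 0.20, 0.25, ..., 1.00; built from integers to avoid floating-point drift.
thresholds = [step / 100 for step in range(20, 101, 5)]


def edges_at(min_r):
    """Feature pairs whose correlation meets the threshold (assumed rule: r >= min_r)."""
    return [pair for pair, r in correlations.items() if r >= min_r]


# Number of network edges retained at each threshold in the sweep.
edge_counts = {t: len(edges_at(t)) for t in thresholds}
```

In this toy example the number of retained edges never increases as the threshold rises, which is why the choice of threshold directly affects how many modules can be formed and how large they are.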
Great Lakes dataset
The Great Lakes dataset was previously published as part of the Earth Microbiome Project [52]. This study evaluated patterns of microbial relative abundance across depths in Lake Michigan (N=16) and Lake Superior (N=33), with depth of samples collected ranging from 5 to 3654 meters. The study additionally recorded data on pH and salinity. The Great Lakes dataset was retrieved from QIITA accession number 1041 [53]. ASVs were defined using DADA2 with a left trim of 30 base pairs and a truncation length of 135 base pairs. OTUs were subsequently picked on the ASVs using VSEARCH [54] with a 99% identity threshold, resulting in 3,871 OTUs. These steps were performed in QIIME 2 [34]. SCNIC was applied with the SMD method at R-value thresholds of 0.2, 0.4, and 0.65.
Comparison of SMD to LMM using the Great Lakes dataset
To identify differences in module structure between SMD and LMM partitions, we assessed the module size and modularity of 221 separately partitioned networks from the Great Lakes dataset, generated with varying SCNIC parameters. The parameters included SCNIC R-value thresholds ranging from 0.1 to 0.7 and, for LMM, gamma values ranging from 0.15 to 0.9.
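To make the role of gamma concrete, the sketch below computes modularity with a resolution parameter on a toy network. It assumes that gamma enters as in the standard resolution-parameter form of modularity, Q = sum over communities c of [L_c / m - gamma * (k_c / 2m)^2], where m is the total edge weight, L_c the edge weight inside community c, and k_c the summed degree of its nodes. The toy network, the community labels, and the function name are hypothetical, and this is not SCNIC's implementation.

```python
def modularity(edges, community, gamma=1.0):
    """Resolution-parameter modularity of an undirected weighted network.

    edges: list of (u, v, weight) tuples without self-loops.
    community: dict mapping each node to its community label.
    """
    m = sum(w for _, _, w in edges)  # total edge weight
    degree = {}    # weighted degree of each node
    internal = {}  # edge weight falling inside each community (L_c)
    for u, v, w in edges:
        degree[u] = degree.get(u, 0.0) + w
        degree[v] = degree.get(v, 0.0) + w
        if community[u] == community[v]:
            label = community[u]
            internal[label] = internal.get(label, 0.0) + w
    total = {}     # summed degree of each community (k_c)
    for node, k in degree.items():
        label = community[node]
        total[label] = total.get(label, 0.0) + k
    return sum(
        internal.get(label, 0.0) / m - gamma * (k_c / (2.0 * m)) ** 2
        for label, k_c in total.items()
    )


# Toy network: two triangles joined by a single bridge edge (c-d).
edges = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 1.0),
]
two_modules = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_module = {node: 0 for node in two_modules}

q_split_default = modularity(edges, two_modules, gamma=1.0)
q_merged_default = modularity(edges, one_module, gamma=1.0)
q_split_low = modularity(edges, two_modules, gamma=0.15)
q_merged_low = modularity(edges, one_module, gamma=0.15)
```

In this toy case the two-triangle split scores higher at gamma = 1.0, whereas the single merged module scores higher at gamma = 0.15, which illustrates why lower gamma values tend to yield fewer, larger modules.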