The SCNIC method
SCNIC takes as input a feature table containing counts of each feature in all samples and performs three steps: 1) a correlation network is built, 2) modules are detected in the network, and 3) feature counts within each module are summed into a single new feature, identified as “module-x”, where x is a whole number assigned consecutively starting at zero (Figure 1). Modules are numbered by size, so lower-numbered modules have more members than higher-numbered modules. No minimum or maximum constraint is placed on module size when modules are created. The newly generated module features are written to a new feature table alongside all features not grouped into a module. Because each module's counts are the sum of its members' counts, the total counts per sample are maintained, allowing for downstream analyses with tools that make assumptions about total sample counts. SCNIC produces three outputs (Figure 1): a graph modeling language (GML) format [35] file compatible with Cytoscape [36] for network visualization, in which the edges represent positive correlations stronger than a user-specified R-value cutoff (between 0 and 1); a file describing which features compose each module; and a feature table in the Biological Observation Matrix (BIOM) format [37].
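The summarization step can be illustrated with a minimal sketch. This is not SCNIC's code: the function names, the dictionary-based table layout and the feature identifiers are hypothetical, chosen only to show that summing module members leaves per-sample totals unchanged.

```python
def collapse_modules(table, modules):
    """Sum the counts of each module's members into one new feature.

    table   -- dict mapping feature id -> list of per-sample counts (hypothetical layout)
    modules -- list of lists of feature ids
    """
    # Number modules by size, largest first, so that module-0 is the largest.
    ordered = sorted(modules, key=len, reverse=True)
    n_samples = len(next(iter(table.values())))
    collapsed = {}
    grouped = set()
    for idx, members in enumerate(ordered):
        summed = [0] * n_samples
        for feature in members:
            for j, count in enumerate(table[feature]):
                summed[j] += count
        collapsed[f"module-{idx}"] = summed
        grouped.update(members)
    # Features that belong to no module are carried over unchanged.
    for feature, counts in table.items():
        if feature not in grouped:
            collapsed[feature] = list(counts)
    return collapsed


def sample_totals(table):
    """Total counts per sample (column sums)."""
    return [sum(column) for column in zip(*table.values())]


# Toy table: five features across three samples.
table = {
    "otu_a": [10, 0, 3],
    "otu_b": [5, 2, 1],
    "otu_c": [0, 7, 4],
    "otu_d": [1, 1, 1],
    "otu_e": [2, 0, 6],
}
modules = [["otu_d", "otu_e"], ["otu_a", "otu_b", "otu_c"]]
collapsed = collapse_modules(table, modules)
```

In this toy example the three-member module becomes module-0 and the two-member module becomes module-1, and `sample_totals(collapsed)` equals `sample_totals(table)`.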
SCNIC allows users to choose among multiple methods for detecting correlations and for defining modules of co-occurring microbes. For correlations, SCNIC can use traditional correlation metrics (Pearson’s r, Spearman’s ρ and Kendall’s τ) or the compositionality- and sparsity-aware correlation metric from SparCC [38, 39], which corrects for these aspects of microbiome data. SparCC has been shown to perform well in detecting correlations compared to other correlation measures [13]. Specifically, SparCC performs well in communities with an inverse Simpson index above 13 (which would be indicative of a high number of successful species, a complex food web, and many ecological niches, as would be seen in many high biomass microbial communities such as gut or soil microbiomes) [39, 40], and it was thus chosen as the default metric.
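For reference, the inverse Simpson index of a sample is the reciprocal of the sum of squared relative abundances. The sketch below is purely illustrative; the function name and the toy count vectors are hypothetical and are not taken from SCNIC or SparCC.

```python
def inverse_simpson(counts):
    """Inverse Simpson index: 1 / sum(p_i ** 2), with p_i the relative abundances."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts")
    return 1.0 / sum((c / total) ** 2 for c in counts)


# Twenty equally abundant features: the index equals the number of features.
even = inverse_simpson([5] * 20)

# One dominant feature: the index falls close to 1.
skewed = inverse_simpson([96, 1, 1, 1, 1])
```

The even toy community lies above the threshold of 13 mentioned above, whereas the community dominated by a single feature lies far below it.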
To define modules of co-correlated features, we implement two methods: 1) Louvain modularity maximization (LMM) and 2) a novel shared minimum distance (SMD) module detection algorithm; unlike WGCNA, neither of these algorithms makes assumptions about network topology. LMM was previously proposed as a method for clustering correlation networks of microbes into modules [30]. LMM works by first assigning each feature to its own module. Each pair of adjacent modules is then provisionally joined, and the resulting change in modularity (defined by the number of edges within the module compared to outside) is calculated. The pair whose joining increases the mean modularity of the network the most is then merged. This process is repeated until the modularity of the network no longer increases. LMM uses two parameters provided by the user. The first parameter, R-value, defines the minimum correlation coefficient for defining an edge between features. The second parameter, gamma (also referred to as resolution), controls the size of modules detected, with large gamma values yielding larger modules.
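The quantity that LMM optimizes can be sketched as follows. The code assumes the commonly used Newman modularity with a resolution term, Q = Σ_c [L_c/m − γ(d_c/2m)²], where L_c is the number of edges inside module c, d_c is the summed degree of its members and m is the total number of edges. This formulation, the function names and the toy correlations are assumptions for illustration; how gamma is scaled differs between Louvain implementations, so the sketch should not be read as SCNIC's parameterization.

```python
def build_edges(correlations, min_r):
    """Keep feature pairs whose correlation is at or above the R-value cutoff."""
    return {pair for pair, r in correlations.items() if r >= min_r}


def modularity(edges, partition, gamma=1.0):
    """Newman modularity of a partition of an unweighted network (assumed formulation)."""
    m = len(edges)
    membership = {node: i for i, module in enumerate(partition) for node in module}
    internal = [0] * len(partition)   # edges with both ends inside the module
    degree = [0] * len(partition)     # summed degree of the module's members
    for u, v in edges:
        degree[membership[u]] += 1
        degree[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(l / m - gamma * (d / (2 * m)) ** 2 for l, d in zip(internal, degree))


# Toy correlations: two tightly correlated triplets joined by one weaker edge.
correlations = {
    ("a", "b"): 0.90, ("a", "c"): 0.80, ("b", "c"): 0.85,
    ("d", "e"): 0.90, ("d", "f"): 0.80, ("e", "f"): 0.90,
    ("c", "d"): 0.50, ("a", "f"): 0.10,
}
edges = build_edges(correlations, min_r=0.4)

split = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
merged = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])
```

With the default resolution, splitting the toy network into its two triplets scores higher than placing all six features in one module, which is the kind of comparison the joining step relies on.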
WGCNA and LMM have a potential weakness in that modules can contain pairs of taxa that are not strongly correlated (e.g. if they are several steps away from each other in the network). To address this weakness, we also implement the SMD method, which ensures that correlations between all pairs of features in a module have an R-value greater than the user-provided minimum (Figure 2). Specifically, the SMD method first applies complete linkage hierarchical clustering to the correlation coefficients to build a tree of features. Next, SMD defines modules as subtrees in which correlations between all pairs of tips have an R-value above the specified value. SMD has been set as the default method in SCNIC because of the desirable property of only producing modules in which all features are correlated above a user-specified threshold.
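The SMD idea can be sketched in a few lines. This is a simplified illustration rather than SCNIC's implementation: the function name, the pairwise-dictionary input and the toy correlations are hypothetical. The sketch relies on the fact that, under complete linkage, two clusters are only as close as their most weakly correlated cross pair, so stopping the merging once that weakest correlation is no longer above the cutoff is equivalent to cutting the tree at that height.

```python
from itertools import combinations


def smd_modules(features, correlation, min_r):
    """Group features so that every pair within a module has r > min_r.

    correlation -- dict mapping (feature, feature) -> r; missing pairs count as 0.0
    """
    def r(a, b):
        return correlation.get((a, b), correlation.get((b, a), 0.0))

    def weakest_link(c1, c2):
        # Complete linkage: similarity of two clusters is their weakest cross pair.
        return min(r(a, b) for a in c1 for b in c2)

    clusters = [[f] for f in features]
    while len(clusters) > 1:
        # Find the pair of clusters with the strongest weakest link.
        (i, j), best = max(
            (((i, j), weakest_link(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best <= min_r:
            break  # any further merge would create a pair at or below the cutoff
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

    # Only clusters with at least two members are reported as modules.
    modules = [sorted(c) for c in clusters if len(c) > 1]
    return sorted(modules, key=len, reverse=True)


# Toy data: c correlates strongly with b but only weakly with a.
correlation = {
    ("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.3,
    ("d", "e"): 0.7,
}
modules = smd_modules(["a", "b", "c", "d", "e"], correlation, min_r=0.5)
```

In the toy data, feature c is left out of the module containing a and b even though it is connected to b in the thresholded network, because its correlation with a falls below the cutoff.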
A large proportion of microbiome studies sample highly uneven communities, which leads to strong compositionality-driven artifacts [26, 40, 41]. Because of this, we use SparCC, specifically the FastSpar implementation [39], as the default correlation measure. This choice was also based on an analysis that suggested high precision in the edges recovered when correlations were calculated on synthetic data [13]. SCNIC additionally includes the option of using Pearson’s r, Spearman’s ρ or Kendall’s τ for non-compositional or dense data types.