BACKGROUND
Microbial communities play important roles in environmental and human
health systems and can often reach great complexity. In these rich
ecosystems, microbes interact with each other, forming relationships
based on predator-prey dynamics [1], competition for resources
[2], cross-feeding of small compounds, [3] and other factors.
Identifying correlated pairs of microbes can suggest potential
interactions or shared environmental preferences. Accordingly, studies
have identified complex networks of co-occurring microbes in a variety
of different environments ranging from the human mouth and gut [4]
to soil [5] and stream ecosystems [6].
To detect correlations between microbes, a variety of methods have been
developed. While traditional correlation metrics are used by some
[7-9], newer methods have been developed that take into account the
properties of 16S rRNA sequencing data [10-12]. A recent review
tested these methods on a variety of models and identified some methods
that performed better than others in ways that can depend on underlying
data characteristics [13]. Although these tools are useful for
finding pairwise relationships between organisms, less attention has
been given toward developing methods for finding correlations among
groups of microbes.
One way to explore complex interactions is to form networks in which
correlated organisms are joined with an edge, and highly correlated sets
of microbes are defined. Here, we refer to these sets as modules, which
are synonymous to clusters or groups. There are two primary benefits of
finding modules of correlated microbes. First, the combination of
microbes in a module could be further explored to understand microbial
interactions, such as cross-feeding relationships, or shared
environmental niches [5, 14-16]. Second, considering correlation
structure among microbes can aid in statistical analysis aimed at
uncovering relationships between microbes and other environmental
factors. Specifically, by eliminating or summarizing highly correlated
features, dependence between features is decreased. Feature reduction
will increase accuracy of methods that assume the independence of
features such as false discovery rate technique (FDR) measurements like
the Benjamini-Hochberg Correction [17], and statistical power is
increased by reducing the number of feature comparisons.
One workflow for considering groups of correlated microbes in downstream
statistical analyses requires three steps: first, correlations between
microbes must be measured and used to form a network; second, modules
must be identified; and third, abundance of the microbes in modules must
be summarized for use in subsequent statistical analyses. One software
tool that has implemented this workflow, developed for application to
gene expression data, is weighted gene correlation network analysis
(WGCNA)[18]. WGCNA builds correlation networks based on a
correlation coefficient (such as Pearson, Spearman, or biweight
midcorrelation [19]), and detects modules as subtrees in a
hierarchical cluster of features [20]. Modules are summarized by
setting module abundance to that of network hubs or an eigenvector of
the abundance of all module members [18].
Several groups have used WGCNA to find correlations within 16S rRNA
sequencing data [21-24], but this approach may not be appropriate
for several reasons [25]. First, the correlation metrics implemented
in WGCNA do not account for sparsity and compositionality. Most
sequencing-based microbiome datasets are sparse (i.e. there are many
zeros) and compositional, meaning they only carry information on
relative abundances of taxa instead of absolute abundances, which can
lead to the detection of spurious correlations if proper statistical
methods are not used [26]. Thus, the use of WGCNA for compositional
data may be leading to the detection of spurious edges in microbiome
networks. Second, the primary method WGCNA uses to pick modules assumes
the correlation network will have a scale-free topology that may not be
relevant to microbiome data [27]. Third, summarizing modules through
identifying hub taxa works well in gene expression where a single
transcription factor can control the expression of many genes, but may
not be appropriate in microbial communities. Both the hub and
eigenvector approaches to module summarization do not allow for output
tables that maintain the total counts of microbial abundance per sample.
Therefore, the hub and eigenvector approaches cannot be used with tools
developed for microbiome data analysis that make assumptions based on
total sample counts, such as ANCOM [28] or metagenomeSeq [29].
Optimal methods for identifying and summarizing modules of correlated
features in 16S rRNA sequencing data have not been deeply explored. One
study [25] recommended an ensemble approach for correlation
detection, and the Louvain modularity maximization (LMM) method [30]
to identify modules. LULU is a tool that follows a binning approach
towards OTUs that co-occur, but only does so if they are highly
phylogenetically related [31]. Another tool, CoNet, uses an ensemble
approach to build and visualize networks [32]. However, no
implementation of module summarization was made available for downstream
statistical analysis.
To address these gaps, we have developed a tool for sparse,
compositional correlation network investigation for compositional data
(SCNIC), which uses methods optimized for microbiome data analysis.
SCNIC is available as standalone Python software, via Bioconda [33]
and the package installer for Python (pip), and as a QIIME 2 plugin
[34]. The source code for SCNIC and the QIIME 2 plugin is freely
available on GitHub (https://github.com/lozuponelab/SCNIC ,
https://github.com/lozuponelab/q2-SCNIC) under the BSD-3-Clause License.