Approaches to identify clustering thresholds on the basis of
within-species and within-genus sequence similarities
We first examined within-species and within-genus sequence similarities
to evaluate four strategies for determining an appropriate clustering
threshold (Figure 1A): i) avoid over-splitting; ii) avoid over-merging;
iii) find a balance between over-splitting and over-merging, using two
distinct procedures based on either the intersection (iii-a) or the
modes (iii-b) of the probability density distributions. These
strategies are analogous to those adopted in traditional barcoding
studies to set the limit between intra-specific and inter-specific
diversity (Meyer and Paulay 2005).
Avoid over-splitting
In this case, the aim is to avoid splitting sequences belonging to the
same species across different clusters (i.e. to limit the probability
of generating additional spurious MOTUs). For this approach,
we selected as clustering threshold the 10% quantile of the
distribution of similarities between sequences from the same species
(within-species dataset). With this approach, the sequences belonging to
the same species according to EMBL are gathered in the same cluster in
90% of the cases.
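
As a minimal sketch of this quantile rule, the Python snippet below computes the 10% quantile of a small set of within-species similarities. The similarity values are invented for illustration, the 0 to 1 scale is an assumption, and the linear-interpolation rule (the one R's quantile function uses by default) is also an assumption, since the text does not state which quantile definition was used.

```python
def quantile(values, q):
    """Linear-interpolation quantile (assumed rule; matches R's default type 7)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie between 0 and 1")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)          # fractional index into the sorted data
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


# Hypothetical within-species pairwise similarities (illustrative values only).
within_species = [0.90, 0.92, 0.94, 0.95, 0.96, 0.97, 0.97, 0.98, 0.99, 0.99, 1.00]

# Strategy i: threshold at the 10% quantile of within-species similarities.
threshold_no_oversplit = quantile(within_species, 0.10)

# Fraction of within-species pairs whose similarity reaches the threshold.
kept_together = sum(s >= threshold_no_oversplit for s in within_species) / len(within_species)
```

By construction, about 90% of the within-species similarities lie at or above this threshold, which is the property the strategy relies on.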
Avoid over-merging
In this case, the aim is to avoid gathering sequences attributed to
different species of the same genus in the same cluster (i.e. to limit
the probability of merging related species in the same MOTU). For this
approach, we selected as clustering threshold the 90% quantile of the
distribution of similarities between sequences from different species
belonging to the same genus. With this approach, the sequences attributed to different
species belonging to the same genus are assigned to different clusters
in 90% of the cases.
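
The mirror-image rule can be sketched with Python's standard statistics module. The within-genus similarities below are simulated from an arbitrary beta distribution purely for illustration, and the "inclusive" interpolation method is an assumption about how the quantile is defined; neither comes from the original analysis.

```python
import random
import statistics

# Hypothetical within-genus similarities (sequence pairs from different
# species of the same genus), simulated only to have something to summarize.
rng = random.Random(42)
within_genus = [rng.betavariate(20, 3) for _ in range(1000)]

# quantiles(n=10) returns the nine deciles; the last one is the 90% quantile.
deciles = statistics.quantiles(within_genus, n=10, method="inclusive")
threshold_no_overmerge = deciles[-1]

# Fraction of within-genus pairs whose similarity does not exceed the
# threshold, i.e. pairs that this strategy aims to keep in different clusters.
separated = sum(s <= threshold_no_overmerge for s in within_genus) / len(within_genus)
```

Here separated comes out close to 0.9 by construction, which corresponds to the 90% figure given above.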
Find a balance between over-splitting and over-merging
In this case, the aim is to minimize both over-splitting and
over-merging. We considered two distinct approaches. First, we estimated
the probability density distributions of within-species and within-genus
pairwise sequence similarities using the density function in R, with
biased cross-validation (bw="bcv") as smoothing bandwidth selector and
a Gaussian smoothing kernel (kernel="gaussian"; Venables and Ripley
2002). Other possible smoothing bandwidth selectors were tested, but
biased cross-validation provided the best fit to the score
histograms for all markers and all datasets (data not shown). The
balance threshold iii-a was then identified as the intersection
between the probability density distributions of the within-species and
within-genus similarities. As an alternative approach to balance
over-merging and over-splitting (iii-b), we calculated the
midpoint between the modes of the within-species and within-genus
probability density distributions.
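
The two balance thresholds can be illustrated with a small, self-contained Python sketch. It is not the R procedure described above: the samples are simulated, the densities are evaluated on a fixed grid, and the bandwidth comes from a simplified Silverman-style rule of thumb used here as a stand-in for biased cross-validation. All function and variable names are hypothetical.

```python
import math
import random
import statistics


def gaussian_kde(sample, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)

    return density


def rule_of_thumb_bandwidth(sample):
    """Simplified Silverman rule; an assumed stand-in for biased cross-validation."""
    return 0.9 * statistics.stdev(sample) * len(sample) ** -0.2


# Hypothetical similarity samples on a 0 to 1 scale (simulated, clipped to [0, 1]).
rng = random.Random(1)
within_species = [min(1.0, max(0.0, rng.gauss(0.96, 0.012))) for _ in range(300)]
within_genus = [min(1.0, max(0.0, rng.gauss(0.88, 0.03))) for _ in range(300)]

f_species = gaussian_kde(within_species, rule_of_thumb_bandwidth(within_species))
f_genus = gaussian_kde(within_genus, rule_of_thumb_bandwidth(within_genus))

# Evaluate both densities on a common grid of similarity values.
grid = [i / 1000 for i in range(700, 1001)]
d_species = [f_species(x) for x in grid]
d_genus = [f_genus(x) for x in grid]

# Mode of each density: the grid point where the estimate is highest.
i_species = max(range(len(grid)), key=d_species.__getitem__)
i_genus = max(range(len(grid)), key=d_genus.__getitem__)
mode_species, mode_genus = grid[i_species], grid[i_genus]

# Strategy iii-b: midpoint between the two modes.
threshold_modes = (mode_species + mode_genus) / 2.0

# Strategy iii-a: first grid point between the modes at which the
# within-species density catches up with the within-genus density.
lo, hi = sorted((i_genus, i_species))
i_cross = next(i for i in range(lo, hi + 1) if d_species[i] >= d_genus[i])
threshold_intersection = grid[i_cross]
```

Restricting the search to the interval between the two modes is a design choice of this sketch: it avoids picking up crossings in the far tails, where both densities are close to zero.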