Approaches to identify clustering thresholds on the basis of within-species and within-genus sequence similarities
We first examined within-species and within-genus sequence similarities to evaluate four strategies for determining an appropriate clustering threshold (Figure 1A): (i) avoiding over-splitting; (ii) avoiding over-merging; and (iii) balancing over-splitting and over-merging, using two distinct procedures based either on the intersection (iii-a) or on the modes (iii-b) of the probability density distributions. These strategies are analogous to those adopted in traditional barcoding studies to set the limit between intra-specific and inter-specific diversity (Meyer and Paulay 2005).
Avoid over-splitting
In this case, the aim is to avoid distributing sequences belonging to the same species across different clusters (i.e. to limit the probability of generating spurious additional MOTUs). For this approach, we selected as the clustering threshold the 10% quantile of the distribution of similarities between sequences from the same species (within-species dataset). With this threshold, sequences belonging to the same species according to EMBL are gathered in the same cluster in 90% of cases.
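The quantile rule above can be sketched in a few lines. This is a minimal illustration in Python with NumPy, not the code used in the study; the function name `over_splitting_threshold` and the simulated similarity values are assumptions made for the example.

```python
import numpy as np

def over_splitting_threshold(within_species_sims, q=0.10):
    """Return the q-quantile of within-species pairwise similarities.

    Pairs with a similarity at or above this value would fall in the
    same cluster, so a fraction of roughly (1 - q) of them is kept together.
    """
    sims = np.asarray(within_species_sims, dtype=float)
    return float(np.quantile(sims, q))

# Simulated similarities in [0, 1], for illustration only (not real data).
rng = np.random.default_rng(0)
within_species = rng.beta(20, 1, size=1000)

threshold_split = over_splitting_threshold(within_species)
# Fraction of within-species pairs that stay together at this threshold.
kept_together = float(np.mean(within_species >= threshold_split))
```

By construction, `kept_together` is close to 0.90 whatever the shape of the similarity distribution.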
Avoid over-merging
In this case, the aim is to avoid gathering sequences attributed to different species of the same genus in the same cluster (i.e. to limit the probability of merging related species in the same MOTU). For this approach, we selected as the clustering threshold the 90% quantile of the distribution of similarities between sequences from different species belonging to the same genus (within-genus dataset). With this threshold, sequences attributed to different species of the same genus are assigned to different clusters in 90% of cases.
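As a rough illustration of this criterion, the sketch below takes the 90% quantile of a set of within-genus similarities and then measures how often congeneric pairs would still be merged at that threshold. It is written in Python with NumPy as an assumption for the example; the helper names and the simulated values are hypothetical.

```python
import numpy as np

def over_merging_threshold(within_genus_sims, q=0.90):
    """Return the q-quantile of similarities between congeneric species."""
    sims = np.asarray(within_genus_sims, dtype=float)
    return float(np.quantile(sims, q))

def merge_rate(within_genus_sims, threshold):
    """Fraction of congeneric pairs with similarity at or above the threshold."""
    sims = np.asarray(within_genus_sims, dtype=float)
    return float(np.mean(sims >= threshold))

# Simulated similarities in [0, 1], for illustration only (not real data).
rng = np.random.default_rng(1)
within_genus = rng.beta(8, 2, size=1000)

threshold_merge = over_merging_threshold(within_genus)
merged = merge_rate(within_genus, threshold_merge)
```

Here `merged` is close to 0.10, which is the complement of the 90% of congeneric pairs assigned to different clusters.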
Find a balance between over-splitting and over-merging
In this case, the aim is to minimize both over-splitting and over-merging. We considered two distinct approaches. First, we estimated the probability density of within-species and within-genus pairwise sequence similarities using the density function in R, with biased cross-validation (bw="bcv") as the smoothing bandwidth selector and a Gaussian smoothing kernel (kernel="gaussian"; Venables and Ripley 2002). Other smoothing bandwidth selectors were tested, but biased cross-validation gave the best fit to the score histograms for all markers and all datasets (data not shown). The balance threshold iii-a was then identified as the intersection between the probability density distributions of the within-species and within-genus similarities. As an alternative approach to balancing over-merging and over-splitting (iii-b), we calculated the midpoint between the modes of the within-species and within-genus probability density distributions.
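The two balance thresholds can be sketched as follows. This is a minimal Python/NumPy illustration under stated assumptions, not a reimplementation of the R workflow: the Gaussian kernel density estimate is written by hand, and the bandwidth is set by Silverman's rule of thumb rather than by biased cross-validation, so the values will differ from those obtained with bw="bcv". The function names, the grid size and the simulated data are all hypothetical.

```python
import numpy as np

def gaussian_kde(samples, grid):
    """Gaussian kernel density estimate evaluated at each grid point.

    Assumption: the bandwidth follows Silverman's rule of thumb,
    swapped in here for the biased cross-validation selector.
    """
    x = np.asarray(samples, dtype=float)
    n = x.size
    q75, q25 = np.percentile(x, [75, 25])
    spread = min(x.std(ddof=1), (q75 - q25) / 1.34)
    h = 0.9 * spread * n ** (-0.2)
    z = (grid[:, None] - x[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2.0 * np.pi))

def balance_thresholds(within_species, within_genus, n_grid=2048):
    """Return (intersection, midpoint), i.e. thresholds iii-a and iii-b."""
    grid = np.linspace(0.0, 1.0, n_grid)
    d_species = gaussian_kde(within_species, grid)
    d_genus = gaussian_kde(within_genus, grid)

    # iii-b: midpoint between the modes of the two estimated densities.
    mode_species = grid[np.argmax(d_species)]
    mode_genus = grid[np.argmax(d_genus)]
    midpoint = 0.5 * (mode_species + mode_genus)

    # iii-a: first point between the two modes where the densities cross.
    lo, hi = sorted((mode_species, mode_genus))
    idx = np.where((grid >= lo) & (grid <= hi))[0]
    diff = d_species - d_genus
    crossing = idx[:-1][np.sign(diff[idx[:-1]]) != np.sign(diff[idx[1:]])]
    intersection = float(grid[crossing[0]]) if crossing.size else None
    return intersection, float(midpoint)

# Simulated similarities, for illustration only (not real data).
rng = np.random.default_rng(2)
sim_species = np.clip(rng.normal(0.97, 0.01, 500), 0.0, 1.0)
sim_genus = np.clip(rng.normal(0.90, 0.02, 500), 0.0, 1.0)

intersection, midpoint = balance_thresholds(sim_species, sim_genus)
```

Restricting the search for the crossing to the interval between the two modes is a design choice of this sketch: it avoids spurious crossings in the tails, where both densities are close to zero. The intersection is resolved only to the spacing of the evaluation grid.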