Discussion
Amplicon sequencing remains the most common method for identifying microbial communities, largely due to its low price and high throughput relative to newer techniques (e.g., long-read sequencing, shotgun metagenomics). As the popularity of amplicon sequencing continues to grow, so does the wealth of archived 16S rRNA sequences. Understanding how bioinformatics choices affect the definition of species, and how this in turn affects the detection of microbial diversity and of changes in this diversity, is essential for the interpretation and reuse of these data (Jurburg et al., 2022). This work evaluated how shorter read lengths affect the detection of microbial taxa, their taxonomic assignments, and biodiversity estimates derived from these data. Its findings indicate that short read lengths recover biodiversity patterns, but that special caution should be taken in selecting the biodiversity metrics used to examine these data.
As expected, shorter read lengths resulted in more unclassified ASVs, but this was dependent on the target taxonomic level and varied across read lengths. Classification was best in the animal dataset, which was the least diverse and best-characterized system. Importantly, read lengths greater than 100 bp yielded only marginal improvements in taxonomic assignments at the family level and above for all the datasets used, suggesting that, if only forward reads are available, little information is lost with 100 bp reads relative to the full forward read. Our results also highlight that genus-level taxonomic assignments depend greatly on how well characterized the microbiota of the target environment are, and suggest that interpreting genus-level assignments is not advisable for shorter reads (Thompson et al., 2017).
Further analyses highlighted the robustness of alpha and beta diversity metrics, especially abundance-weighted metrics (i.e., inverse Simpson index and Bray-Curtis dissimilarities), to shorter read lengths. Reads of 90 bp could recover the majority of the alpha diversity observed with 200 bp, as well as the dissimilarity between communities belonging to both biological replicates (i.e., variance or dispersion) and different treatments. Importantly, the similarity between the 200 bp datasets and their shorter versions increased with read length when assessed with incidence-based Sorensen dissimilarities, but remained high for abundance-weighted Bray-Curtis dissimilarities, even for the shortest reads. As these two dissimilarity metrics differ only in their abundance weighting, the differences observed between them suggest that rare taxa are the ones most affected by shorter read lengths, highlighting the dependence of rare taxa on bioinformatics parameters.
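The contrast between the two dissimilarities can be illustrated with a minimal sketch. The abundance vectors below are hypothetical and are not taken from the datasets analyzed here; they assume a simplified case in which the five rarest taxa go undetected in the shorter-read version of a sample, while abundant taxa are unaffected.

```python
def bray_curtis(x, y):
    """Abundance-weighted dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator


def sorensen(x, y):
    """Incidence-based dissimilarity: Bray-Curtis on presence/absence data."""
    presence_x = [1 if a > 0 else 0 for a in x]
    presence_y = [1 if b > 0 else 0 for b in y]
    return bray_curtis(presence_x, presence_y)


# Hypothetical read counts per taxon (illustrative only).
full_length = [500, 300, 150, 20, 10, 5, 5, 5, 3, 2]
# Same sample with the five rarest taxa missing (assumed effect of trimming).
short_reads = [500, 300, 150, 20, 10, 0, 0, 0, 0, 0]

bc = bray_curtis(full_length, short_reads)
so = sorensen(full_length, short_reads)
print(f"Bray-Curtis: {bc:.3f}, Sorensen: {so:.3f}")
```

In this toy case the lost taxa account for only 20 of 1,000 reads, so the Bray-Curtis dissimilarity stays close to zero, whereas the Sorensen dissimilarity rises to one third because half of the taxa are no longer present.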
Similarly, the detection of ASVs increased linearly with read length until a saturation point that aligned with the expected diversity in each environment explored (i.e., from least to most diverse, the animal, aquatic, and soil microbiomes). These results highlight the importance of considering diversity estimates, particularly incidence-based alpha diversity metrics (i.e., richness), as a function of read length and of the trimming parameters used. In the case of data reuse and comparison among datasets, this study demonstrates the importance of applying a uniform read length across datasets in order to obtain comparable diversity estimates.
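The dependence of richness on read length can be sketched with a toy example. The sequences below are hypothetical, and the function only mimics one consequence of truncation (variants that differ solely beyond the truncation point become indistinguishable); it does not reproduce the denoising performed by an ASV-inference pipeline.

```python
def richness_at_length(sequences, read_length):
    """Count distinct sequences after truncating each to read_length bases."""
    return len({seq[:read_length] for seq in sequences})


# Hypothetical 10-base variants that differ at positions 1, 4, 7, and 10.
toy_variants = [
    "ACGTACGTAC",
    "ACGTACGTAG",
    "ACGTACCTAC",
    "ACGAACGTAC",
    "TCGTACGTAC",
]

for length in (1, 4, 7, 10):
    print(length, richness_at_length(toy_variants, length))
```

Truncating all datasets to the same length before comparison ensures that differences in richness reflect the communities rather than the amount of sequence available to distinguish variants.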
With second-generation sequence data (e.g., Illumina MiSeq), sequence quality decreases with read length (Callahan et al., 2016). Consequently, fewer reads pass quality checking, resulting in fewer reads (or observations) in the final, processed dataset. Short read lengths may therefore increase the number of observations per sample, particularly for low-quality sequence data. Furthermore, different studies employ different sequencing platforms, which produce reads of variable lengths; the shortest are produced by Illumina HiSeq, which has a maximum read length of 150 bp, including barcodes and primers (Di Bella, Bao, Gloor, Burton, & Reid, 2013). In the case of paired-end sequence data, often only forward or merged reads are archived (Jurburg et al., 2020). This work demonstrates how one aspect of sequence processing (i.e., trimming) affects the detection and taxonomic assignment of microbial diversity. Several studies have examined how technical choices, such as primer choice (Fouhy, Clooney, Stanton, Claesson, & Cotter, 2016; Martínez-Porchas, Villalpando-Canchola, & Vargas-Albores, 2016; Tremblay et al., 2015), pipeline selection (Marizzoni et al., 2020), and rarefaction (McKnight et al., 2018; Weiss et al., 2017), affect the detection of diversity. However, systematic assessments of how other technical choices (particularly bioinformatics parameters, e.g., chimera checking) affect microbial diversity estimates are lacking and urgently needed. Importantly, short reads enable the reuse of sequence data in their rawest form, allowing for complete and unified reprocessing of the sequence data from different studies, which may in turn improve comparability among them (Kang et al., 2021).
Processing metabarcoding data requires making a series of choices that affect the final dataset and its interpretation (Abellan-Schneyder et al., 2021). Sequence trimming is a critical part of processing, but its effects on the resulting diversity estimates are often overlooked. The analyses presented focused on the effect of sequence trimming in the popular DADA2 pipeline, which detects amplicon sequence variants (ASVs) rather than grouping sequences into clusters of 97% sequence similarity. DADA2 has been extensively validated, and exhibits high sensitivity to ASVs (Prodan et al., 2020). While the findings of this study may guide the general processing of amplicon sequencing data, it is important to note that they are specific to the DADA2 pipeline.
This study lays the groundwork for the analysis and reanalysis of metabarcoding data using short read lengths, and yields several recommendations. First, when comparing data with different technical backgrounds (e.g., from different studies), trimming to the same read length is important, especially for the analysis of alpha diversity. Second, when using short read lengths, genus-level classifications should be interpreted with caution. Third, abundance-weighted diversity metrics (i.e., inverse Simpson index, Bray-Curtis dissimilarity) are more robust to read length than incidence-based metrics (i.e., richness and Sorensen dissimilarity). Finally, the detection of microbial diversity from sequence data is far from absolute, and should instead be considered relative to the read length employed.
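The third recommendation can be illustrated for alpha diversity with a minimal sketch. The counts below are hypothetical and assume, for illustration only, that shorter reads cause the five rarest taxa to go undetected; they are not results from this study.

```python
def richness(counts):
    """Incidence-based alpha diversity: number of taxa with at least one read."""
    return sum(1 for c in counts if c > 0)


def inverse_simpson(counts):
    """Abundance-weighted alpha diversity: 1 / sum(p_i^2)."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts if c > 0)


# Hypothetical read counts per taxon (illustrative only).
full_length = [500, 300, 150, 20, 10, 5, 5, 5, 3, 2]
short_reads = [500, 300, 150, 20, 10, 0, 0, 0, 0, 0]

richness_retained = richness(short_reads) / richness(full_length)
simpson_retained = inverse_simpson(short_reads) / inverse_simpson(full_length)
print(f"Richness retained: {richness_retained:.2f}")
print(f"Inverse Simpson retained: {simpson_retained:.2f}")
```

Because the inverse Simpson index is dominated by the most abundant taxa, it retains most of its value in this toy case, while richness is halved.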