Sequence processing and data analysis
Sequence data and metadata were downloaded from NCBI and processed using the dada2 pipeline (Callahan et al., 2016) with standard filtering parameters (maxN=0, maxEE=2, truncQ=2). As our goal was to explore the impact of shorter read lengths on the taxonomic assignment of prokaryotes, and on the ecological conclusions derived from the data, only the forward reads from each dataset were used. Importantly for sequence data reuse, reverse reads are often unavailable in archived sequence data (Jurburg et al., 2020), either because paired-end sequencing was not performed or because the reverse reads were not archived. Indeed, in one of the datasets used (Qian et al., 2017), the paired-end reads had been merged prior to archiving. For each sample, read length was varied from 50 to 200 bp in intervals of 10 bp. This range of read lengths was selected as it represents the minimum output of all next-generation sequencing technologies. Taxonomy was assigned using SILVA v138 (Quast et al., 2013). For all samples, the number of unassigned reads at each taxonomic level and the percentage of original reads included in the final ASV table were recorded.
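The read-length sweep described above can be illustrated with a minimal Python sketch. It is not the dada2 code used in the study. It assumes that reads shorter than the target length are discarded rather than kept at their original length, and the function names and toy reads are hypothetical.

```python
def truncation_lengths(start=50, stop=200, step=10):
    """Return the read lengths in the sweep, inclusive of both ends."""
    return list(range(start, stop + 1, step))


def truncate_reads(reads, length):
    """Trim each read to `length` bases, dropping reads that are too short.

    Assumption: short reads are discarded, so every surviving read
    has exactly `length` bases.
    """
    return [read[:length] for read in reads if len(read) >= length]


# Toy reads, purely illustrative (not taken from any of the datasets).
toy_reads = ["ACGT" * 60, "ACGT" * 20, "ACGT" * 50]  # 240, 80 and 200 bp

lengths = truncation_lengths()

# Number of toy reads that survive truncation at each length in the sweep.
retained = {n: len(truncate_reads(toy_reads, n)) for n in lengths}
```

In this sketch the sweep contains 16 lengths (50, 60, ..., 200), and the number of retained reads can only stay the same or decrease as the target length grows. This is one reason to track the percentage of original reads that reaches the final ASV table.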
ASV tables were analyzed using phyloseq (McMurdie & Holmes, 2013) and vegan (Oksanen et al., 2007). To compare diversity estimates, all versions of each dataset were rarefied to the lowest number of reads (23,354 reads for the water dataset, 28,105 reads for the soil dataset, and 12,481 reads for the animal dataset). Unless otherwise noted, all analyses were performed on chimera-checked data. To explore the impact of read length on the detection of microbial alpha diversity, the five control samples of each dataset were selected to measure richness and inverse Simpson diversity (Chao, Chiu, & Jost, 2014), which are more heavily weighted by the rare and dominant taxa, respectively. Similarly, to explore the effects of read length on beta diversity, Bray-Curtis and Sørensen dissimilarities between samples were examined. To assess the extent to which read length affected the ecological conclusions derived from the data, samples taken before and one day after disturbance were compared for each dataset. For alpha diversity, control and disturbed samples were compared using a Wilcoxon test; for beta diversity, they were compared using a PERMANOVA (adonis2) at each read length. Finally, to examine the loss of ecological information with read length, a Mantel test was performed for each dataset between the dissimilarities (Bray-Curtis and Sørensen) obtained at the longest read length (200 bp) and those obtained at each shorter read length.
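For readers less familiar with the metrics named above, the following Python sketch shows how rarefaction, richness, inverse Simpson diversity, and the Bray-Curtis and Sørensen dissimilarities can be computed from ASV count vectors. It is an illustration only, not the phyloseq/vegan code used in the study. The function names, the random seed, and the toy counts are hypothetical.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from a count vector."""
    if depth > sum(counts):
        raise ValueError("depth exceeds the number of reads in the sample")
    # One pool entry per read, labelled with the index of its ASV.
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    drawn = random.Random(seed).sample(pool, depth)
    out = [0] * len(counts)
    for i in drawn:
        out[i] += 1
    return out


def richness(counts):
    """Number of ASVs observed at least once."""
    return sum(1 for c in counts if c > 0)


def inverse_simpson(counts):
    """Inverse Simpson diversity: 1 / sum of squared relative abundances."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return 1.0 / sum((c / total) ** 2 for c in counts)


def bray_curtis(a, b):
    """Abundance-based dissimilarity between two samples."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


def sorensen(a, b):
    """Presence/absence dissimilarity: 1 - 2 * shared / (S_a + S_b)."""
    shared = sum(1 for x, y in zip(a, b) if x > 0 and y > 0)
    return 1.0 - 2.0 * shared / (richness(a) + richness(b))


# Toy count vectors over four ASVs (illustrative values only).
control = [10, 10, 0, 0]
disturbed = [5, 5, 5, 5]

rarefied_control = rarefy(control, depth=10)
alpha = {
    "control": (richness(control), inverse_simpson(control)),
    "disturbed": (richness(disturbed), inverse_simpson(disturbed)),
}
beta = {
    "bray_curtis": bray_curtis(control, disturbed),
    "sorensen": sorensen(control, disturbed),
}
```

The Sørensen dissimilarity equals the Bray-Curtis dissimilarity computed on presence/absence data. The two metrics therefore separate changes in which taxa are detected from changes in their relative abundances.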