Bioinformatics read processing
Raw reads were quality checked in Fastqc (Babraham Institute, 2013). Primers were trimmed using fastx_trimmer and reads processed in Trimmomatic (Bolger, Lohse, & Usadel, 2014) using TRAILING:20 . Individual libraries were further processed, implementing several steps of the Usearch (Edgar, 2013) pipeline: reads were merged (optionmergepairs -fastq_minovlen 50, -fastq_maxdiffs 15 ), quality-filtered (Maxee = 1 ), trimmed to full length amplicons of 416-420 bp (-sortbylength ), dereplicated (-fastx_uniques ), and denoised and chimera checked (-unoise3, -minsize 2 ). Denoised reads from all 104 libraries, representing putative haplotypes, were then combined and dereplicated to yield a set of unique sequences across all samples, referred to as amplicon sequence variants (ASVs from here on; Callahan et al. , 2016). MEGAN V5 (Huson, Auch, Qi, & Schuster, 2007) with the lowest common ancestor (LCA) algorithm was used to compute the taxonomic affinity of each ASV. This classification was based on the result of a BLAST search (blastn -outfmt 5 -evalue 0.001 ) against a reference library including the NCBI nt database (Accessed at June 2018) together with an additional 559 unpublished taxonomically assigned Iberian sequences of Acari, Collembola, and Coleoptera.
ASVs classified by MEGAN as Acari, Collembola, and Coleoptera were processed with metaMATE (Andújar et al., 2021). MetaMATE evaluates the survival of ASVs under alternative filtering procedures based on the relative abundance of co-distributed ASVs. Briefly, the application of metaMATE involves a six-step procedure: (i) identification of verified authentic ASVs (va-ASVs) by 100% matching against a reference COI sequence; (ii) identification of ASVs including indels or STOP codons as verified non-authentic ASVs (vna-ASVs); (iii) generation of a community table with read-counts (ASV abundance) by sample against the complete collection of reads (i.e., before the dereplicating and denoising steps) using Usearch (-search_exact option); (iv) filtering with a range of criteria and threshold values; (v) evaluation of the survival of va-ASVs and vna-ASVs, and (vi) estimation of the predicted number of a-ASVs and na-ASVs, for every filtering iteration. Filtering parameters can thus be chosen according to desired stringency for the survival of a-ASVs and na-ASVs. (see Andújar et al. , 2021 for further details)
The following input files were used to run MetaMATE: (i) the set of unique ASVs (-A option); (ii) a reference dataset (-R) for the identification of va-ASVs, including all BOLD sequences for Acari, Collembola, and Coleoptera (downloaded at May 2020) plus 1,011 sequences from specimens collected at the Iberian Peninsula and the Canary Islands; (iii) all reads prior to the dereplicating and denoising steps (-L), and; (4) the specification file including filtering criteria and parameters to be evaluated (-S) (parameters used: –refmatchlength 350 –refmatchpercent 100 –expectedlength 418). Filtering was explored using both (i) minimum absolute and minimum percentage abundance by library and (ii) minimum percentage abundance by library and lineages at 20% divergence, and all pair combinations of these (See MetaMATE tutorial for details). Analyses were conducted independently for Acari, Coleoptera, and Collembola. Filtering parameters were selected for each taxon to maximize the number of surviving va-ASVs while maintaining the predicted contribution of na-ASVs to the final dataset to be ≤ 5%. Finally, the filtered set of ASVs was further filtered to reduce any potential cross-contamination problems across samples by removing ASVs with four or fewer reads from each library. Community tables of fully filtered haplotypes were then transformed into incidence (presence/absence) data for further analyses.