Bioinformatics read processing
Raw reads were quality checked in Fastqc (Babraham Institute, 2013).
Primers were trimmed using fastx_trimmer and reads processed in
Trimmomatic (Bolger, Lohse, & Usadel, 2014) using TRAILING:20 .
Individual libraries were further processed, implementing several steps
of the Usearch (Edgar, 2013) pipeline: reads were merged (optionmergepairs -fastq_minovlen 50, -fastq_maxdiffs 15 ),
quality-filtered (Maxee = 1 ), trimmed to full length amplicons of
416-420 bp (-sortbylength ), dereplicated
(-fastx_uniques ), and denoised and chimera checked
(-unoise3, -minsize 2 ). Denoised reads from all 104 libraries,
representing putative haplotypes, were then combined and dereplicated to
yield a set of unique sequences across all samples, referred to as
amplicon sequence variants (ASVs from here on; Callahan et al. ,
2016). MEGAN V5 (Huson, Auch, Qi, & Schuster, 2007) with the lowest
common ancestor (LCA) algorithm was used to compute the taxonomic
affinity of each ASV. This classification was based on the result of a
BLAST search (blastn -outfmt 5 -evalue 0.001 ) against a reference
library including the NCBI nt database (Accessed at June 2018)
together with an additional 559 unpublished taxonomically assigned
Iberian sequences of Acari, Collembola, and Coleoptera.
ASVs classified by MEGAN as Acari, Collembola, and Coleoptera were
processed with metaMATE (Andújar et al., 2021). MetaMATE evaluates the
survival of ASVs under alternative filtering procedures based on the
relative abundance of co-distributed ASVs. Briefly, the application of
metaMATE involves a six-step procedure: (i) identification of verified
authentic ASVs (va-ASVs) by 100% matching against a reference COI
sequence; (ii) identification of ASVs including indels or STOP codons as
verified non-authentic ASVs (vna-ASVs); (iii) generation of a community
table with read-counts (ASV abundance) by sample against the complete
collection of reads (i.e., before the dereplicating and denoising steps)
using Usearch (-search_exact option); (iv) filtering with a range of
criteria and threshold values; (v) evaluation of the survival of va-ASVs
and vna-ASVs, and (vi) estimation of the predicted number of a-ASVs and
na-ASVs, for every filtering iteration. Filtering parameters can thus be
chosen according to desired stringency for the survival of a-ASVs and
na-ASVs. (see Andújar et al. , 2021 for further details)
The following input files were used to run MetaMATE: (i) the set of
unique ASVs (-A option); (ii) a reference dataset (-R) for the
identification of va-ASVs, including all BOLD sequences for Acari,
Collembola, and Coleoptera (downloaded at May 2020) plus 1,011 sequences
from specimens collected at the Iberian Peninsula and the Canary
Islands; (iii) all reads prior to the dereplicating and denoising steps
(-L), and; (4) the specification file including filtering criteria and
parameters to be evaluated (-S) (parameters used: –refmatchlength 350
–refmatchpercent 100 –expectedlength 418). Filtering was explored
using both (i) minimum absolute and minimum percentage abundance by
library and (ii) minimum percentage abundance by library and lineages at
20% divergence, and all pair combinations of these (See MetaMATE
tutorial for details). Analyses were conducted independently for Acari,
Coleoptera, and Collembola. Filtering parameters were selected for each
taxon to maximize the number of surviving va-ASVs while maintaining the
predicted contribution of na-ASVs to the final dataset to be ≤ 5%.
Finally, the filtered set of ASVs was further filtered to reduce any
potential cross-contamination problems across samples by removing ASVs
with four or fewer reads from each library. Community tables of fully
filtered haplotypes were then transformed into incidence
(presence/absence) data for further analyses.