Population sampling and DNA extraction
Between 2018 and 2019, we collected 140 individuals of C. cactorum from 28 sampling sites spanning a broad range of its native distribution in southern South America (Figure 1; Table S1). Individuals were collected directly from host plants, including six native taxa,O. megapotamica, O. rioplatense, O. elata, O. penicilligera, O. quimilo, and O. anacantha , and the exotic O. ficus-indica(Varone et al., 2014). Given that adult moths have nocturnal activity and a short lifespan, only eggsticks and I-IV instar larvae were collected. Collected eggsticks were maintained in the lab under constant temperature (25°C) and humidity (70%). After eggs hatched, larvae were reared on O. ficus-indica until the IV instar stage. To avoid the inclusion of siblings, only a single individual per sampled plant and eggstick was used for genetic analysis. Larvae were individually transferred to vials containing 100% ethanol and stored at –20°C until DNA extraction. Before DNA extraction, individual larvae were dissected to remove any food residue, and ~30 mg of clean tissue was used for DNA extraction using the Qiagen DNeasy Blood & Tissue Kit following manufacturer’s instructions (Qiagen, Inc.).
Genomic libraries preparation, filtering and assembling
Individuals were processed into two genomic libraries following the double digest restriction-site Associated DNA (ddRAD) procedure described in Peterson, Weber, Kay, Fisher and Hoekstra (2012). Genomic DNA was fragmented using NspI and MboI restriction enzymes, and then purified and ligated to barcoded adapters. Samples were pooled within multiplexing batches and bead purified. Libraries were sequenced in paired-end 125 bp mode on a HiSeq2500 Illumina instrument.
Demultiplexed raw reads were checked for quality using fastqc v0.11.15 (Andrews et al., 2010), and outputs were compiled and summarized on MultiQC v1.10.1 (Ewels, Magnusson, Lundin & Käller, 2016). Reads were assembled with IPYRAD v.0.9.59 (Eaton & Overcast, 2020) using the strictest filter for Illumina adapters. A reference based assembly method was implemented for mapping the filtered reads against theC. cactorum draft genome (NCBI accession number: JADGIL000000000). As this genome was assembled from an adult male (clean DNA can be obtained from males, as they do not feed), it serves as a filter for exogenous reads derived from larval food (Poveda-Martinez et al., 2022). In subsequent IPYRAD assembly steps, we allowed 20% of SNPs per RAD locus, and shared heterozygous sites occurring across a maximum of 25% of samples. A minimum of 90% was set for the minimum number of samples scored per locus. We used vcftools v1.15 (Danecek et al., 2011) to remove SNPs with minimum allele frequency lower than 3%, missing data per site across individuals exceeding 25%, and to keep SNPs with read depths between 6X to 100X. Individuals with more than 20% missing data were excluded from further analyses. In order to prune SNPs in linkage disequilibrium, PLINK v1.9 (Purcell et al., 2007) was used with a window size of 50 bp, window shift of 5 and VIF threshold of 2. SNPs under selection were identified using Bayescan v.2.1 (Foll & Gaggiotti, 2008) and excluded from the dataset. A fragment of the mitochondrial gene encoding for the Cytochrome oxidase subunit I (COI ) (mtDNA) was amplified using primers C1-J-2183 (Simon et al., 1994) and PatII (Caterino & Sperling, 1999). Amplifications, PCR-product purification and sequencing were performed as detailed in Poveda-Martínez et al., (2022).
Assessing genomic population structure
Population genetic structure was inferred using genome-wide SNP data (nDNA hereafter) and the sparse non-negative matrix factorization approach (sNMF) (Frichot, Mathieu, Trouillon, Bouchard & François, 2014), as implemented in R using the LEA package (Frichot & François, 2015). The number of genetic clusters that best described our data was assessed with 100 repetitions for each possible K value (from K=1 to K=10) and using the cross-entropy criterion. In addition to sNMF, genetic structure was approximated by using the major axes of genomic variation obtained from a Principal Component Analysis (PCA), as implemented in the hierfstat R package (Goudet, 2005). We used mtDNA data to construct a haplotype network using the Median Joining algorithm (Bandelt, Forster, & Röhl, 1999) as implemented in PopArt v.1.7.1 (Leigh & Bryant, 2015).
Phylogenetic analyses
Phylogenetic relationships were reconstructed among the main genetic groups as inferred in sNMF using nDNA data and the coalescent-based method implemented in SNAPP (Bryant, Bouckaert, Felsenstein, Rosenberg, & RoyChoudhury, 2012). Due to computational burden, SNAPP analyses were run including only five individuals per genetic cluster, selecting those with an ancestry coefficient higher than 0.85, according to sNMF. One SNP per locus was randomly selected resulting in a new matrix of 2,086 SNPs shared across tips. The default model parameters were used in SNAPP for U and V equal to one, and the analysis was run for 5,000,000 MCMC generations, sampling every 1000 generations in Beast v.2.5.2 (Bouckaert et al., 2014). The complete set of trees was visualized in Densitree v.2.2.5 (Bouckaert & Heled, 2014), removing the first 10% of the trees as burn-in. A maximum credibility tree was generated using TreeAnnotator v.1.7.5 (Drummond, Suchard, Xie, & Rambaut, 2012).
Demographic history
A coalescent-based simulation approach was implemented in Fastsimcoal2 based on the site frequency spectrum (SFS) of nDNA data (Excoffier, Dupanloup, Huerta-Sanchez, Sousa & Foll, 2013) to further investigate the demographic history of the three main lineages identified in clustering analysis and species tree inference (Center, East and South, see results section). Initially, to further evaluate the consistency of phylogenetic relationships among C. cactorum populations inferred in SNAPP, models representing the three possible topologies (A, B and C) were considered in each one of the subsequent models of gene flow (Figure S1). For each topology, divergence times and contemporary effective population sizes were estimated, as well as alternative scenarios of gene flow. Models 1-3 considered no gene flow among lineages; models 4-6 contemplated scenarios of symmetric gene flow among lineages, while models 7-9 considered scenarios of asymmetric gene flow among lineages. For these analyses, 20 individuals were selected with the highest ancestry coefficient (> 0.85) according to sNMF (e.g., Ortego, Céspedes, Millán & Green, 2021). The folded joint SFS was calculated considering a single SNP per locus using the Python script written by Isaac Overcast and available at GitHub (https://github.com/isaacovercast/easySFS). To remove missing data and minimize errors with allele frequency estimates, each genetic cluster was downsampled to 16 individuals yielding a total of 2,565 variable sites. Assuming that invariable sites were not considered in the SFS calculation, we used the “removeZeroSFS” option in Fastsimcoal2 and fixed the effective population size of one of the demes (South lineage, NESOUTH) to enable the estimation of other parameters with Fastsimcoal2 (Excoffier et al., 2013). NESOUTH was calculated according to the equation NE= π/4μ (Lynch & Conery, 2003). Nucleotide diversity (π=0.004) was estimated using variant and invariant sites with DNAsp v6 (Rozas et al., 2017). The mutation rate (μ) was considered to be 2.9 x 10-9 substitutions per site per generation, previously estimated for the butterfly Heliconius melpomene (Lepidoptera: Nymphalidae) (Keightley et al., 2015). Each model was run with 50 replicates, considering 100,000–250,000 simulations for the calculation of the composite likelihood, 10–40 expectation-conditional maximization (ECM) cycles, and a stopping criterion of 0.001. Once maximum likelihood was estimated per run, the best demographic model was selected according to Akaike’s information criterion (AIC). AIC values were rescaled in terms of AIC differences (Δi) according to the formula: Δi = AICi − AICmin. A model with a ΔAIC value of 0 and the highest AIC weight (ωi) served as the best model. A parametric bootstrapping approach was used to construct 95% confidence intervals of the estimated parameters running 100 bootstrap replicates using initialized values from the best model (Excoffier et al., 2013).
Environmental niche modeling
An environmental niche modeling approach was implemented to predict contemporary and historical geographic distributions of suitable areas for the cactus moth in its native range. Environmental niche models (ENMs) were used to determine the impact of suitable habitat distribution on C. cactorum diversification as is detailed below in the landscape genomic analysis section. ENMs were built using MaxEnt v.3.4.1 (Phillips, Anderson, & Schapire, 2006) and the bioclimatic data available in CHELSA v.1.2 database (https://chelsa-climate.org/bioclim/). Model parameters in MaxEnt were selected and optimized using the kuenm R package (Cobos, Peterson, Barve & Osorio-Olvera, 2019). For both temporal bioclimatic conditions, present-day and Last Glacial Maximum (LGM, ca. 21 kya), we evaluated 15 environmental variables retrieved at 30 arc-sec (~1 km) of resolution. Four bioclimatic variables (Bio 8, 9, 18 and 19) were discarded for having artificial breaks (Oliveira et al., 2020). Highly correlated variables (R >0.9) according to variance inflation factor criterion were excluded for downstream analyses resulting in a final dataset of 6 bioclimatic variables (Bio 2, 3, 5, 13, 14 and 15). Suitability maps during the LGM were obtained by projecting the present-day ENM onto LGM bioclimatic conditions derived from the Community Climate System Model (CCSM4; Braconnot et al. 2007) resulting in two suitability maps based on bioclimatic variables data (Set 1, CurrentENV and LGMENV from hereafter).
Additionally, we constructed ENMs for five of the six native host species of C. cactorum considering the aforementioned approach for variable selection and model building. ENM for O. penicilligera could not be implemented due to limited occurrence data. The resulting ENMs obtained for current and LGM conditions from each host species were used as variables to build an additional ENM forC. cactorum (Set 2, CurrentHOST and LGMHOST from hereafter). A third ENM was constructed forC. cactorum considering the input of both the bioclimatic variables (Set 1) and the host plant ENMs (Set 2) (Set 3, CurrentENV-HOST and LGMENV-HOST from hereafter). Occurrence data for the host species and C. cactorumwere obtained from field surveys made from 2007 to 2019 and complemented, after detailed curation, with distribution information available at GBIF (www.gbif.org). Redundant occurrences (e.g. points occurring within 1,500 km2) were excluded using spThin R package (Aiello‐Lammens, Boria, Radosavljevic, Vilela & Anderson, 2015). After thinning occurrence data, 81 records for C. cactorum locations and their associated occurance of 205 host species remained [O. quimilo (50), O. megapotamica (47), O. rioplatense (39),O. elata (38), and O. anacantha (31)] and were used to conduct ENMs. Model performance for each scenario was evaluated independently based on statistical significance (Partial ROC), omission rates (OR), and the Akaike information criterion corrected for small sample sizes (AICc) using the kuenm R package (Cobos et al. 2019).
Landscape genomic analyses
A landscape genomic approach was implemented to study potential factors that could explain patterns of genetic differentiation within C. cactorum . As a measure of genetic differentiation, pairwise FST estimates were derived from genome-wide SNP data using the Weir and Cockerham (1984) method with the StAMPP R package (Pembleton, Cogan & Forster, 2013), using 9,999 bootstrap replicates. We evaluated several plausible scenarios of population connectivity based on historical and contemporary spatial and ecological data.
Isolation by resistance scenarios (IBR) were tested by calculating resistance surfaces based on the suitability maps obtained from ENMs for present-day and LGM conditions considering different subsets of factors: (i) only climatic variables (CurrentENV and LGMENV); (ii) climate-based habitat suitability maps for host species (CurrentHOST and LGMHOST); and (iii) combining climatic variables and climate-based habitat suitability maps for host species (CurrentHOST-ENV and LGMHOST-ENV). Resistance distances for all pairs of populations were calculated using an eight-neighbor cell connection scheme in Circuitscape v.5 (Hall et al., 2021) through Julia v.1.5.2 (https://julialang.org/). We also calculated resistance distances based on a “flat landscape” where all cells have an equal resistance value (=1) representing a null model of isolation by resistance (IBRNULL). Resulting between-population resistance matrices were used as input for downstream statistical analyses.
Climatic dissimilarities were estimated between populations to evaluate an isolation by environment scenario (IBECLI). Climatic data was extracted from the 15 CHELSA bioclimatic variables for each of the 28 sampling sites, as well as from 500 random points covering our study area to avoid potential biases resulting from only considering conditions at focal sites. Due to collinearity among the bioclimatic variables, we ran a PCA using the ade4 R package and summarized the environmental variation in the three first axes accounting for ca. 82% (PC1=43.79%; PC2=23.96%; PC3=14.31%) of total variation. The environmental dissimilarity matrix was obtained by calculating the Euclidean distances for PC scores between pairs of sampling sites (Ortego et al., 2021).
An isolation by distance scenario was also evaluated using weighted topographic distances (IBDWTD), which incorporate an additional overland distance covered by an organism due to changes in elevation imposed by the topography. The IBDWTD distance matrix was calculated on a digital elevation model (DEM) at ~1 km of resolution retrieved from WorldClim v.2.1 dataset (Fick & Hijmans, 2017) using the TopoWeightedDist function implemented in the topoDistance R package (Wang, 2020). We assumed a linear function to weight aspect changes (hFunction parameter) and an exponential function to weight the slope (vFunction parameter) (e.g. Noguerales et al., 2021).
Relationships between explanatory distance matrices and genetic differentiation between populations (FST) were evaluated using univariate and multiple matrix regressions with randomization using the function MMRR (Wang, 2013) as implemented in R. An initial full model was constructed considering all significant explanatory terms identified previously in univariate analysis, and a final best-fit model was selected using a backward-stepwise procedure by progressively removing non significant variables until all retained terms within the model were significant (Ortego, Gugger & Sork, 2014). The result was the minimal most adequate model for explaining the variability in the response variable, where only the significant explanatory terms were retained.
Population genetic diversity and climate/habitat stability
Given that suitable environments tend to support larger populations (Carnaval, Hickerson, Haddad, Rodrigues & Moritz, 2009; Soley-Guardia, Carnaval & Anderson, 2019), we tested whether the regions exhibiting high genetic diversity were those where suitable climatic conditions remained stable through time, from the LGM (~21 Kya) to the present. To test this hypothesis, environmental stability maps were constructed by averaging current and LGM suitability maps. For each sampling site, values of climate/habitat stability were extracted for each of the three scenarios considering (i) only climatic variables (StabilityENV), (ii) climate-based habitat suitability maps for host species (StabilityHOST), and (iii) combining climatic variables and climate-based habitat suitability maps for host species (StabilityENV-HOST). Raster calculations were conducted using the R raster package (Hijmans & van Etten, 2016).
Nuclear and mitochondrial diversity were characterized for each of the 28 sampling sites. Genetic diversity estimates were normalized to three individuals per sampling location to avoid potential bias resulting from uneven population sample size (range: 3-11 individuals). For nDNA, expected heterozygosities (H E) and nucleotide diversity (π) were calculated using the diveRsity R package (Keenan, McGinnity, Cross, Crozier & Prodöhl, 2013), and DNAsp v6 (Rozas et al., 2017), respectively. We also estimated π and haplotype diversity (Hd) for mtDNA data using DNAsp v6 (Rozas et al., 2017). Correlations between stability values and estimates of nuclear (H E, π) and mtDNA (HD, π) population genetic diversity were tested using linear regressions. Longitude and latitude were also included as explanatory factors to account for potential geographical clines of genetic diversity (Guo, 2012).
Results
Genomic data
We successfully genotyped 138 individuals of C. cactorum using ddRAD sequencing which were representative of populations feeding on both exotic O. ficus-indica (71 individuals), and native host plants O. megapotamica (34), O. rioplatense (11), O. elata (10), O. bonaerensis (4), O. penicilligera (3),O. quimilo (3), and O. anacantha (2). After filtering steps, the average number of paired-end reads retained per individual was 1,859,116, of which an average of 1,253,277 reads mapped to theC. cactorum reference genome. After discarding loci in linkage disequilibrium and under selection, we recovered a total of 3,506 unlinked sequence loci with an average coverage of 30X which contained 17,084 biallelic SNPs (Table S2). A total of 143 individuals from the same 28 locations were successfully sequenced for a fragment of 790 bp of the COI gene. The analysis of mtDNA data revealed 66 haplotypes, 171 variable sites, 97 of which were parsimony-informative sample-wide.
Assessing population genomic structure
Clustering analysis with sNMF indicated K=6 to be the most likely number of genetic clusters: North, Central, Western, Southeast, Southwest and East. However, these analyses revealed a certain degree of admixture among clusters; some Northern and Central individuals exhibited a relatively high degree of admixture with both the Western and Eastern clusters (Figure 2A-C). Interestingly, Northern, Central and Western populations appeared to form a unique cluster (Central lineage from hereafter) when assuming K=3, whereas both Southern (South lineage from hereafter) and Eastern (East lineage from hereafter) populations remained as separated panmictic groups. With K=4, the Southern population appeared subdivided into two geographic units; Southeastern and Southwestern. Finally, with K=5, Northern populations appeared separated from the Central population (Figure S2). Principal component analysis (PCA) reveals the Eastern lineage to appear separated from the Central lineage along PC1, whereas the South lineage can be distinguished from the Central and Eastern lineages along PC2 (Figure 2D). Additional PCA structuring was also evident, mainly in the Southern lineage in agreement with the clustering pattern observed with K=4 in sNMF. Moreover, the PCA revealed an apparent admixture signal between the Central and Eastern lineages.
The median-joining network obtained with mtDNA sequence data revealed strong spatial genetic structure consistent with the clusters identified with nDNA data (Figure 2E). The Central haplogroup was one of the most frequent haplotypes and was separated by seven mutational steps from the Eastern haplogroup, whereas four mutational steps separated the Central and Northern haplogroups. Southeastern and Southwestern haplogroups appeared close to the Central haplogroup and shared a common haplotype. Despite spatial genetic structure, a few Central haplotypes were distributed in other regions (three central haplotypes in the East, two in the Southeast and one in the North). A frequent haplotype in the Northern haplogroup was also detected in the central region and a haplotype from the East was observed in the north region.
Species tree reconstruction
Phylogenetic relationships inferred in SNAPP were consistent with the hierarchical spatial genetic structure observed with sNMF (Figure 2B-C; Figure S2). The most ancestral split corresponded to the separation of South and East lineages from the Central lineage. Southwestern and Southeastern clusters, as well as, West, Central and North clusters diverged subsequently from their respective lineages, as observed in clustering analyses. The maximum clade credibility tree indicated a high posterior probability (> 0.9) supporting all nodes in the species tree (Figure 2B).
Demographic inference using coalescent-based simulations
The scenario considering full asymmetric interpopulation gene flow (model 8) was the most supported among the nine demographic models tested in Fastsimcoal2 (Table S3). Under this model, divergence among the three main lineages of C. cactorum was estimated to have occurred during the Late Pleistocene (Figure 3; Table 1). Specifically, this model describes an initial split (TDIV1) of the Central lineage ~75 kya (considering two generations per year in native host species, as reported by Varone et al., 2014) and a more recent split (TDIV2) separating the South and East lineages occurring ~19 kya. Demographic simulations estimated an effective population size close to 250,000 for the Central lineage (NECENTER), which was six times higher than the value obtained for the East lineage (NEEAST=41,551). Estimates of gene flow rates varied among lineages with the highest rate of contemporary gene flow estimated to occur from the Central to the East lineage (MCE=1.50 x 10-5) and the lowest from the Central lineage to the South lineages (MCS=2.71 x 10-8) (Table 1).
Ecological niche modeling through time
Ecological niche models (ENMs) for both the focal species (C. cactorum ) and its host species presented relatively high AUC scores (Table S4 and S5), indicating an overall good model performance. Suitability maps of host species were concordant with their respective current distributions (Figure S3). The three models for C. cactorum (Set 1: CurrentENV, Set 2: CurrentHOST, Set 3: CurrentENV-HOST) provided similar predicted distributions, with no major differences (Figure S4). However, the most supported model for C. cactorumaccording to AIC values was that considering as input variables the predicted distribution of the host plants (Set 2), followed by the model including these and the six bioclimatic variables (Set 3). Additional information on the results of ENMs for both C. cactorum and host species are detailed in Tables S4 and S5. The present-day distribution inferred by the most supported scenario (CurrentHOST) was in line with the contemporary range of C. cactorum . Under this model, higher suitability areas were predicted in both Chaco and north of Pampa biogeographic regions whereas areas of lower suitability were predicted in the southern portions of the Pampas and the Monte biogeographic regions (Figure 1B). Other biogeographic provinces (Yungas, Puna and Prepuna) had extremely low values of suitability forC. cactorum . Projections of ENM to LGM predicted high environmental suitability along the Chaco biogeographic province to the most southern portion of Pampas, with areas of low suitability located in eastern Argentina, a confluence zone of La Plata river basin (Figure 1C). Comparing within each scenario across time, from the LGM to the present, areas of high environmental stability were predicted in North and Central Argentina (the arid and semi-arid regions of Chaco), with low environmental stability predicted in southern Argentina.
Landscape genomic analyses
Estimates of genetic differentiation (FST) among sampling sites ranged from 0.023 to 0.448 for nDNA data (Table S6). Univariate matrix regressions indicated that nuclear genetic differentiation was significantly correlated with all distance matrices except with CurrentENV and CurrentENV-HOST (Table S7). Yet, only LGMENV-HOST was significantly retained in the best-fit model after the backward-stepwise procedure (Table 2).