Population sampling and DNA extraction
Between 2018 and 2019, we collected 140 individuals of C.
cactorum from 28 sampling sites spanning a broad range of its native
distribution in southern South America (Figure 1; Table S1). Individuals
were collected directly from host plants, including six native taxa,O. megapotamica, O. rioplatense, O. elata, O. penicilligera, O.
quimilo, and O. anacantha , and the exotic O. ficus-indica(Varone et al., 2014). Given that adult moths have nocturnal
activity and a short lifespan, only eggsticks and I-IV instar larvae
were collected. Collected eggsticks were maintained in the lab under
constant temperature (25°C) and humidity (70%). After eggs hatched,
larvae were reared on O. ficus-indica until the IV instar stage.
To avoid the inclusion of siblings, only a single individual per sampled
plant and eggstick was used for genetic analysis. Larvae were
individually transferred to vials containing 100% ethanol and stored at
–20°C until DNA extraction. Before DNA extraction, individual larvae
were dissected to remove any food residue, and ~30 mg of
clean tissue was used for DNA extraction using the Qiagen DNeasy Blood
& Tissue Kit following manufacturer’s instructions (Qiagen, Inc.).
Genomic libraries
preparation, filtering and assembling
Individuals were processed into two genomic libraries following the
double digest restriction-site Associated DNA (ddRAD) procedure
described in Peterson, Weber, Kay, Fisher and Hoekstra (2012). Genomic
DNA was fragmented using NspI and MboI restriction
enzymes, and then purified and ligated to barcoded adapters. Samples
were pooled within multiplexing batches and bead purified. Libraries
were sequenced in paired-end 125 bp mode on a HiSeq2500 Illumina
instrument.
Demultiplexed raw reads were checked for quality using fastqc v0.11.15
(Andrews et al., 2010), and outputs were compiled and summarized on
MultiQC v1.10.1 (Ewels, Magnusson, Lundin & Käller, 2016). Reads were
assembled with IPYRAD v.0.9.59 (Eaton & Overcast, 2020) using the
strictest filter for Illumina adapters. A reference based assembly
method was implemented for mapping the filtered reads against theC. cactorum draft genome (NCBI accession number:
JADGIL000000000). As this genome was assembled from an adult male (clean
DNA can be obtained from males, as they do not feed), it serves as a
filter for exogenous reads derived from larval food (Poveda-Martinez et
al., 2022). In subsequent IPYRAD assembly steps, we allowed 20% of SNPs
per RAD locus, and shared heterozygous sites occurring across a maximum
of 25% of samples. A minimum of 90% was set for the minimum number of
samples scored per locus. We used vcftools v1.15 (Danecek et al., 2011)
to remove SNPs with minimum allele frequency lower than 3%, missing
data per site across individuals exceeding 25%, and to keep SNPs with
read depths between 6X to 100X. Individuals with more than 20% missing
data were excluded from further analyses. In order to prune SNPs in
linkage disequilibrium, PLINK v1.9 (Purcell et al., 2007) was used with
a window size of 50 bp, window shift of 5 and VIF threshold of 2. SNPs
under selection were identified using Bayescan v.2.1 (Foll & Gaggiotti,
2008) and excluded from the dataset. A fragment of the mitochondrial
gene encoding for the Cytochrome oxidase subunit I (COI ) (mtDNA)
was amplified using primers C1-J-2183 (Simon et al., 1994) and PatII
(Caterino & Sperling, 1999). Amplifications, PCR-product purification
and sequencing were performed as detailed in Poveda-Martínez et al.,
(2022).
Assessing genomic
population structure
Population genetic structure was inferred using genome-wide SNP data
(nDNA hereafter) and the sparse non-negative matrix factorization
approach (sNMF) (Frichot, Mathieu, Trouillon, Bouchard & François,
2014), as implemented in R using the LEA package (Frichot & François,
2015). The number of genetic clusters that best described our data was
assessed with 100 repetitions for each possible K value (from K=1 to
K=10) and using the cross-entropy criterion. In addition to sNMF,
genetic structure was approximated by using the major axes of genomic
variation obtained from a Principal Component Analysis (PCA), as
implemented in the hierfstat R package (Goudet, 2005). We used mtDNA
data to construct a haplotype network using the Median Joining algorithm
(Bandelt, Forster, & Röhl, 1999) as implemented in PopArt v.1.7.1
(Leigh & Bryant, 2015).
Phylogenetic analyses
Phylogenetic relationships were reconstructed among the main genetic
groups as inferred in sNMF using nDNA data and the coalescent-based
method implemented in SNAPP (Bryant, Bouckaert, Felsenstein, Rosenberg,
& RoyChoudhury, 2012). Due to computational burden, SNAPP analyses were
run including only five individuals per genetic cluster, selecting those
with an ancestry coefficient higher than 0.85, according to sNMF. One
SNP per locus was randomly selected resulting in a new matrix of 2,086
SNPs shared across tips. The default model parameters were used in SNAPP
for U and V equal to one, and the analysis was run for 5,000,000 MCMC
generations, sampling every 1000 generations in Beast v.2.5.2 (Bouckaert
et al., 2014). The complete set of trees was visualized in Densitree
v.2.2.5 (Bouckaert & Heled, 2014), removing the first 10% of the trees
as burn-in. A maximum credibility tree was generated using TreeAnnotator
v.1.7.5 (Drummond, Suchard, Xie, & Rambaut, 2012).
Demographic history
A coalescent-based simulation approach was implemented in Fastsimcoal2
based on the site frequency spectrum (SFS) of nDNA data (Excoffier,
Dupanloup, Huerta-Sanchez, Sousa & Foll, 2013) to further investigate
the demographic history of the three main lineages identified in
clustering analysis and species tree inference (Center, East and South,
see results section). Initially, to further evaluate the consistency of
phylogenetic relationships among C. cactorum populations inferred
in SNAPP, models representing the three possible topologies (A, B and C)
were considered in each one of the subsequent models of gene flow
(Figure S1). For each topology, divergence times and contemporary
effective population sizes were estimated, as well as alternative
scenarios of gene flow. Models 1-3 considered no gene flow among
lineages; models 4-6 contemplated scenarios of symmetric gene flow among
lineages, while models 7-9 considered scenarios of asymmetric gene flow
among lineages. For these analyses, 20 individuals were selected with
the highest ancestry coefficient (> 0.85) according to sNMF
(e.g., Ortego, Céspedes, Millán & Green, 2021). The folded joint SFS
was calculated considering a single SNP per locus using the Python
script written by Isaac Overcast and available at GitHub
(https://github.com/isaacovercast/easySFS).
To remove missing data and minimize errors with allele frequency
estimates, each genetic cluster was downsampled to 16 individuals
yielding a total of 2,565 variable sites. Assuming that invariable sites
were not considered in the SFS calculation, we used the
“removeZeroSFS” option in Fastsimcoal2 and fixed the effective
population size of one of the demes (South lineage,
NESOUTH) to enable the estimation of other parameters
with Fastsimcoal2 (Excoffier et al., 2013). NESOUTH was
calculated according to the equation NE= π/4μ (Lynch & Conery, 2003).
Nucleotide diversity (π=0.004) was estimated using variant and invariant
sites with DNAsp v6 (Rozas et al., 2017). The mutation rate (μ) was
considered to be 2.9 x 10-9 substitutions per site per
generation, previously estimated for the butterfly Heliconius
melpomene (Lepidoptera: Nymphalidae) (Keightley et al., 2015). Each
model was run with 50 replicates, considering 100,000–250,000
simulations for the calculation of the composite likelihood, 10–40
expectation-conditional maximization (ECM) cycles, and a stopping
criterion of 0.001. Once maximum likelihood was estimated per run, the
best demographic model was selected according to Akaike’s information
criterion (AIC). AIC values were rescaled in terms of AIC differences
(Δi) according to the formula: Δi = AICi − AICmin. A model with a ΔAIC
value of 0 and the highest AIC weight (ωi) served as the best model. A
parametric bootstrapping approach was used to construct 95% confidence
intervals of the estimated parameters running 100 bootstrap replicates
using initialized values from the best model (Excoffier et al., 2013).
Environmental niche
modeling
An environmental niche modeling approach was implemented to predict
contemporary and historical geographic distributions of suitable areas
for the cactus moth in its native range. Environmental niche models
(ENMs) were used to determine the impact of suitable habitat
distribution on C. cactorum diversification as is detailed below
in the landscape genomic analysis section. ENMs were built using MaxEnt
v.3.4.1 (Phillips, Anderson, & Schapire, 2006) and the bioclimatic data
available in CHELSA v.1.2 database
(https://chelsa-climate.org/bioclim/).
Model parameters in MaxEnt were selected and optimized using the kuenm R
package (Cobos, Peterson, Barve & Osorio-Olvera, 2019). For both
temporal bioclimatic conditions, present-day and Last Glacial Maximum
(LGM, ca. 21 kya), we evaluated 15 environmental variables retrieved at
30 arc-sec (~1 km) of resolution. Four bioclimatic
variables (Bio 8, 9, 18 and 19) were discarded for having artificial
breaks (Oliveira et al., 2020). Highly correlated variables (R
>0.9) according to variance inflation factor criterion were
excluded for downstream analyses resulting in a final dataset of 6
bioclimatic variables (Bio 2, 3, 5, 13, 14 and 15). Suitability maps
during the LGM were obtained by projecting the present-day ENM onto LGM
bioclimatic conditions derived from the Community Climate System Model
(CCSM4; Braconnot et al. 2007) resulting in two suitability maps based
on bioclimatic variables data (Set 1, CurrentENV and
LGMENV from hereafter).
Additionally, we constructed ENMs for five of the six native host
species of C. cactorum considering the aforementioned approach
for variable selection and model building. ENM for O.
penicilligera could not be implemented due to limited occurrence data.
The resulting ENMs obtained for current and LGM conditions from each
host species were used as variables to build an additional ENM forC. cactorum (Set 2, CurrentHOST and
LGMHOST from hereafter). A third ENM was constructed forC. cactorum considering the input of both the bioclimatic
variables (Set 1) and the host plant ENMs (Set 2) (Set 3,
CurrentENV-HOST and LGMENV-HOST from
hereafter). Occurrence data for the host species and C. cactorumwere obtained from field surveys made from 2007 to 2019 and
complemented, after detailed curation, with distribution information
available at GBIF (www.gbif.org).
Redundant occurrences (e.g. points occurring within 1,500
km2) were excluded using spThin R package
(Aiello‐Lammens, Boria, Radosavljevic, Vilela & Anderson, 2015). After
thinning occurrence data, 81 records for C. cactorum locations
and their associated occurance of 205 host species remained [O.
quimilo (50), O. megapotamica (47), O. rioplatense (39),O. elata (38), and O. anacantha (31)] and were used to
conduct ENMs. Model performance for each scenario was evaluated
independently based on statistical significance (Partial ROC), omission
rates (OR), and the Akaike information criterion corrected for small
sample sizes (AICc) using the kuenm R package (Cobos et al. 2019).
Landscape genomic analyses
A landscape genomic approach was implemented to study potential factors
that could explain patterns of genetic differentiation within C.
cactorum . As a measure of genetic differentiation, pairwise
FST estimates were derived from genome-wide SNP data
using the Weir and Cockerham (1984) method with the StAMPP R package
(Pembleton, Cogan & Forster, 2013), using 9,999 bootstrap replicates.
We evaluated several plausible scenarios of population connectivity
based on historical and contemporary spatial and ecological data.
Isolation by resistance scenarios (IBR) were tested by calculating
resistance surfaces based on the suitability maps obtained from ENMs for
present-day and LGM conditions considering different subsets of factors:
(i) only climatic variables (CurrentENV and
LGMENV); (ii) climate-based habitat suitability maps for
host species (CurrentHOST and LGMHOST);
and (iii) combining climatic variables and climate-based habitat
suitability maps for host species (CurrentHOST-ENV and
LGMHOST-ENV). Resistance distances for all pairs of
populations were calculated using an eight-neighbor cell connection
scheme in Circuitscape v.5 (Hall et al., 2021) through Julia v.1.5.2
(https://julialang.org/). We also
calculated resistance distances based on a “flat landscape” where all
cells have an equal resistance value (=1) representing a null model of
isolation by resistance (IBRNULL). Resulting
between-population resistance matrices were used as input for downstream
statistical analyses.
Climatic dissimilarities were estimated between populations to evaluate
an isolation by environment scenario (IBECLI). Climatic
data was extracted from the 15 CHELSA bioclimatic variables for each of
the 28 sampling sites, as well as from 500 random points covering our
study area to avoid potential biases resulting from only considering
conditions at focal sites. Due to collinearity among the bioclimatic
variables, we ran a PCA using the ade4 R package and summarized the
environmental variation in the three first axes accounting for ca. 82%
(PC1=43.79%; PC2=23.96%; PC3=14.31%) of total variation. The
environmental dissimilarity matrix was obtained by calculating the
Euclidean distances for PC scores between pairs of sampling sites
(Ortego et al., 2021).
An isolation by distance scenario was also evaluated using weighted
topographic distances (IBDWTD), which incorporate an
additional overland distance covered by an organism due to changes in
elevation imposed by the topography. The IBDWTD distance
matrix was calculated on a digital elevation model (DEM) at
~1 km of resolution retrieved from WorldClim v.2.1
dataset (Fick & Hijmans, 2017) using the TopoWeightedDist function
implemented in the topoDistance R package (Wang, 2020). We assumed a
linear function to weight aspect changes (hFunction parameter) and an
exponential function to weight the slope (vFunction parameter) (e.g.
Noguerales et al., 2021).
Relationships between explanatory distance matrices and genetic
differentiation between populations (FST) were evaluated
using univariate and multiple matrix regressions with randomization
using the function MMRR (Wang, 2013) as implemented in R. An initial
full model was constructed considering all significant explanatory terms
identified previously in univariate analysis, and a final best-fit model
was selected using a backward-stepwise procedure by progressively
removing non significant variables until all retained terms within the
model were significant (Ortego, Gugger & Sork, 2014). The result was
the minimal most adequate model for explaining the variability in the
response variable, where only the significant explanatory terms were
retained.
Population genetic
diversity and climate/habitat stability
Given that suitable environments tend to support larger populations
(Carnaval, Hickerson, Haddad, Rodrigues & Moritz, 2009; Soley-Guardia,
Carnaval & Anderson, 2019), we tested whether the regions exhibiting
high genetic diversity were those where suitable climatic conditions
remained stable through time, from the LGM (~21 Kya) to
the present. To test this hypothesis, environmental stability maps were
constructed by averaging current and LGM suitability maps. For each
sampling site, values of climate/habitat stability were extracted for
each of the three scenarios considering (i) only climatic variables
(StabilityENV), (ii) climate-based habitat suitability
maps for host species (StabilityHOST), and (iii)
combining climatic variables and climate-based habitat suitability maps
for host species (StabilityENV-HOST). Raster
calculations were conducted using the R raster package (Hijmans & van
Etten, 2016).
Nuclear and mitochondrial diversity were characterized for each of the
28 sampling sites. Genetic diversity estimates were normalized to three
individuals per sampling location to avoid potential bias resulting from
uneven population sample size (range: 3-11 individuals). For nDNA,
expected heterozygosities (H E) and nucleotide
diversity (π) were calculated using the diveRsity R package (Keenan,
McGinnity, Cross, Crozier & Prodöhl, 2013), and DNAsp v6 (Rozas et al.,
2017), respectively. We also estimated π and haplotype diversity
(Hd) for mtDNA data using DNAsp v6 (Rozas et al., 2017).
Correlations between stability values and estimates of nuclear
(H E, π) and mtDNA (HD, π)
population genetic diversity were tested using linear
regressions. Longitude and latitude were also included as
explanatory factors to account for potential geographical clines of
genetic diversity (Guo, 2012).
Results
Genomic data
We successfully genotyped 138 individuals of C. cactorum using
ddRAD sequencing which were representative of populations feeding on
both exotic O. ficus-indica (71 individuals), and native host
plants O. megapotamica (34), O. rioplatense (11), O.
elata (10), O. bonaerensis (4), O. penicilligera (3),O. quimilo (3), and O. anacantha (2). After filtering
steps, the average number of paired-end reads retained per individual
was 1,859,116, of which an average of 1,253,277 reads mapped to theC. cactorum reference genome. After discarding loci in linkage
disequilibrium and under selection, we recovered a total of 3,506
unlinked sequence loci with an average coverage of 30X which contained
17,084 biallelic SNPs (Table S2). A total of 143 individuals from the
same 28 locations were successfully sequenced for a fragment of 790 bp
of the COI gene. The analysis of mtDNA data revealed 66
haplotypes, 171 variable sites, 97 of which were parsimony-informative
sample-wide.
Assessing population
genomic structure
Clustering analysis with sNMF indicated K=6 to be the most likely number
of genetic clusters: North, Central, Western, Southeast, Southwest and
East. However, these analyses revealed a certain degree of admixture
among clusters; some Northern and Central individuals exhibited a
relatively high degree of admixture with both the Western and Eastern
clusters (Figure 2A-C). Interestingly, Northern, Central and Western
populations appeared to form a unique cluster (Central lineage from
hereafter) when assuming K=3, whereas both Southern (South lineage from
hereafter) and Eastern (East lineage from hereafter) populations
remained as separated panmictic groups. With K=4, the Southern
population appeared subdivided into two geographic units; Southeastern
and Southwestern. Finally, with K=5, Northern populations appeared
separated from the Central population (Figure S2). Principal component
analysis (PCA) reveals the Eastern lineage to appear separated from the
Central lineage along PC1, whereas the South lineage can be
distinguished from the Central and Eastern lineages along PC2 (Figure
2D). Additional PCA structuring was also evident, mainly in the Southern
lineage in agreement with the clustering pattern observed with K=4 in
sNMF. Moreover, the PCA revealed an apparent admixture signal between
the Central and Eastern lineages.
The median-joining network obtained with mtDNA sequence data revealed
strong spatial genetic structure consistent with the clusters identified
with nDNA data (Figure 2E). The Central haplogroup was one of the most
frequent haplotypes and was separated by seven mutational steps from the
Eastern haplogroup, whereas four mutational steps separated the Central
and Northern haplogroups. Southeastern and Southwestern haplogroups
appeared close to the Central haplogroup and shared a common haplotype.
Despite spatial genetic structure, a few Central haplotypes were
distributed in other regions (three central haplotypes in the East, two
in the Southeast and one in the North). A frequent haplotype in the
Northern haplogroup was also detected in the central region and a
haplotype from the East was observed in the north region.
Species tree
reconstruction
Phylogenetic relationships inferred in SNAPP were consistent with the
hierarchical spatial genetic structure observed with sNMF (Figure 2B-C;
Figure S2). The most ancestral split corresponded to the separation of
South and East lineages from the Central lineage. Southwestern and
Southeastern clusters, as well as, West, Central and North clusters
diverged subsequently from their respective lineages, as observed in
clustering analyses. The maximum clade credibility tree indicated a high
posterior probability (> 0.9) supporting all nodes in the
species tree (Figure 2B).
Demographic inference using
coalescent-based simulations
The scenario considering full asymmetric interpopulation gene flow
(model 8) was the most supported among the nine demographic models
tested in Fastsimcoal2 (Table S3). Under this model, divergence among
the three main lineages of C. cactorum was estimated to have
occurred during the Late Pleistocene (Figure 3; Table 1). Specifically,
this model describes an initial split (TDIV1) of the
Central lineage ~75 kya (considering two generations per
year in native host species, as reported by Varone et al., 2014) and a
more recent split (TDIV2) separating the South and East
lineages occurring ~19 kya. Demographic simulations
estimated an effective population size close to 250,000 for the Central
lineage (NECENTER), which was six times higher than the
value obtained for the East lineage (NEEAST=41,551).
Estimates of gene flow rates varied among lineages with the highest rate
of contemporary gene flow estimated to occur from the Central to the
East lineage (MCE=1.50 x 10-5) and the
lowest from the Central lineage to the South lineages
(MCS=2.71 x 10-8) (Table 1).
Ecological niche modeling
through time
Ecological niche models (ENMs) for both the focal species (C.
cactorum ) and its host species presented relatively high AUC scores
(Table S4 and S5), indicating an overall good model performance.
Suitability maps of host species were concordant with their respective
current distributions (Figure S3). The three models for C.
cactorum (Set 1: CurrentENV, Set 2:
CurrentHOST, Set 3: CurrentENV-HOST)
provided similar predicted distributions, with no major differences
(Figure S4). However, the most supported model for C. cactorumaccording to AIC values was that considering as input variables the
predicted distribution of the host plants (Set 2), followed by the model
including these and the six bioclimatic variables (Set 3). Additional
information on the results of ENMs for both C. cactorum and host
species are detailed in Tables S4 and S5. The present-day distribution
inferred by the most supported scenario (CurrentHOST)
was in line with the contemporary range of C. cactorum . Under
this model, higher suitability areas were predicted in both Chaco and
north of Pampa biogeographic regions whereas areas of lower suitability
were predicted in the southern portions of the Pampas and the Monte
biogeographic regions (Figure 1B). Other biogeographic provinces
(Yungas, Puna and Prepuna) had extremely low values of suitability forC. cactorum . Projections of ENM to LGM predicted high
environmental suitability along the Chaco biogeographic province to the
most southern portion of Pampas, with areas of low suitability located
in eastern Argentina, a confluence zone of La Plata river basin (Figure
1C). Comparing within each scenario across time, from the LGM to the
present, areas of high environmental stability were predicted in North
and Central Argentina (the arid and semi-arid regions of Chaco), with
low environmental stability predicted in southern Argentina.
Landscape genomic analyses
Estimates of genetic differentiation (FST) among
sampling sites ranged from 0.023 to 0.448 for nDNA data (Table S6).
Univariate matrix regressions indicated that nuclear genetic
differentiation was significantly correlated with all distance matrices
except with CurrentENV and
CurrentENV-HOST (Table S7). Yet, only
LGMENV-HOST was significantly retained in the best-fit
model after the backward-stepwise procedure (Table 2).