Inserts of DNA from extranuclear sources, such as organelles and microbes, are common in eukaryote nuclear genomes. However, sequence similarity between the nuclear and extranuclear DNA, and a history of multiple insertions, make the assembly of these regions challenging. Consequently, the number, sequence, and location of these vagrant DNAs cannot be reliably inferred from the genome assemblies of most organisms. We introduce two statistical methods to estimate the abundance of nuclear inserts even in the absence of a nuclear genome assembly. The first (intercept method) only requires low-coverage (<1x) sequencing data, as commonly generated for population studies of organellar and ribosomal DNAs. The second method additionally requires that a subset of the individuals carry extra-nuclear DNA with diverged genotypes. We validated our intercept method using simulations and by re-estimating the frequency of human NUMTs (nuclear mitochondrial inserts). We then applied it to the grasshopper Podisma pedestris, exceptional for both its large genome size and reports of numerous NUMT inserts, estimating that NUMTs make up 0.056% of the nuclear genome, equivalent to >500 times the mitochondrial genome size. We also re-analysed a museomics dataset of the parrot Psephotellus varius, obtaining an estimate of only 0.0043%, in line with reports from other species of bird. Our study demonstrates the utility of low-coverage high-throughput sequencing data for the quantification of nuclear vagrant DNAs. Beyond quantifying organellar inserts, these methods could also be used on endosymbiont-derived sequences. We provide an R implementation of our methods called “vagrantDNA” and code to simulate test datasets.
Understanding landscape connectivity has become a global priority for mitigating the impact of landscape fragmentation on biodiversity. Link-based methods traditionally rely on relating pairwise genetic distance between individuals or demes to their landscape distance (e.g., geographic distance, cost distance). In this study, we present an alternative to conventional statistical approaches to refine cost surfaces by adapting the Gradient Forest (GF) approach to produce a resistance surface. Used in community ecology, GF is an extension of random forest (RF), and has been implemented in genomic studies to model species genetic offset under future climatic scenarios. By design, this adapted method, resGF, has the ability to handle multiple environmental predictors and is not subject to traditional assumptions of linear models such as independence, normality and linearity. Using genetic simulations, resGF performance was compared to other published methods. In univariate scenarios, resGF was able to distinguish the true surface contributing to genetic diversity among competing surfaces better than the compared methods. In multivariate scenarios, the GF approach performed similarly to the other RF-based approach using least-cost transect analysis (LCTA). Additionally, two worked examples are provided using two previously published datasets. This machine learning algorithm has the potential to improve our understanding of landscape connectivity and can inform long-term biodiversity conservation strategies.
Age is an essential trait for understanding the ecology and management of wildlife. A conventional method of estimating age in wild animals is counting annuli formed in the cementum of teeth. This method has been used in bears despite some disadvantages, such as high invasiveness and the requirement for experienced observers. In this study, we established a novel age estimation method based on DNA methylation levels using blood collected from 49 brown bears of known ages living in both captivity and the wild. We performed bisulfite pyrosequencing and obtained methylation levels at 39 cytosine-phosphate-guanine (CpG) sites adjacent to 12 genes. The methylation levels of CpGs adjacent to four genes showed a significant correlation with age. The best model was based on DNA methylation levels at just four CpG sites adjacent to a single gene, SLC12A5, and it had high accuracy with a mean absolute error of 1.3 years and median absolute error of 1.0 year after leave-one-out cross-validation. This model represents the first epigenetic method of age estimation in brown bears, which provides benefits over tooth-based methods, including high accuracy, less invasiveness, and a simple procedure. Our model has the potential for application to other bear species, which will greatly improve ecological research, conservation, and management.
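The validation step described for the brown bear age model (leave-one-out cross-validation summarised by mean and median absolute error) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' model: it fits a simple linear regression on a single hypothetical methylation predictor rather than the four SLC12A5 CpG sites, the data values are invented, and the function names `fit_line` and `loocv_abs_errors` are placeholders.

```python
from statistics import mean, median


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def loocv_abs_errors(x, y):
    """Leave each sample out in turn, refit, and record its absolute error."""
    errors = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        predicted = slope * x[i] + intercept
        errors.append(abs(predicted - y[i]))
    return errors


if __name__ == "__main__":
    # Hypothetical methylation levels (%) and known ages (years).
    methylation = [12.0, 18.5, 25.0, 31.0, 40.5, 47.0]
    age = [1.0, 4.0, 7.0, 11.0, 15.0, 19.0]
    errs = loocv_abs_errors(methylation, age)
    print("mean absolute error:", mean(errs))
    print("median absolute error:", median(errs))
```

Because every prediction is made for a sample the model never saw during fitting, the two summary errors estimate performance on new individuals rather than goodness of fit.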
A large part of the soil protist diversity is missed in metabarcoding studies based on 0.25 g of soil environmental DNA (eDNA) and universal primers due to ca. 80 % co-amplification of non-target plants, animals and fungi. To overcome this problem, enrichment of the substrate used for eDNA extraction is an easily implemented option but its effect has not yet been tested. In this study, we evaluated the effect of a 150 µm mesh size filtration and sedimentation method to improve the recovery of protist eDNA, while reducing the co-extraction of plant, animal and fungal eDNA, using a set of contrasting forest and alpine soils from La Réunion, Japan, Spain and Switzerland. Biodiversity of the whole eukaryotic community was estimated with V4 18S rRNA metabarcoding and classical amplicon sequence variant calling. A 2-3-fold enrichment in shelled protists (Euglyphida, Arcellinida and Chrysophyceae) was observed at the sample level with the proposed method, with, at the same time, a 2-fold depletion of Fungi and a 3-fold depletion of Embryophyceae. Protist alpha diversity was slightly lower in filtered samples due to reduced coverage in Variosea and Sarcomonadea, but significant differences were observed in only one region. Beta diversity was mostly impacted by region and habitat, and explained the same variance in bulk soil and filtered samples. The increased resolution of soil protist diversity provided by the filtration-sedimentation method is a strong argument to include it in the standard preparation of any future soil for protist eDNA metabarcoding studies.
In the face of global biodiversity declines, surveys of beneficial and antagonistic arthropod diversity as well as the ecological services that they provide are increasingly important in both natural and agro-ecosystems. Conventional survey methods used to monitor these communities often require extensive taxonomic expertise and are time-intensive, potentially limiting their application in industries such as agriculture, where arthropods often play a critical role in productivity (e.g. pollinators, pests and predators). Environmental DNA (eDNA) metabarcoding of a novel substrate, crop flowers, may offer an accurate and high throughput alternative to aid in the detection of managed and unmanaged arthropod taxa (e.g. flower-visiting insects and potential pollinators). Here, we compared the arthropod communities detected with eDNA metabarcoding of flowers, from an agricultural species (Persea americana - ‘Hass’ avocado), with two conventional survey techniques: Digital Video Recording (DVR) devices and pan traps. In total, 80 eDNA flower samples, 96 hours of DVRs and 48 pan trap samples were collected. Across the three methods, 49 arthropod families were identified, of which 12 were unique to the eDNA dataset. Alpha diversity levels did not differ across the three survey methods although taxonomic composition varied significantly, with only 12% of arthropod families found to be common across all three methods. This study demonstrates that eDNA metabarcoding of flowers to detect visiting arthropods, although in a developmental stage, can complement traditional survey methods and increase the diversity of taxa detected with implications for both natural and agro-ecosystems.
Genotype-environment association (GEA) studies have the potential to identify the genetic basis of local adaptation in natural populations. Specifically, GEA approaches look for a correlation between allele frequencies and putatively selective features of the environment. Genetic markers with extreme evidence of correlation with the environment are presumed to be tagging the location of alleles that contribute to local adaptation. In this study, we propose a new method for GEA studies called the weighted-Z analysis (WZA) that combines information from closely linked sites into analysis windows in a way that was inspired by methods for calculating FST. We analyze simulations modelling local adaptation to heterogeneous environments to compare the WZA with existing methods. In the majority of cases we tested, the WZA either outperformed single-SNP based approaches or performed similarly. In particular, the WZA outperformed individual SNP approaches when a small number of individuals or demes was sampled. We applied the WZA to previously published data from lodgepole pine and identified candidate loci that were not found in the original study.
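The idea of combining per-SNP evidence within a window can be illustrated with a generic weighted-Z (Stouffer-type) combination. The sketch below is an assumption-laden illustration rather than the published WZA implementation: it assumes each SNP contributes a one-sided p-value from a single-SNP GEA test, and it assumes weights of the form p(1 - p) computed from each SNP's mean allele frequency, which is one natural reading of "inspired by methods for calculating FST". The function name `weighted_z` is a placeholder.

```python
import math
from statistics import NormalDist


def weighted_z(p_values, allele_freqs):
    """Combine per-SNP p-values from one window into a single weighted Z score.

    p_values     : one-sided p-values, each strictly between 0 and 1
    allele_freqs : mean allele frequency of each SNP (assumed weighting input)
    """
    std_normal = NormalDist()
    # Convert each p-value to a z score; small p-values give large positive z.
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    # Assumed weights: expected heterozygosity term p * (1 - p).
    weights = [f * (1.0 - f) for f in allele_freqs]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical window of four linked SNPs.
    window_p = [0.001, 0.02, 0.40, 0.75]
    window_freq = [0.45, 0.30, 0.05, 0.10]
    print("window score:", weighted_z(window_p, window_freq))
```

With this weighting, SNPs at intermediate frequency dominate the window score, while rare variants with noisy association statistics contribute little.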
There is growing interest in the role of structural variants (SVs) as drivers of local adaptation and speciation. From a biodiversity genomics perspective, the characterisation of genome-wide SVs provides an exciting opportunity to complement single nucleotide polymorphisms (SNPs). However, little is known about the impacts of SV discovery and genotyping strategies on the characterisation of genome-wide SV diversity within and among populations. Here, we explore a near whole-species resequence dataset, and long-read sequence data for a subset of highly represented individuals in the critically endangered kākāpō (Strigops habroptilus). We demonstrate that even when using a highly contiguous reference genome, different discovery and genotyping strategies can significantly impact the type, size and location of SVs characterised genome-wide. Further, we found that the mean number of SVs in each of two kākāpō lineages differed both within and across generations. These combined results suggest that genome-wide characterisation of SVs remains challenging at the population-scale. We are optimistic that increased accessibility to long-read sequencing and advancements in bioinformatic approaches including multi-reference approaches like genome graphs will alleviate at least some of the challenges associated with resolving SV characteristics below the species level. In the meantime, we address caveats, highlight considerations, and provide recommendations for the characterization of genome-wide SVs in biodiversity genomic research.
Understanding the evolutionary consequences of anthropogenic change is imperative for estimating long-term species resilience. While contemporary genomic data can provide us with important insights into recent demographic histories, investigating past change using present genomic data alone has limitations. In comparison, temporal genomics studies, defined herein as those that incorporate time series genomic data, leverage museum collections and repeated field sampling to directly examine evolutionary change. As temporal genomics is applied to more systems, species, and questions, best practices can be helpful guides to make the most efficient use of limited resources. Here, we conduct a systematic literature review to synthesize the effects of temporal genomics methodology on our ability to detect evolutionary changes. We focus on studies investigating recent change within the past 200 years, highlighting evolutionary processes that have occurred during the past two centuries of accelerated anthropogenic pressure. We first identify the most frequently studied taxa, systems, questions, and drivers, before highlighting overlooked areas where further temporal genomic studies may be particularly enlightening. Then, we provide guidelines for future study and sample designs while identifying key considerations that may influence statistical and analytical power. Our aim is to provide recommendations to a broad array of researchers interested in using temporal genomics in their work.
Although plastid genome (plastome) structure is highly conserved across most seed plants, investigations during the past two decades have revealed several disparately related lineages that have experienced substantial rearrangements. Most plastomes have two inverted repeat regions and two single-copy regions with few dispersed repeats. However, the plastomes of some taxa do harbor long repeat sequences (>300 bp). These long repeats make it difficult to assemble complete plastomes using short read data, leading to misassemblies and consensus sequences that have spurious rearrangements. Long read sequencing can potentially overcome these challenges. However, there is no consensus as to the most effective method for accurately assembling plastomes using long read data. Here, we generated a pipeline, plastid Genome Assembly Using Long-read data (ptGAUL), to address the problem of assembling plastomes using long read data from Oxford Nanopore Technologies (ONT) or Pacific Biosciences (PacBio) platforms. We demonstrated the efficacy of the ptGAUL pipeline using 16 published long read datasets. We showed that ptGAUL produces accurate and unbiased assemblies. Additionally, we applied ptGAUL to assemble four Juncus (Juncaceae) plastomes using ONT long reads. Our results revealed many long repeats and rearrangements in Juncus plastomes compared with basal lineages of Poales.
Dispersal is a crucial mechanism for living beings, allowing them to reach new resources such that populations and species can explore new environments. However, directly observing the dispersal mechanisms of widespread species can be costly or even impracticable, which is the case for mangrove trees. The influence of ocean currents on the mangroves’ propagules’ movement has been increasingly evident; however, few studies mechanistically relate the patterns of population distribution with the dispersal by oceanic currents under an integrated framework. Here, we evaluate the role of oceanic currents on dispersal and connectivity of Rhizophora mangle along the Southwest Atlantic. We inferred population genetic structure and migration rates based on single nucleotide polymorphisms, simulated the displacement of propagules along the region and tested our hypotheses with Mantel tests and redundancy analysis. We observed a two-population structure, north and south, which is corroborated by other studies with Rhizophora and other coastal plants. The inferred recent migration rates do not indicate gene flow between the sampled sites. Conversely, long-term migration rates were low across groups, with contrasting dispersal patterns within each one, which is consistent with long-distance dispersal events. Our hypothesis tests suggest that both isolation by distance and isolation by oceanography (derived from the oceanic currents) can explain the neutral genetic variation of R. mangle in the region. Our findings expand current knowledge of mangrove connectivity and highlight how the association of molecular methods with oceanographic simulations improves the interpretation power of the dispersal process, which has ecological and evolutionary implications.
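The Mantel test mentioned in the mangrove study compares two distance matrices, for example pairwise genetic distance against geographic or oceanographic distance, and assesses significance by permuting site labels. The following is a minimal generic sketch, not the authors' analysis: the matrices are hypothetical, the test is one-sided with a simple permutation p-value, and the helper names `pearson`, `upper_triangle` and `mantel` are placeholders.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def upper_triangle(matrix, order):
    """Flatten the upper triangle after relabelling sites by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(genetic, landscape, n_perm=999, seed=0):
    """Return (observed r, one-sided permutation p-value)."""
    rng = random.Random(seed)
    identity = list(range(len(genetic)))
    landscape_vec = upper_triangle(landscape, identity)
    r_obs = pearson(upper_triangle(genetic, identity), landscape_vec)
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # permute site labels of one matrix only
        if pearson(upper_triangle(genetic, perm), landscape_vec) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical sites along a coastline at positions 0, 1, 3, 7 and 12.
    pos = [0.0, 1.0, 3.0, 7.0, 12.0]
    geo = [[abs(a - b) for b in pos] for a in pos]
    # Hypothetical genetic distances loosely increasing with distance.
    gen = [[0.0 if a == b else 0.01 * abs(a - b) + 0.02 for b in pos] for a in pos]
    print(mantel(gen, geo))
```

Permuting rows and columns together, rather than shuffling individual matrix cells, preserves the dependence among distances that share a site, which is the reason a Mantel test is used instead of an ordinary correlation test.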
Genomics can play important roles in biodiversity conservation, especially for Extinct-in-the-Wild species where genetic factors can influence total extinction risk and probability of successful reintroductions. The Christmas Island blue-tailed skink (Cryptoblepharus egeriae) and Lister’s gecko (Lepidodactylus listeri) are two endemic reptile species that went extinct in the wild shortly after the introduction of a predatory snake. After a decade of management, captive populations have expanded from 66 skinks and 43 geckos to several thousand individuals; however, little is known about patterns of genetic variation in these species. Here, we use PacBio HiFi long-read and Hi-C sequencing to generate contiguous reference genomes for both species, including the XY chromosome pair in the skink. We then analyze patterns of genetic diversity to infer ancient demography and more recent histories of inbreeding. We observe high genome-wide heterozygosity in the blue-tailed skink (0.007) and Lister’s gecko (0.005), consistent with large historical population sizes. However, nearly 10% of the skink reference genome falls within long runs of homozygosity (ROH), resulting in homozygosity at all major histocompatibility complex (MHC) loci, whereas we detect only a single ROH in the gecko. We infer from the ROH lengths that related skinks may have established the captive populations. Despite a shared recent extinction in the wild, our results suggest important differences in species’ histories and implications for management. We show how reference genomes can provide evolutionary and conservation insights in the absence of resequencing data, and we provide a resource for future population-level and comparative genomic studies in reptiles.
Lifespan is a key attribute of a species’ life cycle and varies extensively among major lineages of animals. In fish, lifespan varies by several orders of magnitude, with reported values ranging from less than one year to approximately 400 years. Lifespan information is particularly useful for species management, as it can be used to estimate invasion potential, extinction risk and sustainable harvest rates. Despite its utility, lifespan is unknown for most fish species. This is due to the difficulties associated with accurately identifying the oldest individual(s) of a given species, and/or deriving lifespan estimates that are representative for an entire species. Recently it has been shown that CpG density in gene promoter regions can be used to predict lifespan in mammals and other vertebrates, with variable accuracy across taxa. To improve accuracy of lifespan prediction in a non-mammalian vertebrate, here we develop a fish-specific genomic lifespan predictor. Addressing previous issues of low sample size and sequence dissimilarity, we incorporate more than eight times the number of fish species used previously (n = 442) and use fish-specific gene promoters as reference sequences. Our model predicts fish lifespan from genomic CpG density alone (measured as CpG observed/expected ratio), explaining 64 % of the variance between known and predicted lifespans. The results demonstrate the value of promoter CpG density as a universal predictor of fish lifespan that can be applied where empirical data are unavailable, or impracticable to obtain.
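The CpG observed/expected ratio used as the predictor in the lifespan model can be computed from a sequence in a few lines. The sketch below uses the commonly cited definition (number of CpG dinucleotides times sequence length, divided by the product of C and G counts); whether the study uses exactly this normalisation is an assumption, and the function name `cpg_obs_exp` is a placeholder.

```python
def cpg_obs_exp(sequence: str) -> float:
    """CpG observed/expected ratio of a DNA sequence.

    Assumed definition: (count of 'CG' * length) / (count of 'C' * count of 'G').
    Returns 0.0 when the sequence has no C or no G, where the ratio is undefined.
    """
    seq = sequence.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    n_cpg = seq.count("CG")  # 'CG' cannot overlap itself, so count() is exact
    return n_cpg * len(seq) / (n_c * n_g)


if __name__ == "__main__":
    # Hypothetical promoter fragment, for illustration only.
    promoter = "ttgacgcgtacgggccgcataacg"
    print(cpg_obs_exp(promoter))
```

A ratio near 1 means CpG dinucleotides occur about as often as base composition alone would predict, while values well below 1 indicate CpG depletion.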
Soil protists are increasingly studied due to a release from previous methodological constraints and the acknowledgement of their immense diversity and functional importance in ecosystems. However, these studies often lack a sufficient depth in knowledge, which is visible in the form of falsely used terms and false- or over-interpreted data with conclusions that cannot be drawn from the data obtained. As we welcome that non-experts also include protists in their still mostly bacterial and/or fungal focused studies, our aim here is to help avoid some common errors. We provide an overview of current terms to be used when working on soil protists, like protist instead of protozoa, predator instead of grazer, microorganisms rather than microflora and terms to be used to describe the prey spectrum of protists. We then highlight some do’s and don’ts in soil protist ecology including challenges related to interpreting 18S rRNA gene amplicon sequencing data. We caution against the use of standard bioinformatic settings optimized for bacteria and the uncritical reliance on incomplete and partly erroneous reference databases. We also show why causal inferences cannot be drawn from sequence-based correlation analyses or any sampling/monitoring study in the field without thorough experimental confirmation and sound understanding of the biology of taxa. Together, we envision this work to help non-experts to more easily include protists in their soil ecology analyses, and obtain more reliable interpretations from their protist data and other biodiversity data that, in the end, will help to better understand soil ecology.
We analyzed robustness of species identification based on proteomic composition to data processing and intraspecific variability, specificity and sensitivity of species-markers as well as discriminatory power of proteomic fingerprinting and its sensitivity to phylogenetic distance. Our analysis is based on MALDI-TOF MS data from 32 marine copepod species coming from 13 regions (North and Central Atlantic and adjacent seas). A random forest (RF) model correctly classified all specimens to species level with only small sensitivity to data processing, demonstrating the strong robustness of the method. Compounds with high specificity showed low sensitivity i.e., identification was rather based on complex pattern-differences than on presence of single markers. Proteomic distance was not consistently related to phylogenetic distance. A species-gap in proteome composition appeared at 0.8 Euclidean distance when using only specimens from the same sample. When other regions or seasons were included, intra-specific variability increased, resulting in overlaps of intra- and inter-specific distance. Highest intra-specific distances (> 0.8) were observed between specimens from brackish and marine habitats i.e., salinity likely affects proteomic patterns. When testing library sensitivity of the RF model to regionality, strong misidentification was only detected between two congener pairs. Still, choice of reference library may have an impact on identification of closely related species and should be tested before routine application. We envision high relevance of this time- and cost-efficient method for future zooplankton monitoring as it provides not only in-depth taxonomic resolution for counted specimens but also add-on information e.g., on developmental stage or environmental conditions.
Despite the increasing accessibility of high-throughput sequencing, obtaining high-quality genomic data on non-model organisms without proximate well-assembled and annotated genomes remains challenging. Here we describe a workflow that takes advantage of distant genomic resources and ingroup transcriptomes to select and jointly enrich long open reading frames (ORFs) and ultraconserved elements (UCEs) from genomic samples for integrative studies of microevolutionary and macroevolutionary dynamics. This workflow is applied to samples of the African unionid bivalve tribe Coelaturini (Parreysiinae) at basin and continent-wide scales. Our results indicate that ORFs are efficiently captured without prior identification of intron-exon boundaries. The enrichment of UCEs was less successful, but nevertheless produced substantial datasets. Exploratory continent-wide phylogenetic analyses with ORF supercontigs (>515,000 parsimony informative sites) resulted in a fully resolved phylogeny, the backbone of which was also retrieved with UCEs (>11,000 informative sites). Variant calling on ORFs and UCEs of Coelaturini from the Malawi Basin produced ~2,000 SNPs per population pair. Estimates of nucleotide diversity and population differentiation were similar for ORFs and UCEs. They were low compared to previous estimates in mollusks, but comparable to those in recently diversifying Malawi cichlids and other taxa at an early stage of speciation. Skimming off-target sequence data from the same enriched libraries of Coelaturini from the Malawi Basin, we reconstructed the maternally-inherited mitogenome, which displays the gene order inferred for the most recent common ancestor of Unionidae. Overall, our workflow and results provide exciting perspectives for integrative genomic studies of microevolutionary and macroevolutionary dynamics in non-model organisms.
Revegetation projects face the major challenge of sourcing the optimal plant material. This is often done with limited information about plant performance and increasingly requires factoring in resilience to climate change. Functional traits can be used as quantitative indices of plant performance and guide provenancing, but trait values expected under novel conditions are often unknown. To support climate-resilient provenancing efforts, we develop a trait prediction model that integrates the effect of genetic variation with fine-scale temperature variation. We train our model on multiple field plantings of Arabidopsis thaliana and predict two relevant fitness traits -- days-to-bolting and fecundity -- across the species' European range. Prediction accuracies were high for days-to-bolting and moderate for fecundity, with the majority of trait variation explained by temperature differences between plantings. Projection under future climate predicted a decline in fecundity, although this response was heterogeneous across the range. In response, we identified novel genotypes that could be introduced to genetically offset the fitness decay. Our study highlights the value of predictive models to aid seed provenancing and improve the success of revegetation projects.
Innovations in ancient DNA (aDNA) preparation and sequencing technologies have exponentially increased the quality and quantity of aDNA data extracted from ancient biological materials. The additional temporal component from the incoming aDNA data can provide improved power to address fundamental evolutionary questions like characterising selection processes that shape the phenotypes and genotypes of contemporary populations or species. However, utilising aDNA to study past selection processes still involves considerable hurdles such as how to eliminate the confounding effect of genetic interactions in the inference of selection. To circumvent this challenge, in this work we extend the method introduced by He et al. (2022) to infer temporally variable selection from the data on aDNA sequences with the flexibility of modelling linkage and epistasis. Our posterior computation is carried out through a robust adaptive version of the particle marginal Metropolis-Hastings algorithm with a coerced acceptance rate. Moreover, our extension inherits their desirable features like modelling sample uncertainties resulting from the damage and fragmentation of aDNA molecules and reconstructing underlying gamete frequency trajectories of the population. We assess the performance and show the utility of our procedure with an application to ancient horse samples genotyped at the loci encoding base coat colours and pinto coat patterns.
Over the last two decades, there has been a huge increase in our understanding of microbial diversity, structure and composition enabled by high throughput sequencing (HTS) technologies. Yet, it is unclear how the number of sequences translates to the number of cells or species within the community. Additional observational data may be required to ensure relative abundance patterns from sequence reads are biologically meaningful, or presence-absence data may be used instead of abundance. The goal is to obtain robust community abundance data, simultaneously, from environmental samples. In this issue of Molecular Ecology Resources, Karlusich et al. (2022) describe a new method for quantifying phytoplankton cell abundance. Using Tara Oceans datasets, the authors propose the photosynthetic gene psbO for reporting accurate relative abundance of the entire phytoplankton community from metagenomic data. The authors demonstrate improved correlations with traditional optical methods including microscopy and flow cytometry, improving upon current molecular identification typically using rRNA marker genes. Furthermore, to facilitate application of their approach, the authors curated a psbO gene database for accessible taxonomic queries. This is an important step towards improving species abundance estimates from molecular data and eventually reporting of absolute species abundance, enhancing our understanding of community dynamics.