Introduction: Changing data challenges and accelerating needs
Understanding species ranges and developing approaches to reliably monitor distributions are essential for any conservation target (Hughes et al., 2021a). This is especially important in the context of the Post-2020 Global Biodiversity Framework (GBF) and its accompanying Monitoring Framework, which aims to provide the metrics needed to measure progress towards new targets. In recent years, biological data availability has been transformed, and we have moved from being data-poor to data-rich. There are now 2.3 billion records on GBIF, over 1.3 billion in eBird, and over 130 million on iNaturalist, with citizen science data increasingly outweighing older museum specimen data for many taxa. Point-based data enable sophisticated modelling, and analyses of traits and of change over time, in ways previously unimaginable (GBIF Secretariat 2021).
Yet greater access to data can mean that basic ecological principles are forgotten, and analysis becomes merely a statistical exercise. Combined with the incentives to publish high-impact (Eyre-Walker 2013) and global papers (Wyborn & Evans 2022), there is a temptation to aim for headline titles, regardless of data extent, pervasive biases, or the specific methods required to account for these issues (Hughes et al., 2021b). Once such resources are published, the path of least resistance is to reuse them rather than reinvent the wheel (unfortunately, this often happens even when the wheel is lopsided). More problematically, the existence of studies claiming to address a question reduces the perceived novelty of any subsequent study that improves on them, greatly reducing its potential for high-impact publication and thereby the incentive to make such improvements in the first place. Furthermore, big "headlines" are requoted and taken as accurate, whereas commentaries and responses are hard to publish and rarely receive comparable attention. As with anything on the internet today, once an inaccurate analysis is published its headline may be taken as dogma, and the consequences of its inaccuracies are inevitably propagated.
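Because the platform totals quoted above grow daily, they can be checked against live figures. The following minimal sketch, an illustration assuming Python and the `requests` package rather than part of any analysis reported here, queries GBIF's public occurrence-search API, which reports a total matching-record count when asked to return zero records:

```python
# Illustrative sketch: retrieve live occurrence-record counts from GBIF's
# public API (https://api.gbif.org/v1). eBird and iNaturalist totals would
# require their own platform-specific queries and are not shown here.
import requests

GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"

def gbif_total_occurrences(**filters) -> int:
    """Return the current number of GBIF occurrence records matching the
    given query filters (e.g. country='BR', taxonKey=212 for Aves);
    with no filters, the global total."""
    params = {"limit": 0, **filters}  # limit=0: we only want the count, not the records
    response = requests.get(GBIF_SEARCH, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["count"]

if __name__ == "__main__":
    print(f"GBIF currently holds {gbif_total_occurrences():,} occurrence records")
```

Because the same endpoint accepts filters such as country codes or taxon keys, comparable queries can also be used to quantify the geographic and taxonomic unevenness of coverage discussed below.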
Complicating matters, the discrepancy between data-rich and data-poor regions generally mirrors that in GDP, and consequently the areas with the richest biodiversity may also have the poorest coverage in terms of biodiversity data (Giam et al., 2012; Stork 2018; Hughes et al., 2021b). This means that "global studies" disproportionately represent patterns in higher-income economies. Reconciling these biases requires first understanding, and then working to overcome, sampling-related issues. Such analyses are crucial across taxa, regions, and scales, forming the foundation of effective National Biodiversity Strategies and Action Plans (NBSAPs) (Whitehorn et al., 2019; Schmidt-Traub, 2021).
Understanding the challenges and limits of the data, and which methods can be applied and how, is critical, because every form of analysis makes its own assumptions and every dataset has its own biases and inconsistencies. Depending on what data are available, species-level modelling may only be possible locally for many taxa, and using data beyond sensible bounds can misdirect priorities and misinform management plans. Without an understanding of how data were generated, or of their biases and shortcomings, the likelihood of misuse increases. Thus, with this ever-growing wealth of data, it is essential to understand how to use it effectively, and what processes to follow to ensure that outcomes can meaningfully guide future conservation targets.
Here, we explore the methods commonly applied to biodiversity data, the assumptions they make, and the consequences of applying them to data that do not meet those assumptions. We then discuss approaches and frameworks representing best practice in biodiversity data analysis, and provide a stepwise framework that can be used to ensure that biodiversity data are used within their limits and that the assumptions at each step are clearly understood.