Introduction: Changing data challenges and accelerating needs
Understanding species ranges and developing approaches to reliably
monitor distributions are essential to any conservation target
(Hughes et al., 2021a). This is especially important in the context of
the Post-2020 Global Biodiversity Framework (GBF) and its accompanying
Monitoring Framework, which aims to provide the metrics needed to
measure progress towards the new targets.

In recent years, biological data availability has been transformed, and
we have moved from being data-poor to data-rich. There are now 2.3
billion records on GBIF, over 1.3 billion in eBird, and over 130 million
on iNaturalist, with citizen-science data increasingly outweighing
earlier museum-specimen data for many taxa. Point-based data enable
sophisticated modelling, and analyses of traits and of change over time,
in ways that were previously unimaginable (GBIF Secretariat, 2021), as
the sketch below illustrates.
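To make the notion of programmatic access concrete, the sketch below
shows one way such point-based records might be pulled from GBIF. It
assumes the pygbif Python client; the species name and page size are
illustrative assumptions, not a recommended workflow.

```python
# A minimal sketch (not a recommended workflow) of retrieving
# point-based occurrence records from GBIF, assuming the pygbif
# client (pip install pygbif); the species is illustrative only.
from pygbif import occurrences as occ

# Request georeferenced records for one species; 'limit' caps this
# illustrative query at a single page of 300 records.
resp = occ.search(scientificName="Ursus arctos",
                  hasCoordinate=True, limit=300)

points = [(rec["decimalLatitude"], rec["decimalLongitude"])
          for rec in resp["results"]
          if "decimalLatitude" in rec and "decimalLongitude" in rec]
print(f"Retrieved {len(points)} of {resp['count']} available records")
```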
Yet greater access to data can mean that basic ecological principles
are forgotten, and analysis becomes merely a statistical exercise.
Combined with the incentivisation to publish high-impact (Eyre-Walker,
2013) and global (Wyborn & Evans, 2022) papers, there is a temptation to
focus on headline titles, regardless of data extent, pervasive biases,
or the specific methods required to account for these issues (Hughes et
al., 2021b). Once such resources are published, the path of least
resistance is to reuse them rather than to reinvent the wheel
(unfortunately, this often happens even when the wheel is lopsided).
More problematically, the existence of studies claiming to address a
question reduces the perceived novelty of any subsequent study that
improves on them, greatly reducing the potential for high-impact
publication and thereby also the incentive to make such improvements in
the first place. Furthermore, big “headlines” will be requoted and taken
as accurate, while commentaries and responses are hard to publish and
unlikely to receive comparable attention. As with anything on the
internet today, once an inaccurate analysis is published its headline
may be taken as dogma, and the consequences of any inaccuracies are
inevitably propagated.
Complicating matters, the divide between data-rich and data-poor
regions generally mirrors that in GDP, and consequently the areas with
the richest biodiversity may also have the poorest biodiversity-data
coverage (Giam et al., 2012; Stork, 2018; Hughes et al., 2021b). This
means that “global studies” disproportionately represent patterns in
higher-income economies. Reconciling these biases requires first
understanding and then working to overcome sampling-related issues; one
common mitigation, grid-based spatial thinning, is sketched below. Such
analyses are crucial across taxa, regions, and scales, forming the
foundation of effective National Biodiversity Strategies and Action
Plans (NBSAPs) (Whitehorn et al., 2019; Schmidt-Traub, 2021).
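As a minimal illustration of such a mitigation, the sketch below keeps
at most one record per grid cell so that heavily surveyed regions do not
dominate downstream analyses. It assumes records as (lat, lon)
decimal-degree pairs; the 0.5-degree cell size, the helper name
grid_thin, and the example coordinates are illustrative assumptions.

```python
# Sketch of grid-based spatial thinning: retain one randomly chosen
# record per grid cell to reduce the dominance of densely sampled areas.
import numpy as np

def grid_thin(coords, cell_deg=0.5, seed=0):
    """Keep one random record per cell_deg x cell_deg cell."""
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, dtype=float)
    # Assign each (lat, lon) pair to an integer grid cell.
    cells = np.floor(coords / cell_deg).astype(int)
    kept = []
    for cell in np.unique(cells, axis=0):
        idx = np.flatnonzero((cells == cell).all(axis=1))
        kept.append(rng.choice(idx))
    return coords[np.sort(kept)]

# Example: a dense cluster plus one outlier thins to two records,
# giving a more even spatial spread.
pts = [(30.01, 110.02), (30.02, 110.03), (30.03, 110.01), (45.5, 100.2)]
print(grid_thin(pts))
```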
Understanding data challenges and limits, and which methods can be
applied and how, is critical, because every form of analysis makes its
own assumptions and every dataset has its own biases and
inconsistencies. Depending on what data are available, species-level
modelling may only be possible locally for many taxa, and using data
beyond sensible bounds can misdirect priorities and misinform
management plans. Without an understanding of how data were generated,
or of their biases and shortcomings, the likelihood of misuse
increases. Thus, with this ever-growing wealth of data, it is essential
to understand how to use it effectively, and what processes to follow
to ensure that outcomes can meaningfully guide future conservation
targets. A first step is simply screening records for fitness for use,
as sketched below.
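As a hedged illustration, the sketch below applies a first-pass quality
filter over records carrying standard Darwin Core fields (as returned
by GBIF). The function name basic_quality_filter and the thresholds
(10 km coordinate uncertainty, records since 1970) are illustrative
assumptions, not recommendations.

```python
# Sketch of a first-pass "fitness for use" screen over occurrence
# records with standard Darwin Core fields; thresholds are illustrative.
import pandas as pd

def basic_quality_filter(df: pd.DataFrame) -> pd.DataFrame:
    return df[
        df["decimalLatitude"].between(-90, 90)
        & df["decimalLongitude"].between(-180, 180)
        # Drop the common (0, 0) georeferencing placeholder.
        & ~((df["decimalLatitude"] == 0) & (df["decimalLongitude"] == 0))
        # Drop imprecise records where uncertainty is reported.
        & ~(df["coordinateUncertaintyInMeters"] > 10_000)
        & (df["year"] >= 1970)
        & (df["basisOfRecord"] != "FOSSIL_SPECIMEN")
    ]

# Example: a synthetic two-record frame; the (0, 0) placeholder is dropped.
df = pd.DataFrame({
    "decimalLatitude": [30.01, 0.0],
    "decimalLongitude": [110.02, 0.0],
    "coordinateUncertaintyInMeters": [250.0, None],
    "year": [2019, 2001],
    "basisOfRecord": ["HUMAN_OBSERVATION", "HUMAN_OBSERVATION"],
})
print(basic_quality_filter(df))
```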
Here, we explore different methods commonly applied to biodiversity
data, the assumptions they make, and the impact of applying them to data
that do not meet those assumptions. We then discuss approaches and
frameworks representing best practice in biodiversity data analysis,
and provide a stepwise framework to help ensure that biodiversity data
are used within their limits and that the assumptions of each step are
clearly understood.