Context: The current data landscape
Biodiversity mapping is a central tenet of conservation, and it is
achieved either through the use of point data or expert range maps.
First came basic repositories of museum point data such as Arctos in
1996 (Jarrell et al., 2010), followed by GBIF in 2001 and OBIS for ocean
systems in 2002. More recently, a wealth of citizen science platforms
have grown to provide a greater spatial, temporal, and taxonomic volume
of data than has ever previously been available. Such data promised to
revolutionise our understanding of global biodiversity patterns,
enabling us to finally move away from the hand-drawn range maps in
BirdLife, IUCN, GARD and FishBase (all of which aggregate data and map
species distributions), which themselves represented a major step
forwards in providing any spatial data to map species ranges for
thousands of species (Hughes et al., 2021c). Most global papers continue
to use these often hand-drawn polygon maps to visualise priorities or
assess gaps because they are easy to download and simple to use, despite
studies highlighting substantial scale-dependence and errors of both
omission and commission (Herkt et al., 2017; Li et al., 2020; Hughes et
al., 2021c). Growing volumes of point data offer the option of better
and more accurate species range analyses than polygon approaches, yet
these empirical point-based data present a different challenge: how can
we use these data without over-extending their limits?
As point data continue to increase in volume, so too will the number and
type of biases within those data, making their reconciliation more
complex. At the same time, the drive to perform large-scale mapping is
increasing (Wyborn & Evans 2021). There has never before been such a
dire need for these data, given the failure to meet the Aichi
Biodiversity Targets, the delays in most countries completing their
National Biodiversity Strategies and Action Plans (and the need to
revisit these plans), and the requirements of the Monitoring Framework
of the Global Biodiversity Framework (GBF). Understanding
where we have data, what biases exist in those data, and how they can be
overcome is critical to even begin moving towards the goals of the GBF
(Mace et al. 2018). Modelling and mapping have previously even been the
subject of major assessments (e.g. IPBES 2018), and thus whilst there is
guidance on model approaches, understanding how different forms of data,
such as species distributions, can be effectively used is essential.
Understanding the dimensions of biases and their consequences
Accurate mapping of species ranges has some basic tenets that must be
followed to produce reliable results (Malavasi 2020). Depending on the
amount and quality of available data, increasingly sophisticated
approaches can be used, but even basic methods have great potential when
used carefully and appropriately. Even for the most basic analyses, data
must be representative, with minimal biases. Yet data in popular
databases were largely collected from areas within 2 km of roads (Hughes
et al., 2021b), meaning that generalist and disturbance-tolerant species
will be over-represented and relatively few sampled areas will be
protected (given the higher level of development near mapped roads;
Hughes 2018). These biases can be even stronger in citizen science data or
social media data, and must be carefully accounted for (Bird et al.,
2014; Johnston et al. 2020; Hughes et al. 2021b; Barber et al., 2022;
Chowdhury et al., 2023a).
Common uses of biodiversity data range from species-level assessments
(e.g. the IUCN Red List) to regional and global mapping. Spatial and
taxonomic biases are therefore a particular problem for biodiversity
mapping approaches which do not include point projection or
extrapolation to circumvent data gaps and biases. At the species level,
approaches such as extent of occurrence and area of occupancy (EOO and
AOO, which base ranges on minimum convex polygons and occupied cells,
respectively) will disproportionately represent the most disturbed areas
where most data exist, and whilst they are only sometimes calculated
with available point data (unless further additional data can be
collated for verification), they are a core component of IUCN range
assessments. Further, without careful data cleaning, errors can obscure
any patterns or trends, as common errors (switched coordinates,
incorrect georeferences, misidentifications, etc.) may misplace species
distribution points across different continents, even different
hemispheres, and a projective approach consequently risks overestimating
distributions across much of the global land surface. For all these
reasons, to assess whether a dataset can be used for meaningful
analysis, we first need to determine whether the data are taxonomically
or spatially representative, whether they represent the current ranges
of species, and how most data were generated, before assessing whether
the methods used can give a useful understanding of species ranges.
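To make the two range measures discussed above concrete, the following is a minimal sketch of how an EOO (minimum convex polygon) and an AOO (occupied grid cells) could be computed from point records. It assumes that coordinates have already been projected to a planar, equal-area system in kilometres, and it uses a 2 km cell size as the default for AOO; the function names (`convex_hull`, `eoo_km2`, `aoo_km2`) are illustrative and do not correspond to any specific package or to the workflow of any study cited here.

```python
from math import floor


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def eoo_km2(points):
    """Extent of occurrence: area of the minimum convex polygon (shoelace formula)."""
    hull = convex_hull(points)
    if len(hull) < 3:
        return 0.0  # one point, two points, or collinear points enclose no area
    twice_area = 0.0
    for i in range(len(hull)):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % len(hull)]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def aoo_km2(points, cell_km=2.0):
    """Area of occupancy: number of occupied grid cells times the cell area."""
    cells = {(floor(x / cell_km), floor(y / cell_km)) for x, y in points}
    return len(cells) * cell_km ** 2


# Hypothetical records (x, y in km): four corner records plus two clustered ones.
records = [(0, 0), (10, 0), (10, 10), (0, 10), (5, 5), (5.5, 5.5)]
print(eoo_km2(records))  # 100.0
print(aoo_km2(records))  # 20.0 (five occupied 2 km cells)
```

Note that both values depend entirely on where records happen to exist: adding roadside records changes the AOO cell count without saying anything about unsurveyed cells, which is the limitation described in the text.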
Globally, biases are pervasive, and representative and accessible data
simply do not exist for many taxa, especially hyper-diverse and
challenging taxa such as insects (Garcia-Rosello et al., 2023). Most
data come from high-income countries, despite only a minority of species
coming from these regions (Hughes et al., 2021; Orr et al., 2020). For
example, Asian bees represent 15% of globally described species, but
only 1% of data (Orr et al., 2020). Notably, problematic knowledge gaps
still remain even in some of the best-known areas like Europe (Leandro
et al. 2017). Ultimately, any research attempting global-scale analysis
based on point data would strongly over-represent patterns in
high-income economies and better-known regions within them (Orr et al.,
2021), though different data-sharing policies from different countries
or institutes may alter completeness of available data. Countries and
regions may restrict both data sharing and research permitting,
both of which can alter the availability and therefore
representativeness of data, particularly in developing economies. These
policies may also limit collaborative efforts, particularly between
neighbouring countries, and make transnational analysis
challenging due to a lack of standardisation between regional
data-collection efforts. Thus, equitable international partnerships are
a paramount consideration for research going forward, and understanding
how to reconcile differences is crucial to best use data that have
already been collected.
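One simple way to express the mismatch described above is the ratio between a region's share of records and its share of described species. The sketch below is an illustration only: the function name `representation_ratio` is hypothetical, and the only figures used are the Asian bee percentages quoted from Orr et al. (2020) in the text. A ratio of 1 would indicate records in proportion to described diversity; values well below 1 indicate under-representation.

```python
def representation_ratio(record_share, species_share):
    """Ratio of a region's share of records to its share of described species.

    Both arguments are percentages (or both proportions); only their ratio matters.
    """
    if species_share <= 0:
        raise ValueError("species_share must be positive")
    return record_share / species_share


# Asian bees: 15% of described species but only 1% of data (Orr et al., 2020).
asian_bees = representation_ratio(record_share=1.0, species_share=15.0)
print(f"Asian bees: {asian_bees:.2f}")  # Asian bees: 0.07
```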
Taking a recent example, Chowdhury et al. (2023b) aimed to assess the
adequacy of protected area coverage for all insects. Using
their occurrence data DOI, we can see that 69% of their data came from
Europe and 21% from North America, whereas Africa,
Asia, Oceania, and South America each contributed only about 2% of
records, despite having much greater diversity (Orr et al., 2020). This is
consistent with other analyses across taxa, highlighting why
interpolation requires extreme caution when interpreting global trends
(e.g. Orr et al., 2021; Hughes et al., 2021b); a recent attempt at
global bee decline analysis faced similar geographic biases and did not
account for them (Zattara & Aizen 2021), and even many regional
analyses (e.g. Kerr et al., 2015) do not adequately account for potential
changes in collector aims and behaviours over time. There are similar
biases in taxonomic representation (Chowdhury et al., 2023b); of the
eight genera with over one million points in GBIF, seven are
butterflies, and in fact 51% of all GBIF invertebrate data are for
Lepidoptera. In another example, Bolam et al. (2023) analysed most
IUCN-assessed threatened species and claimed that over half of
threatened species require recovery actions, but which species have been
assessed is itself biased by spatial effort for most groups, as well as
only representing a subset of species (Hughes et al., 2021a), so these
results may not be generalisable. This inability to generalise given the
lack of spatial or taxonomic representation may be the case for many
studies using subsets of total diversity for which there are sufficient
data (Visconti et al. 2016; Pacifici et al. 2020). Furthermore,
taxonomic expertise may be limited, and in certain taxa, reliance even
on single experts may impact the reliability of results across space and
time. Given acknowledged spatial biases (Hughes et al. 2021b;
Rocha-Ortega et al. 2021), many of the species with sufficient data
likely already occupy human-modified areas and would benefit little from
protected areas, or might even do worse if they show anthropophilic
tendencies. Furthermore, overall 80% of insect families have under 10%
of species covered, and the remaining 20% have 11-13%, meaning that no
generalisable conclusions can be drawn when the majority of species are
not covered (because those with sufficient data largely do not represent
rarer species). In terms of how records were generated, 62% were human
observations (mostly citizen science data) and 31% were specimens
(possibly from centuries ago, as a time filter was not applied). Some
specimens might even be fossils of species which are still extant (as
fossil records were downloaded and only extinct fossil species were
removed).
Citizen science data can also complicate matters when used uncritically,
yet such data make up most contemporary data for many groups. Taxonomic
biases are amplified in these data, with birds clearly dominating
(Dobson et al., 2020; Di Cecco et al., 2021), but other groups may also
exhibit biases. For example, 11% of all insect data from Chowdhury et
al. (2023b) was just from UK butterfly and moth citizen science
monitoring programs, and a further 10% were from global citizen science
programs. This would exacerbate the aforementioned bias of collections
towards developed areas (Hughes et al. 2021b), as seen for bees, where it
was unaccounted for in Zattara & Aizen (2021). Citizen science data are
useful, especially for phenological monitoring, but occur
disproportionately in developed areas and for common, easily recognised
species, especially “charismatic” or “beautiful” ones, meaning that,
without careful steps to correct for or counteract biases, these data
can compound biases and not necessarily improve our knowledge (Dickinson
et al. 2012; Bird et al. 2014; Ward 2014). This can also impact the
ability of such data to detect trends (Kamp et al., 2016). Generally,
regional differences in organisations and their taxonomic foci mean that
these biases may vary by region; the UK in particular has a huge
aggregation of Lepidoptera data, despite not having particularly high
diversity relative to other regions. This is further complicated by
differences in data sharing policies. For example, iNaturalist is free
to use, but the Bees, Wasps & Ants Recording Society of the UK shares
data publicly only on 10 km grids and, as such, these data cannot be
included in GBIF downloads
(https://www.bwars.com/content/bwars-data-download).
These issues mean that the public data for insects are not spatially or
taxonomically representative, and cannot be regarded as such, so
circumventing these issues requires either resampling or
reprojection methods if we wish to reconstruct meaningful global, or
even regional, patterns (e.g. Orr et al., 2021). Given the spatial and
taxonomic gaps, clear assessments of representativeness of data are
needed, especially in tropical regions (Giam et al., 2012; de Araujo et
al., 2022), and supplemental data may be needed for even the most basic
view of many taxa. Many regions have disproportionate volumes of data with
different regional biases; for example, Chowdhury et al. (2023b) have
largely provided a metric with a disproportionate emphasis on European
Lepidoptera. However, even within Europe, further assessment of the
percentage of the area would be needed to judge the true degree of
protection, unless additional modelling and calibration were used to
reconcile the biases within these regions. Stating any result beyond
these regions and taxa risks being misleading, while giving the
impression that we already have sufficient data for such types of global
analyses, which we simply do not for many taxa (Wyborn & Evans, 2021).
Measuring diversity in the face of biases
Following the case study outlined above, it is clear that there are
right and wrong ways to carry out and interpret analyses. Whilst some
shortcuts may have relatively little impact, others may invalidate
interpretations when assumptions are not met. It is also worth
considering the cumulative contribution of smaller biases in undermining
accurate interpretation. Now we come to the methods: how can we use
spatially and taxonomically biased data to recover biodiversity
patterns? Avoiding biases entirely is virtually impossible in large
datasets, so finding methods to minimise their impacts is critical.
Different researchers and organisations employ different methods, but
consistency is important to enable comparison. One of the major methods
used by the IUCN in generating species ranges (as part of IUCN
assessments) is to quantify species distributions using the extent of
occurrence (EOO, a minimum convex polygon, MCP) and the area of
occupancy (AOO, the occupied subset of the EOO). Such methods have
also been employed in some research articles (Bradshaw et al., 2014;
Chowdhury et al., 2023), sometimes with useful improvements that help
alleviate biases (Kass et al., 2021). However, mapping these ranges
requires a certain level of data confidence and completeness in existing
distributional data, and the uses of either technique are limited if
systematically collated data were not used for mapping ranges. In such
cases, the AOO approach cannot be usefully applied alone because most
points are from developed and disturbed areas (for many species, we
would not expect the areas inhabited to be protected as they are less
likely to represent high-quality habitat targeted for such protections).
Thus, if an AOO based on point data is used, a null dataset (bias
weighting and assessment of representativeness) may be needed to ensure
that surveyed cells are adequate, as biased sampling again means that
more developed and less protected areas will have more species data.
Areas of occupancy would therefore require further stratified sampling,
such as grid- or transect-based approaches, or would need to rely on
percentage presence approaches (the percentage of surveyed cells in
which a species was present) to try to assess patterns, and may be
entirely unsuitable if the degree of data coverage is too low.
Furthermore, the AOO approach has a basic requirement to be performed
well: it must be based on an inventory of species presence within the
estimated range. Such a requirement is unlikely to be met in global
analyses, where incomplete sample coverage will underestimate occurrence
wherever data gaps occur.
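The percentage presence idea mentioned above can be sketched as follows. This is a minimal illustration under a strong assumption: a cell counts as "surveyed" if it contains at least one record of any species in the dataset, which is only a crude proxy for real survey effort. The function name `percentage_presence` and the record layout `(species, x, y)` are hypothetical.

```python
from collections import defaultdict
from math import floor


def percentage_presence(records, cell_size=1.0):
    """Percentage of surveyed cells in which each species was recorded.

    records: iterable of (species, x, y) tuples in planar units.
    A cell is treated as surveyed if any species was recorded in it
    (an assumption standing in for true survey effort).
    """
    surveyed = set()
    occupied = defaultdict(set)
    for species, x, y in records:
        cell = (floor(x / cell_size), floor(y / cell_size))
        surveyed.add(cell)
        occupied[species].add(cell)
    return {sp: 100.0 * len(cells) / len(surveyed) for sp, cells in occupied.items()}


# Hypothetical records across three surveyed cells.
records = [
    ("sp_a", 0.5, 0.5), ("sp_a", 1.5, 0.5), ("sp_a", 0.6, 0.6),
    ("sp_b", 0.2, 0.2),
    ("sp_c", 3.5, 3.5),
]
for species, pct in sorted(percentage_presence(records).items()):
    print(species, round(pct, 1))
```

Because the denominator is the set of surveyed cells rather than the whole study region, the measure does not penalise a species for unsurveyed areas, but it becomes unstable when few cells have been surveyed, in line with the caveat about low data coverage.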
Where data are inadequate for occupancy-like approaches, studies may
attempt to use point-based data to estimate entire species ranges, but
this requires extreme caution. For interpolation-based approaches,
points must be an accurate reflection of species ranges, and multiple
measures are needed to curate the data. Because of the source of data
and possible encoding errors, spatial filters are needed to ensure that
data are accurate. The development and cleaning of databases is an
important but challenging endeavour, and is needed to enable
higher-resolution analysis. Filters could include admin-area checklists
(following correction of synonyms using a curated list) or hemisphere
filters (Orr et al., 2021) to remove points where coordinate errors may
exist or points may be in private collections or even zoos. Chowdhury et
al. (2023) used CoordinateCleaner (Zizka et al. 2019) to filter for
occurrence quality (Table S1). Without clearer filters for realms or
continents, this still leaves considerable room for errors. Recently,
two more packages, bdc (Ribeiro et al. 2022) and BeeBDC (Dorey et al.,
in review; Table S1), have been released that provide additional and
complementary quality checks. A failure to adequately clean and filter
data can substantially inflate species ranges.
Additionally, even after filtering, data must be examined and analysed
with a critical eye and with hypotheses kept in mind to ensure that
results fall within the limits of what the data permit.
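As an illustration of the kind of coordinate checks described above, the sketch below flags records with impossible coordinates, zero coordinates, or locations outside an expected region, and distinguishes cases where swapping latitude and longitude would place the record inside that region. It is not the CoordinateCleaner, bdc, or BeeBDC implementation; the function `flag_record` and the bounding box are hypothetical, and in practice the expected region would come from a curated checklist rather than a rectangle.

```python
def flag_record(lat, lon, expected_bbox=None):
    """Return a list of quality flags for one occurrence record.

    expected_bbox: optional (min_lat, max_lat, min_lon, max_lon) describing the
    region where the species is known to occur (an assumed input, e.g. derived
    from a regional checklist).
    """
    flags = []
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        flags.append("out_of_range")
    if lat == 0 and lon == 0:
        flags.append("zero_coordinates")
    if expected_bbox is not None:
        min_lat, max_lat, min_lon, max_lon = expected_bbox
        inside = min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
        if not inside:
            # Would the record fall inside the region if lat and lon were swapped?
            swapped_inside = (min_lat <= lon <= max_lat) and (min_lon <= lat <= max_lon)
            flags.append("possible_swap" if swapped_inside else "outside_expected_region")
    return flags


# Illustrative bounding box only (roughly southern-hemisphere, eastern longitudes).
bbox = (-45.0, -10.0, 110.0, 155.0)
print(flag_record(-33.9, 151.2, bbox))  # []
print(flag_record(151.2, -33.9, bbox))  # ['out_of_range', 'possible_swap']
print(flag_record(33.9, 151.2, bbox))   # ['outside_expected_region']
```

Flagging rather than silently deleting records keeps the decision with the analyst, which matters because a record outside the expected region may be an error, a captive individual, or a genuine range extension.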
Conversely, there can be issues if a range is calculated based on a
minimum number of points without spatial thinning or removal of
duplicates when calculating the number. For example, Chowdhury et al.
(2023) aimed to assess the percentage of ranges protected for all
insects but analysed 217 species with a polygon area of “3” and 1105
with under “10” (presumably kilometres); these low values suggest that
many species ranges are underestimated, possibly reported only from
small-scale inventories. If they are not from a protected area, such
species would automatically be 100% unprotected. Conversely, a species
with three geographically disparate records might have its range greatly
overestimated, potentially also exaggerating the percentage of its range
that is unprotected. Points included in studies such as Chowdhury et al.
(2023) lack such filters, and whilst attempts were made to address this
(using the rangeBuilder package), such errors will prevail when data
quantity is low and filters have not been implemented to clean
distributional errors. However, even in a better scenario where filters
had been applied, would a convex hull be appropriate to represent
current species ranges?
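The point about counting records only after duplicate removal and spatial thinning can be sketched as below. This is a simplified grid-based thinning (one record kept per cell), assuming planar coordinates; it is not the procedure used by rangeBuilder or by any study cited here, and the names `thin_records`, `has_enough_records`, and the threshold of three records are illustrative assumptions.

```python
from math import floor


def thin_records(points, cell_size):
    """Keep the first record encountered in each grid cell.

    Exact duplicates fall in the same cell, so they are removed as well.
    """
    kept = {}
    for x, y in points:
        cell = (floor(x / cell_size), floor(y / cell_size))
        kept.setdefault(cell, (x, y))
    return list(kept.values())


def has_enough_records(points, cell_size, minimum=3):
    """Apply the minimum-record threshold to thinned records, not raw counts."""
    return len(thin_records(points, cell_size)) >= minimum


# Five raw records, but only two spatially distinct localities at a 1-unit grid.
raw = [(0.1, 0.1), (0.1, 0.1), (0.2, 0.3), (5.0, 5.0), (5.4, 5.9)]
print(len(raw))                        # 5
print(len(thin_records(raw, 1.0)))     # 2
print(has_enough_records(raw, 1.0))    # False
```

In this hypothetical case a raw count of five would pass a three-record threshold, yet the records represent only two localities, so any polygon built from them would reflect sampling locations rather than the species range.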
Given the paucity of data for most insects (and many other taxa),
different approaches have been applied to circumvent issues in existing
studies aiming to map richness across continents. These include
interpolating richness by modelling it directly as a function of
ecological drivers (termites, bees, Collembola: Liu et al., 2022; Orr et
al., 2021; Potapov et al., 2023, respectively), species-level modelling
within MCPs to delimit ranges, or modelling following the building of a
comprehensive point-based database (ants: Kass et al., 2022). Even for
smaller regions, where there is high diversity, filtering suitable
habitat within a polygon provides a much more targeted and realistic
understanding of species ranges (Chesshire et al., 2023). In all of
these studies the impact of data biases was limited either through
interpolation of richness for data-poor regions, or via aggregation of
further data followed by modelling, and in all of them habitat filters
were applied to ensure that analyses were accurate. If these steps are
not taken, results could be misleading and, at times, counterproductive,
given the limited funds and support available for conservation and area
protection.