Context: The current data landscape
Biodiversity mapping is central to conservation, and it is achieved through the use of either point data or expert range maps. First came basic repositories of museum point data such as Arctos in 1996 (Jarrell et al., 2010), followed by GBIF in 2001 and OBIS for ocean systems in 2002. More recently, a wealth of citizen science platforms has grown to provide a greater spatial, temporal, and taxonomic volume of data than has ever previously been available. Such data promised to revolutionise our understanding of global biodiversity patterns, enabling us to finally move away from the hand-drawn range maps in BirdLife, IUCN, GARD and FishBase (all of which aggregate data and map species distributions), which themselves represented a major step forwards by providing spatial data to map the ranges of thousands of species (Hughes et al., 2021c). Most global papers continue to use these often hand-drawn polygon maps to visualise priorities or assess gaps because they are easy to download and simple to use, despite studies highlighting substantial scale-dependence and errors of both omission and commission (Herkt et al., 2017; Li et al., 2020; Hughes et al., 2021c). Growing volumes of point data offer the option of more accurate analysis of species ranges than polygon approaches, yet these empirical point-based data present a different challenge: how can we use these data without over-extending their limits?
As point data continue to increase in volume, so too will the number and type of biases within those data, making their reconciliation more complex. At the same time, the drive to perform large-scale mapping is increasing (Wyborn & Evans 2021). There has never before been such a dire need for these data, given the failure to meet the Aichi Biodiversity Targets, the delays in most countries completing their National Biodiversity Strategies and Action Plans (and the need to revisit these plans), and the Monitoring Framework. Understanding where we have data, what biases exist in those data, and how they can be overcome is critical to even begin moving towards the goals of the Global Biodiversity Framework (GBF; Mace et al. 2018). Modelling and mapping have previously even been the subject of major assessments (e.g. IPBES 2018), and thus whilst there is guidance on model approaches, understanding how different forms of data, such as species distributions, can be effectively used is essential.
Understanding the dimensions of biases and their consequences
Accurate mapping of species ranges rests on some basic tenets that must be followed to produce reliable results (Malavasi 2020). Depending on the amount and quality of available data, increasingly sophisticated approaches can be used, but even basic methods have great potential when used carefully and appropriately. Even for the most basic analyses, data must be representative, with minimal biases. However, data in popular databases were largely collected from areas within 2 km of roads (Hughes et al., 2021b), meaning that generalist and disturbance-tolerant species will be over-represented and that, in most cases, relatively few of the sampled areas will be protected (given the higher level of development near mapped roads; Hughes 2018). These biases can be even stronger in citizen science data or social media data, and must be carefully accounted for (Bird et al., 2014; Johnston et al. 2020; Hughes et al. 2021b; Barber et al., 2022; Chowdhury et al., 2023a).
Uses of biodiversity data range from species-level assessments (e.g. IUCN Red List) to regional and global mapping. Spatial and taxonomic biases will therefore be a particular problem for biodiversity mapping approaches that do not include point projection or extrapolation to circumvent data gaps and biases. At the species level, approaches such as extent of occurrence and area of occupancy (EOO and AOO, which base ranges on minimum convex polygons and occupied cells, respectively) will disproportionately represent the most disturbed areas where most data exist, and whilst they are only sometimes calculated with available point data (unless further additional data can be collated for verification), they are a core component of IUCN range assessments. Further, without careful data cleaning, errors can obscure any patterns or trends, as common errors (switched coordinates, incorrect georeferences, misidentifications, etc.) may misplace species distribution points across different continents, or even different hemispheres, and a projective approach consequently risks overestimating distributions across much of the global land surface. For all these reasons, to assess whether a dataset can be used for meaningful analysis, we first need to determine whether the data are taxonomically or spatially representative, whether they represent the current ranges of species, and how most data were generated, before assessing whether the methods used for assessing ranges can give a useful understanding of species ranges.
Globally, biases are pervasive, and representative and accessible data simply do not exist for many taxa, especially hyper-diverse and challenging taxa such as insects (Garcia-Rosello et al., 2023). Most data come from high-income countries, despite only a minority of species coming from these regions (Hughes et al., 2021; Orr et al., 2020). For example, Asian bees represent 15% of globally described species, but only 1% of data (Orr et al., 2020). Notably, problematic knowledge gaps remain even in some of the best-known areas, such as Europe (Leandro et al. 2017). Ultimately, any research attempting global-scale analysis based on point data would strongly over-represent patterns in high-income economies and the better-known regions within them (Orr et al., 2021), though differing data-sharing policies among countries or institutes may alter the completeness of available data. Countries and regions may limit both data availability and research permitting, either of which can alter the availability, and therefore the representativeness, of data, particularly in developing economies. These policies may also limit collaborative efforts, particularly between neighbouring countries, and make transnational analysis especially challenging due to a lack of standardisation between regional data-collection efforts. Thus, equitable international partnerships are a paramount consideration for research going forward, and understanding how to reconcile differences is crucial to make the best use of data that have already been collected.
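The mismatch between where species occur and where records come from can be expressed as a simple ratio. The sketch below is illustrative only: the function name `representation_ratio` is hypothetical, and the only figures used are the Asian bee values quoted above (Orr et al., 2020). A value of 1 would indicate that a region contributes records in proportion to its described species.

```python
def representation_ratio(record_share: float, species_share: float) -> float:
    """Share of records divided by share of described species.

    1.0 means proportional representation; values below 1 mean the
    region or taxon is under-represented in the data.
    """
    if species_share <= 0:
        raise ValueError("species_share must be positive")
    return record_share / species_share


# Asian bees: about 15% of described species but 1% of records (Orr et al., 2020)
asian_bees = representation_ratio(record_share=0.01, species_share=0.15)
print(f"Asian bees representation ratio: {asian_bees:.2f}")
```

Under these figures, Asian bees have roughly one fifteenth of the records that proportional sampling would provide.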
Taking a recent example, Chowdhury et al. (2023b) aimed to assess the adequacy of protected area coverage for all insects. Using their occurrence data DOI, we can see that 69% of their data came from Europe and 21% from North America, whereas Africa, Asia, Oceania, and South America each contributed only about 2% of records, despite having much greater diversity (Orr et al., 2020). This is consistent with other analyses across taxa, highlighting why interpolation requires extreme caution when interpreting global trends (e.g. Orr et al., 2021; Hughes et al., 2021b); a recent attempt at global bee decline analysis faced similar geographic biases and did not account for them (Zattara & Aizen 2021), and even many regional analyses (e.g. Kerr et al., 2015) do not adequately account for potential changes in collector aims and behaviours over time. There are similar biases in taxonomic representation (Chowdhury et al., 2023b); of the eight genera with over one million points in GBIF, seven are butterflies, and in fact 51% of all GBIF invertebrate data are for Lepidoptera. In another example, Bolam et al. (2023) analyzed most IUCN-assessed threatened species and claimed that over half of threatened species require recovery actions, but which species have been assessed is itself biased by spatial effort for most groups, and the assessed species represent only a subset of all species (Hughes et al., 2021a), so these results may not be generalizable. This inability to generalise given the lack of spatial or taxonomic representation may apply to many studies using subsets of total diversity for which there are sufficient data (Visconti et al. 2016; Pacifici et al. 2020). Furthermore, taxonomic expertise may be limited, and in certain taxa, reliance on even a single expert may impact the reliability of results across space and time. Given acknowledged spatial biases (Hughes et al. 2021b; Rocha-Ortega et al. 2021), many of these species likely already occupy human-modified areas and would benefit little from protected areas, or might even do worse if they show anthropophilic tendencies. Furthermore, overall 80% of insect families have under 10% of species covered, and the remaining 20% have 11-13%, meaning that no generalizable conclusions can be drawn when the majority of species are not covered (because those with sufficient data largely do not represent rarer species). In terms of how records were generated, 62% were human observations (mostly citizen science data) and 31% were specimens (possibly from centuries ago, as a time filter was not applied). Some specimens might even be fossils of species which are still extant (as fossil records were downloaded and only extinct fossil species were removed).
Citizen science data can also complicate matters when used uncritically, yet such data make up most contemporary data for many groups. Taxonomic biases are amplified in these data, with birds clearly dominating (Dobson et al., 2020; Di Cecco et al., 2021), but other groups may also exhibit biases. For example, 11% of all insect data from Chowdhury et al. (2023b) came from UK butterfly and moth citizen science monitoring programs alone, and a further 10% came from global citizen science programs. This would exacerbate the aforementioned bias of collections towards developed areas (Hughes et al. 2021b), as seen for bees, where it was unaccounted for in Zattara & Aizen (2021). Citizen science data are useful, especially for phenological monitoring, but occur disproportionately in developed areas and for common and easy-to-recognise species, especially “charismatic” or “beautiful” ones, meaning that, without careful steps to correct for or counteract biases, these data can compound biases and not necessarily improve our knowledge (Dickinson et al. 2012; Bird et al. 2014; Ward 2014). This can also impact the ability of such data to detect trends (Kamp et al., 2016). Generally, regional differences in organisations and their taxonomic foci mean that these biases may vary by region; the UK in particular has a huge aggregation of Lepidoptera data, despite not having particularly high diversity relative to other regions. This is further complicated by differences in data-sharing policies. For example, iNaturalist is free to use, but the Bees, Wasps & Ants Recording Society of the UK shares data publicly only on 10 km grids and, as such, cannot be included in GBIF downloads (https://www.bwars.com/content/bwars-data-download).
These issues mean that the public data for insects are not spatially or taxonomically representative, and cannot be regarded as such, so approaches to circumvent these issues require either resampling or reprojection methods if we wish to reconstruct meaningful global, or even regional, patterns (e.g. Orr et al., 2021). Given the spatial and taxonomic gaps, clear assessments of the representativeness of data are needed, especially in tropical regions (Giam et al., 2012; de Araujo et al., 2022), and supplemental data may be needed for even the most basic view of many taxa. Many regions have disproportionate volumes of data with different regional biases; for example, Chowdhury et al. (2023b) have largely provided a metric with a disproportionate emphasis on European Lepidoptera. However, even within Europe, further assessment of the percentage of the area would be needed to judge the true degree of protection, unless additional modelling and calibration were used to reconcile the biases within these regions. Stating any result beyond these regions and taxa risks being misleading, while giving the impression that we already have sufficient data for such types of global analyses, which we simply do not for many taxa (Wyborn & Evans, 2021).
Measuring diversity in the face of biases
Following the case study outlined above, it is clear that there are right and wrong ways to carry out and interpret analyses. Whilst some shortcuts may have relatively little impact, others may invalidate interpretations when assumptions are not met. It is also worth considering the cumulative contribution of smaller biases in undermining accurate interpretation. Now we come to the methods: how can we use spatially and taxonomically biased data to recover biodiversity patterns? Avoiding biases entirely is virtually impossible in large datasets, so finding methods to minimise their impacts is critical.
Different researchers and organizations employ different methods, but consistency is important to enable comparison. One of the major methods used by the IUCN to generate species ranges (as part of IUCN assessments) is to quantify species distributions using the extent of occurrence (EOO, a minimum convex polygon or MCP) and the area of occupancy (AOO, the occupied subset of the EOO). Such methods have also been employed in some research articles (Bradshaw et al., 2014; Chowdhury et al., 2023), sometimes with useful improvements that help alleviate biases (Kass et al., 2021). However, mapping these ranges requires a certain level of confidence in, and completeness of, existing distributional data, and the uses of either technique are limited if systematically collated data were not used for mapping ranges. In such cases, the AOO approach cannot be usefully applied alone because most points are from developed and disturbed areas (for many species, we would not expect the areas inhabited to be protected, as they are less likely to represent the high-quality habitat targeted for such protections). Thus, if an AOO based on point data is used, a null dataset (bias weighting and assessment of representativeness) may be needed to ensure that surveyed cells are adequate, as biased sampling again means that more developed and less protected areas will have more species data. AOO estimates would therefore require further stratified sampling, such as grid- or transect-based approaches, or would need to rely on percentage presence approaches (the percentage of surveyed cells in which a species was present) to try to assess patterns, and may be entirely unsuitable if the degree of data coverage is too low.
Furthermore, the AOO system of analysis has a basic requirement if it is to be performed well: it must be based on an inventory of species presence within the estimated range. Such a requirement is unlikely to be met in global analyses, where incomplete sample coverage will underestimate occurrence wherever data gaps occur.
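To make the distinction between these metrics concrete, the following is a minimal sketch of how EOO, AOO and percentage presence could be computed from point records. It is not the IUCN's implementation. It assumes coordinates have already been projected to an equal-area system in kilometres, uses a 2 km cell size as an adjustable parameter, and all function names and example coordinates are hypothetical.

```python
from math import floor


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def eoo_km2(points):
    """EOO as the area of the minimum convex polygon (shoelace formula)."""
    hull = convex_hull(points)
    if len(hull) < 3:
        return 0.0
    twice_area = 0.0
    for i, (x1, y1) in enumerate(hull):
        x2, y2 = hull[(i + 1) % len(hull)]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def occupied_cells(points, cell_km=2.0):
    """Set of grid cells (column, row) containing at least one record."""
    return {(floor(x / cell_km), floor(y / cell_km)) for x, y in points}


def aoo_km2(points, cell_km=2.0):
    """AOO as the number of occupied cells times the cell area."""
    return len(occupied_cells(points, cell_km)) * cell_km ** 2


def percent_presence(species_points, survey_points, cell_km=2.0):
    """Percentage of surveyed cells in which the species was recorded."""
    surveyed = occupied_cells(survey_points, cell_km)
    if not surveyed:
        return 0.0
    present = occupied_cells(species_points, cell_km) & surveyed
    return 100.0 * len(present) / len(surveyed)


# Hypothetical records (x, y in km) for one species
species = [(0, 0), (10, 0), (10, 10), (0, 10), (5, 5), (5.5, 5.5)]
# Hypothetical records for all species, used as a proxy for survey effort
surveys = species + [(20, 20), (30, 30), (40, 0), (2.5, 8.1), (12, 12)]

print(eoo_km2(species))                    # area of the convex hull
print(aoo_km2(species))                    # occupied cells x 4 km^2
print(percent_presence(species, surveys))  # share of surveyed cells occupied
```

In this toy case the EOO (100 km²) is five times the AOO (20 km²), and the species is present in half of the surveyed cells. The percentage is only meaningful if the surveyed cells themselves are representative, which is the limitation described above.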
Where data are inadequate for occupancy-like approaches, methods may attempt to use point-based data to assess entire species ranges, but this requires extreme caution. For interpolation-based approaches, points must be an accurate reflection of species ranges, and multiple measures are needed to curate the data. Because of the source of the data and possible encoding errors, spatial filters are needed to ensure that data are accurate. The development and cleaning of databases is a challenging endeavour, but it is needed to enable higher-resolution analysis. Filters could include admin-area checklists (following correction of synonyms using a curated list) or hemisphere filters (Orr et al., 2021) to remove points where coordinate errors may exist, or where points may be in private collections or even zoos. Chowdhury et al. (2023) used CoordinateCleaner (Zizka et al. 2019) to filter for occurrence quality (Table S1). Without clearer filters for realms or continents, this still leaves considerable room for errors. Recently, two further packages, bdc (Ribeiro et al. 2022) and BeeBDC (Dorey et al., in review; Table S1), have been released that provide additional and complementary quality checks. A failure to adequately clean and filter data can substantially inflate species ranges. Additionally, even after filtering, data must be examined and analysed with a critical eye and with hypotheses kept in mind to ensure that results fall within the limits of what the data permit.
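The kinds of spatial filters described here can be illustrated with a short sketch. This is not the API of CoordinateCleaner, bdc or BeeBDC. It is a simplified, hypothetical set of checks that assumes each species has an expected bounding box (for example, derived from an admin-area checklist), whereas a real workflow would test against polygons. The function names and the example bounding box are assumptions.

```python
def in_bbox(lat, lon, bbox):
    """True if the point lies inside bbox = (min_lat, max_lat, min_lon, max_lon)."""
    min_lat, max_lat, min_lon, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def flag_record(lat, lon, bbox):
    """Return a list of quality flags for one record (empty list = passes)."""
    if lat is None or lon is None:
        return ["missing_coordinates"]
    flags = []
    if lat == 0 and lon == 0:
        flags.append("zero_coordinates")
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        flags.append("out_of_range")
    if not in_bbox(lat, lon, bbox):
        flags.append("outside_expected_region")
        # Check whether a common encoding error would explain the mismatch
        if in_bbox(lon, lat, bbox):
            flags.append("possible_swapped_coordinates")
        elif any(in_bbox(a, b, bbox)
                 for a, b in ((-lat, lon), (lat, -lon), (-lat, -lon))):
            flags.append("possible_sign_error")
    return flags


def clean(records, bbox):
    """Keep only records that raise no flags."""
    return [r for r in records if not flag_record(r[0], r[1], bbox)]


# Hypothetical expected region for a species restricted to Southeast Asia
expected = (-11.0, 29.0, 92.0, 141.0)
records = [
    (13.7, 100.5),    # plausible
    (100.5, 13.7),    # latitude and longitude switched
    (13.7, -100.5),   # longitude sign flipped (wrong hemisphere)
    (0.0, 0.0),       # placeholder coordinates
    (None, 100.5),    # missing value
]

for lat, lon in records:
    print((lat, lon), flag_record(lat, lon, expected))
print(clean(records, expected))
```

Flagging records rather than silently dropping them keeps the cleaning step auditable, so that the number of records lost to each type of error can be reported alongside the results.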
Conversely, there can be issues if a range is calculated based on a minimum number of points without spatial thinning or removal of duplicates when counting those points. For example, Chowdhury et al. (2023) aimed to assess the percentage of ranges protected for all insects but analysed 217 species with a polygon area of “3” and 1105 with under “10” (presumably kilometres); these low values suggest that many species ranges are underestimated, possibly because the species were reported only from small-scale inventories. If those records are not from a protected area, such species would automatically be 100% unprotected. Equally, a species with three geographically disparate records might have its range greatly overestimated, potentially also exaggerating the percentage of its range that is unprotected. Points included in studies such as Chowdhury et al. (2023) lack such filters, and whilst attempts were made (using the rangeBuilder package), such errors will prevail when data quantity is low and filters have not been implemented to clean distributional errors. However, in a better scenario where filters had been applied, would a convex hull be appropriate to represent current species ranges?
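The effect of counting records without thinning can be shown with a minimal sketch. It assumes projected coordinates in kilometres. The 10 km cell size, the five-cell threshold and the function names are hypothetical choices for illustration, not values taken from any of the studies discussed.

```python
from math import floor


def thin_to_cells(points, cell_km=10.0):
    """Spatial thinning: keep the first record encountered in each grid cell."""
    kept = {}
    for x, y in points:
        cell = (floor(x / cell_km), floor(y / cell_km))
        kept.setdefault(cell, (x, y))
    return list(kept.values())


def enough_for_range(points, min_cells=5, cell_km=10.0):
    """Apply the minimum-record threshold to thinned records, not raw counts."""
    return len(thin_to_cells(points, cell_km)) >= min_cells


# Six raw records: five from a single site (four exact duplicates) and one outlier
raw = [(1.0, 1.0)] * 4 + [(1.2, 1.1), (55.0, 80.0)]
thinned = thin_to_cells(raw)

print(len(raw), len(thinned))   # raw count versus spatially independent count
print(enough_for_range(raw))    # False: only two distinct cells are occupied
```

A threshold applied to the raw count would accept this species (six records), yet the records describe only two localities, so any polygon drawn from them would reflect sampling effort rather than the range.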
Given the paucity of data for most insects (and many other taxa), different approaches have been applied to circumvent issues in existing studies aiming to map richness across continents. These include interpolating richness by modelling it directly as a function of ecological drivers (termites, bees, Collembola: Liu et al., 2022; Orr et al., 2021; Potapov et al., 2023, respectively), species-level modelling within MCPs to delimit ranges, or modelling following the building of a comprehensive point-based database (ants: Kass et al., 2022). Even for smaller regions with high diversity, filtering suitable habitat within a polygon provides a much more targeted and realistic understanding of species ranges (Chesshire et al., 2023). In all of these studies, the impact of data biases was limited either through interpolation of richness for data-poor regions or via aggregation of further data followed by modelling, and in all of them habitat filters were applied to ensure that analyses were accurate. If these steps are not taken, results could be misleading and, at times, counterproductive, given the limited funds and support available for conservation and area protection.