Query GBIF for occurrence records with WWF ecoregions/biomes

A user asked about querying GBIF (http://www.gbif.org/) with rgbif while constraining the search spatially to World Wildlife Fund terrestrial ecoregions.

One issue is getting the ecoregions. That’s relatively straightforward:

curl::curl_download("http://assets.worldwildlife.org/publications/15/files/original/official_teow.zip?1349272619", destfile = "official.zip")
# extract the archive; the path below assumes it unpacks into official/
unzip("official.zip")
shpfile <- "official/wwf_terr_ecos.shp"

Then read the shapefile in, here using geojsonio:

shp <- geojsonio::geojson_read(shpfile, method = "local", what = "sp")

Then it becomes a bit tricky. GBIF does let you limit a search to a specific geospatial area, but there are different paths you can take:


GBIF Search API

This is the API route /occurrence/search/, used by the functions rgbif::occ_search and rgbif::occ_data.

The problem with this route is that the number of occurrences you can retrieve is capped at essentially 200K.

In addition, because this route uses HTTP GET requests, your entire query has to go in the URL. GBIF accepts geometries only as Well-known Text (WKT), a string representation of geospatial data, and that string can be quite long. So you can quickly hit the maximum URL length, after which you get an error.
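To get a feel for how quickly WKT inflates a GET URL, here is a rough sketch in base R. The polygon is made up (vertices on a circle near the Yucatán), make_wkt and url_length are hypothetical helpers, and I'm not assuming any particular server limit; the point is only that URL length grows with vertex count.

```r
# Build a WKT polygon from n vertices on a circle (made-up coordinates)
make_wkt <- function(n, digits = 6) {
  theta <- seq(0, 2 * pi, length.out = n + 1)[-(n + 1)]
  lon <- round(-89 + cos(theta), digits)
  lat <- round(19 + sin(theta), digits)
  # close the ring by repeating the first vertex
  lon <- c(lon, lon[1])
  lat <- c(lat, lat[1])
  paste0("POLYGON((", paste(lon, lat, collapse = ", "), "))")
}

# Length of a search URL carrying the URL-encoded WKT as the geometry parameter
url_length <- function(wkt) {
  nchar(paste0(
    "https://api.gbif.org/v1/occurrence/search?geometry=",
    utils::URLencode(wkt, reserved = TRUE)
  ))
}

# URL length for polygons with 10, 100 and 1000 vertices
sizes <- vapply(c(10, 100, 1000), function(n) url_length(make_wkt(n)), integer(1))
sizes
```

Real ecoregion polygons can have far more vertices than this, which is why simplifying the geometry or switching to the download API helps.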

GBIF Download API

The other option is the download API, which uses the route /occurrence/download/ and the functions prefixed with rgbif::occ_download*.

These functions have a slightly different interface than rgbif::occ_search and rgbif::occ_data, so there’s the downside of learning a new query interface. The upside is that the query interface is more flexible, and you can get as much data as you like. Another downside is that you don’t get the data immediately; you have to wait for the file to be prepared. But don’t worry! We make it easy to get the file and import it without leaving R.

Here’s one way to do it:

We’ll need the development version of rgbif, which has some fixes we need, so install it from GitHub with devtools::install_github("ropensci/rgbif"), then load libraries:
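The library calls themselves aren't shown here. Going by the functions used in the rest of this post, a plausible set is sketched below. The package list is my inference, not something confirmed by the original: wellknown for geojson2wkt/wkt2geojson, sp for SpatialPolygons, rgeos for writeWKT, and magrittr for the pipe.

```r
# Packages inferred from the functions used in this post (an assumption)
pkgs <- c("rgbif", "geojsonio", "sp", "rgeos", "rmapshaper", "wellknown", "magrittr")

# Check which ones are available before attaching anything
not_installed <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
} else {
  # attach all packages so the unqualified calls below work
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```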


Picking just one region, select it like so:

ymf <- shp[which(rowSums(shp@data == "Yucatán moist forests", na.rm = TRUE) > 0), ] 

Since WKT can be very verbose, and some polygons very detailed, we can reduce
the length of the WKT by reducing digits after the decimal as well as reducing
detail in general with rmapshaper.

polys <- lapply(ymf@polygons, function(z) {
  # need a SpatialPolygons object to use writeWKT
  spoly <- SpatialPolygons(list(z), 1L)
  # simplify the SpatialPolygons object
  spoly_simp <- rmapshaper::ms_simplify(spoly)
  # make WKT
  tmp <- rgeos::writeWKT(spoly_simp)
  # reduce decimal places by round-tripping through GeoJSON
  geojson2wkt(unclass(wkt2geojson(tmp, feature = FALSE)), fmt = 5)
})

You may not need the simplification step above when using the download API. I haven’t yet explored whether GBIF limits how long WKT strings can be. Worth trying it out!
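If you only want to trim decimal places and don't want the GeoJSON round trip, a regex-based sketch in base R is below. round_wkt is a hypothetical helper, not part of any package, and it assumes coordinates are written in plain decimal notation (no exponents).

```r
# Round every decimal number found in a WKT string to `digits` places
round_wkt <- function(wkt, digits = 5) {
  pattern <- "-?[0-9]+\\.[0-9]+"
  m <- gregexpr(pattern, wkt)
  regmatches(wkt, m) <- lapply(regmatches(wkt, m), function(x) {
    # fixed-point formatting avoids scientific notation in the output
    formatC(as.numeric(x), format = "f", digits = digits)
  })
  wkt
}

round_wkt("POINT(-89.123456789 19.5)", digits = 2)
# → "POINT(-89.12 19.50)"
```

Because identical input numbers round to identical output, a closed ring whose first and last vertices match stays closed.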

Loop over the polygons: for each one, request the data download, poll its
status in a while loop, and download the result once it’s ready:

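out <- list()
for (i in seq_along(polys)) {
  # request data; this query-string form matches the rgbif version used here
  # (newer rgbif releases build the query with helpers such as pred_within())
  res <- occ_download(paste0("geometry within ", polys[[i]]))

  # poll the download status until it is no longer pending
  stat <- "PREPARING"
  while (stat %in% c("PREPARING", "RUNNING")) {
    Sys.sleep(30)  # pause between checks rather than hammering the API
    met <- occ_download_meta(res)
    stat <- met$status
  }

  # once ready, download and import
  out[[i]] <- occ_download_get(res) %>% occ_download_import()
}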

With occ_download_get you can instead write the file to disk and skip the import, which is probably what you want, since you can’t predict how large the data will be and importing reads it all into memory.
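A status-polling loop like the one above can be factored into a small helper that pauses between checks and gives up after a fixed number of attempts. This is a sketch, not part of rgbif: wait_for and make_fake_status are hypothetical names, and the fake status function stands in for something like function() occ_download_meta(res)$status so the example runs offline.

```r
# Call get_status() until it returns something other than a pending status
wait_for <- function(get_status, pending = c("PREPARING", "RUNNING"),
                     interval = 30, max_tries = 120) {
  for (attempt in seq_len(max_tries)) {
    status <- get_status()
    if (!status %in% pending) return(status)
    Sys.sleep(interval)  # wait before asking again
  }
  stop("gave up after ", max_tries, " status checks")
}

# Stand-in status source that walks through a fixed sequence of statuses
make_fake_status <- function(statuses) {
  i <- 0
  function() {
    i <<- i + 1
    statuses[i]
  }
}

fake_status <- make_fake_status(c("PREPARING", "RUNNING", "RUNNING", "SUCCEEDED"))
final <- wait_for(fake_status, interval = 0)
final
# → "SUCCEEDED"
```

Returning the final status, rather than assuming success, lets the caller decide what to do if the download did not succeed.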