Query GBIF for occurrence records with WWF ecoregions/biomes

rOpenSci package or resource used

rgbif

What did you do?

A user asked how to query GBIF (http://www.gbif.org/) using rgbif while constraining the search spatially to World Wildlife Fund (WWF) terrestrial ecoregions.

One issue is getting the ecoregions. That’s relatively straightforward:

library(curl)
curl::curl_download("http://assets.worldwildlife.org/publications/15/files/original/official_teow.zip?1349272619", destfile = "official.zip")
unzip("official.zip")
shpfile <- "official/wwf_terr_ecos.shp"

Then read the shapefile in, here using geojsonio:

library(geojsonio)
shp <- geojson_read(shpfile, method = "local", what = "sp")

Then it becomes a bit tricky. GBIF does accept a way to limit your search to a specific geospatial area, but there are a number of different paths you can take:

GBIF Search API

This is the API route /occurrence/search/ and the functions rgbif::occ_search and rgbif::occ_data

The problem with this route is that the number of occurrences you can get is capped at essentially 200K.

In addition, because this route uses HTTP GET requests, your entire query has to go in the URL. GBIF accepts geometry only as Well-known Text (WKT), a string representation of geospatial data, and WKT can get quite long. So you can quickly run into the maximum URL length, after which you get an error.
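To get a feel for how quickly WKT outgrows a URL, here is a rough base R sketch. The polygon is synthetic (a circle with many vertices), and the 2,000-character budget is only an illustrative assumption, not a documented GBIF or HTTP limit.

```r
# Build a synthetic polygon ring (a circle) with n vertices; purely illustrative.
make_wkt <- function(n, digits = 10) {
  theta <- seq(0, 2 * pi, length.out = n + 1)[-(n + 1)]
  x <- -89 + cos(theta)
  y <- 20 + sin(theta)
  # close the ring by repeating the first vertex
  x <- c(x, x[1])
  y <- c(y, y[1])
  coords <- paste(
    formatC(x, format = "f", digits = digits),
    formatC(y, format = "f", digits = digits)
  )
  paste0("POLYGON((", paste(coords, collapse = ", "), "))")
}

wkt_detailed <- make_wkt(500)            # many vertices, many decimals
wkt_small <- make_wkt(50, digits = 5)    # fewer vertices, fewer decimals

# hypothetical character budget, for illustration only
url_budget <- 2000
nchar(wkt_detailed) > url_budget
nchar(wkt_small) < url_budget
```

Keep in mind that URL encoding of spaces, commas, and parentheses would make the real query string longer still.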

GBIF Download API

The other option is the download API, using the route /occurrence/download/, and the functions that are prefixed with rgbif::occ_download*

These functions have a slightly different interface than rgbif::occ_search and rgbif::occ_data, so one downside is learning a new query interface. On the plus side, that interface is more flexible, and you can get as much data as you like. Another downside is that you don’t get the data immediately; you have to wait for the file to be prepared. But don’t worry! We make it easy to get the file and import it without leaving R.

Here’s one way to do it:

We’ll need the development version of rgbif, which has some fixes we need, so install it from GitHub with devtools::install_github("ropensci/rgbif"), then load libraries:

library(rgeos)
library(sp)
library(rgbif)
library(rmapshaper)
library(wellknown)

We’ll pick just one region, selecting it like so:

ymf <- shp[which(rowSums(shp@data == "Yucatán moist forests", na.rm = TRUE) > 0), ] 
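If the rowSums() idiom above looks opaque, this small sketch shows what it does on a plain data frame. The column names and values here are made up for illustration and are not taken from the WWF attribute table.

```r
# Toy attribute table; column names and values are hypothetical.
attrs <- data.frame(
  ECO_NAME = c("Region A", "Region B", "Region C"),
  BIOME = c(1, 1, 14),
  stringsAsFactors = FALSE
)

# Compare every cell to the target string, then flag rows with any match.
# na.rm = TRUE keeps rows with missing cells from turning into NA.
hits <- rowSums(attrs == "Region B", na.rm = TRUE) > 0
matching_rows <- which(hits)
attrs[matching_rows, ]
```

Searching every column is handy when you don’t know which field holds the name. If you do know the column, comparing against that column directly is more explicit.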

Since WKT can be very verbose, and some polygons very detailed, we can reduce
the length of the WKT by reducing digits after the decimal as well as reducing
detail in general with rmapshaper.

polys <- lapply(ymf@polygons, function(z) {
  # need a SpatialPolygon to use writeWKT
  spoly <- SpatialPolygons(list(z), 1L)
  # simplify the SpatialPolygon
  spoly_simp <- rmapshaper::ms_simplify(spoly)
  # make WKT
  tmp <- rgeos::writeWKT(spoly_simp)
  # reduce decimal places
  geojson2wkt(unclass(wkt2geojson(tmp, feature = FALSE)), fmt = 5)
})

You may not need to do this step above of simplifying the WKT when using the download API. I haven’t explored yet whether GBIF has limits on how long WKT strings can be. Worth trying it out!
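As an aside, the decimal-trimming part can also be sketched in base R without the round trip through GeoJSON. This is not what the code above does; round_wkt() is a hypothetical helper that only rewrites numbers carrying more decimals than requested, and it does not check that the geometry stays valid after rounding.

```r
# Round numbers in a WKT string that have more than `digits` decimal places.
round_wkt <- function(wkt, digits = 5) {
  # match numbers with at least digits + 1 decimals, e.g. -89.123456789
  pattern <- paste0("-?[0-9]+\\.[0-9]{", digits + 1, ",}")
  m <- gregexpr(pattern, wkt)
  regmatches(wkt, m) <- lapply(regmatches(wkt, m), function(nums) {
    formatC(as.numeric(nums), format = "f", digits = digits)
  })
  wkt
}

wkt <- "POLYGON((-89.123456789 20.987654321, -88.5 20.25, -89.123456789 20.987654321))"
short_wkt <- round_wkt(wkt, digits = 5)
short_wkt
nchar(wkt) - nchar(short_wkt)  # characters saved
```

Five decimal places of a degree is on the order of a metre at the equator, which is usually far finer than an ecoregion boundary needs.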

Run a for loop over the polygons: for each one, request the data download,
then run a while loop to check the status, and download when it’s ready.

out <- list()
for (i in seq_along(polys)) {
  # request data
  res <- occ_download(paste0("geometry within ", polys[[i]]))

  # while loop to check status
  stat <- "PREPARING"
  while(stat %in% c('PREPARING', 'RUNNING')) {
    met <- occ_download_meta(res)
    stat <- met$status
    Sys.sleep(1)
  }

  # once ready, download and import
  out[[i]] <- occ_download_import(occ_download_get(res))
}

With occ_download_get you can optionally write the file to disk and skip the import, which is probably what you want to do: you can’t predict how large the data will be, and importing reads it all into memory.
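The status-checking loop can also be written with a longer pause and an upper bound on attempts, so it neither hammers the server nor spins forever. Here is a minimal sketch of that pattern. wait_for_status() and the mock status function are hypothetical; in real use check_status might be something like function() occ_download_meta(res)$status, and the default interval is just a guess at a polite value.

```r
# Poll a status function until the job leaves the "in progress" states.
# `check_status` is a hypothetical stand-in for a real status lookup.
wait_for_status <- function(check_status, interval = 30, max_tries = 120) {
  for (attempt in seq_len(max_tries)) {
    status <- check_status()
    if (!status %in% c("PREPARING", "RUNNING")) {
      return(status)
    }
    Sys.sleep(interval)
  }
  stop("gave up after ", max_tries, " status checks")
}

# Mock status source so the sketch runs without a GBIF account.
fake_statuses <- c("PREPARING", "RUNNING", "RUNNING", "SUCCEEDED")
calls <- 0
fake_check <- function() {
  calls <<- calls + 1
  fake_statuses[min(calls, length(fake_statuses))]
}

# interval = 0 only because the statuses are faked
final <- wait_for_status(fake_check, interval = 0)
final
```

Returning the final status rather than assuming success lets the caller decide what to do when a job ends in some other state.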

This post has been super helpful for my work. Thanks! I have a related question. I would like to query GBIF while constraining the search spatially using TDWG level 4 regions.

I have made a list of WKT polygons. I ran a loop for each polygon with my search terms.

I’m running into a problem: when one polygon has no data, the queue stops running completely.

Error in FUN(X[[i]], …) : !is.null(key) is not TRUE

Is there a way to “skip” over polygons with no data and continue on with the query? My code:

queries <- list()
for (i in seq_along(polys)) {
  queries[[i]] <-
    occ_download_prep(
      pred("taxonKey", 7707728),
      pred_in("basisOfRecord", c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION", "OBSERVATION", "MACHINE_OBSERVATION")),
      pred("geometry", polys[[i]]),
      pred("hasCoordinate", TRUE),
      pred("hasGeospatialIssue", FALSE),
      pred_gte("year", 1900),
      format = "SIMPLE_CSV",
      user = user,
      pwd = pwd,
      email = email
    )
}

First time posting so apologies for any errors.

Thanks for your question @rkirsten

It’s hard to pinpoint exactly where the error is coming from given your code above. Can you share a reproducible bit of code that produces that error?

Thanks for the response @sckott. I can definitely share a reproducible example but before I do, maybe I’ll ask the overarching question in a better way.

I’m looking to retrieve global species distributions (just lat/long) for Tracheophyta (taxon ID 7707728). I know it’s a big job. I’ve tried a few different ways of doing this, but maybe I’ll just ask: how would you go about doing this in an efficient way that doesn’t overload the GBIF server?

I’ll post a reproducible example and troubleshoot this particular query if you’d rather go that route.

Thanks!

@rkirsten Is this helpful at all? I realize it’s not exactly the same question: “rgbif: get a little or alot of occurrence data”

Thanks @stefanie! It helps a little but I’m trying to do the opposite: instead of a list of species I have a list of regions (polygons). I want to query GBIF for each region to get a list of species in phylum Tracheophyta occurring in that region.

I’d do what you are doing above in your example code. We just need to sort out what’s going wrong. Share a reproducible example and we can get you going.

Here’s a reproducible example of what I’m doing. I am using ecoregions2017 from the Nature Needs Half folks for this example.

# load the packages used below
library(sp)
library(rgdal)     # readOGR
library(rgeos)     # writeWKT
library(wellknown) # geojson2wkt, wkt2geojson
library(rgbif)

# first get ecoregions

curl::curl_download("https://storage.googleapis.com/teow2016/Ecoregions2017.zip", destfile = "Ecoregions2017.zip")
unzip("Ecoregions2017.zip")
Ecoregshp <- readOGR(dsn = "Ecoregions2017.shp")

# create WKT polygons and reduce their detail

Ecopolys <- lapply(Ecoregshp@polygons, function(z) {
  # need a SpatialPolygon to use writeWKT
  spoly <- SpatialPolygons(list(z), 1L)
  # make WKT
  tmp <- rgeos::writeWKT(spoly)
  # reduce decimal places
  geojson2wkt(unclass(wkt2geojson(tmp, feature = FALSE)), fmt = 5)
})

# user name and info for GBIF inquiries

user = "yourusername"
pwd = "yourpassword"
email = "youremail"

# take a subset for sake of time

Ecopolys.example <- Ecopolys[1:10]

# allow for a longer timeout

library(httr)
GET("api.gbif.org", timeout(20))

# using pre-prepared requests via .list

queries <- list()
for (i in seq_along(Ecopolys.example)) {
  queries[[i]] <-
    occ_download_prep(
      pred("taxonKey", 7707728),
      pred_in("basisOfRecord", c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION", "OBSERVATION", "MACHINE_OBSERVATION")),
      pred("geometry", Ecopolys.example[[i]]),
      pred("hasCoordinate", TRUE),
      pred("hasGeospatialIssue", FALSE),
      pred_gte("year", 1900),
      format = "SIMPLE_CSV",
      user = user,
      pwd = pwd,
      email = email
    )
}

# download those queries

out.example <- occ_download_queue(.list = queries)

# import the downloads

import <- list()
for (i in seq_along(out.example)) {
  out.df <- out.example[[i]]
  get.out <- occ_download_get(out.df, overwrite = TRUE)
  import[[i]] <- occ_download_import(get.out)
}

I get the error message after “running 6 of 10”:

3 requests, waiting for completion
0048936-200221144449610: succeeded
running 4 of 10
running 5 of 10
0048935-200221144449610: succeeded
running 6 of 10
Error in FUN(X[[i]], …) : !is.null(key) is not TRUE

Any help with this would be much appreciated!! Thanks. Again, apologies for any errors here (I’m still new!)

Thanks @rkirsten - I’ll have a look and get back to you soon

@rkirsten Try again after restarting R and reinstalling with remotes::install_github("ropensci/rgbif") or install.packages("rgbif", repos = "https://dev.ropensci.org")

There was a problem with some internal job status checking; it should be fixed now.

@sckott Thanks! Got it! :grinning:

New error message after attempting import (bit and bit64 were downloaded). Shall I post it here or start a new thread?

You can pop it in here …

(Using the same example from above) After downloading the requests, I use the following code to import:

# import the downloads
import <- list()
for (i in seq_along(out.example)) {
  out.df <- out.example[[i]]
  get.out <- occ_download_get(out.df, overwrite = TRUE)
  import[[i]] <- occ_download_import(get.out)
}

And after a few complete, I get an error message:

Download file size: 0 MB
On disk at ./0075134-200221144449610.zip
Download file size: 0.07 MB
On disk at ./0075135-200221144449610.zip
Download file size: 0.26 MB
On disk at ./0075136-200221144449610.zip
Error in occ_download_meta(key) : !is.null(key) is not TRUE

The error message isn’t very explanatory so I’m not sure what’s going on there.

Thanks for any help!!

Had a look. The out.example object must have some elements that are not valid inputs for occ_download_get(). If you inspect out.example, I imagine you’ll find one or more elements in the list that are empty or similar. You can simply remove those before passing the list into your for loop to download and import results. Let me know what you find.
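A rough sketch of that clean-up step is below. I don’t know exactly what the bad elements look like in your out.example, so the list here is a made-up stand-in with NULL and zero-length entries, and import_one() is a hypothetical placeholder for the occ_download_get() plus occ_download_import() calls. The tryCatch() is an extra safety net so one bad element doesn’t stop the whole loop.

```r
# Stand-in for the list returned by the queue; NULL and zero-length
# entries mimic requests that did not yield a usable download key.
out_example <- list("key-001", NULL, "key-002", character(0), "key-003")

# Keep only elements that look like usable keys.
is_usable <- function(x) !is.null(x) && length(x) > 0 && !anyNA(x)
valid <- Filter(is_usable, out_example)

# Hypothetical importer; in real use this would wrap
# occ_download_get() and occ_download_import().
import_one <- function(key) {
  if (key == "key-002") stop("simulated failure for ", key)
  data.frame(key = key, n = 1L, stringsAsFactors = FALSE)
}

# tryCatch lets the loop carry on when a single element fails.
imported <- lapply(valid, function(key) {
  tryCatch(import_one(key), error = function(e) {
    message("skipping ", key, ": ", conditionMessage(e))
    NULL
  })
})

# Drop the failures so only successful imports remain.
imported <- Filter(Negate(is.null), imported)
length(imported)
```

Logging which elements were skipped makes it easy to go back and re-request just those regions later.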
