rOpenSci package or resource used*
rgbif
What did you do?
A user asked how to query GBIF (http://www.gbif.org/) with rgbif while constraining the search spatially to World Wildlife Fund terrestrial ecoregions.
One issue is getting the ecoregions. That's relatively straightforward:
library(curl)
curl::curl_download("http://assets.worldwildlife.org/publications/15/files/original/official_teow.zip?1349272619", destfile = "official.zip")
unzip("official.zip")
shpfile <- "official/wwf_terr_ecos.shp"
Then read the shapefile in, here using geojsonio:
library(geojsonio)
shp <- geojson_read(shpfile, method = "local", what = "sp")
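A quick sanity check can't hurt here. In the copy of this shapefile I've seen, ecoregion names live in an ECO_NAME column, but treat that column name as an assumption and check names() first:
class(shp)              # should be a SpatialPolygonsDataFrame
names(shp@data)         # find the ecoregion name column
head(shp@data$ECO_NAME) # ECO_NAME assumed from the WWF shapefile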
Then it becomes a bit tricky: GBIF does provide ways to limit your search to a specific geospatial area, but there are a number of different paths you can take:
GBIF Search API
This is the API route /occurrence/search/ and the functions rgbif::occ_search and rgbif::occ_data.
The problem with this route is that you're limited to essentially 200K occurrence records per query. In addition, because this route uses GET HTTP requests, your entire query has to go in the URL. GBIF accepts geometry only as Well-Known Text (aka WKT, a string representation of geospatial data), which can be quite long, so you can quickly hit the maximum character length for a URL, after which you get an error.
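For small, simple geometries the search route works fine. Here's a minimal sketch of that route; the polygon is a made-up bounding box roughly around the Yucatán, not one of the WWF ecoregions:
library(rgbif)
# a made-up, counter-clockwise bounding box; keep the limit small
res <- occ_search(geometry = "POLYGON((-91 19, -87 19, -87 22, -91 22, -91 19))", limit = 50)
res$data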
GBIF Download API
The other option is the download API, using the route /occurrence/download/ and the functions prefixed with rgbif::occ_download*.
These functions have a slightly different interface than rgbif::occ_search and rgbif::occ_data, so there's the downside of learning a new query interface. The plus side is that the query interface is more flexible, and you can get as much data as you like. Another downside is that you don't get the data immediately, but rather wait for the file to be prepared. But don't worry! We make it easy to get the file and import it without leaving R.
Here’s one way to do it:
We'll need the development version of rgbif, which has some fixes we need, so install it from GitHub with devtools::install_github("ropensci/rgbif"), then load libraries:
library(rgeos)
library(sp)
library(rgbif)
library(rmapshaper)
library(wellknown)
Picking just one region, select it like so:
ymf <- shp[which(rowSums(shp@data == "Yucatán moist forests", na.rm = TRUE) > 0), ]
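A feature can contain more than one polygon, and the download loop below sends one request per polygon, so it's worth checking how many we're about to loop over:
length(ymf@polygons)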
Since WKT can be very verbose, and some polygons very detailed, we can reduce the length of the WKT by reducing digits after the decimal as well as reducing detail in general with rmapshaper:
polys <- lapply(ymf@polygons, function(z) {
# need a SpatialPolygon to use writeWKT
spoly <- SpatialPolygons(list(z), 1L)
# simplify the SpatialPolygon
spoly_simp <- rmapshaper::ms_simplify(spoly)
# make WKT
tmp <- rgeos::writeWKT(spoly_simp)
# reduce decimal places
geojson2wkt(unclass(wkt2geojson(tmp, feature = FALSE)), fmt = 5)
})
You may not need this simplification step when using the download API; I haven't yet explored whether GBIF limits how long WKT strings can be. Worth trying it out!
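If you want a rough sense of whether your WKT strings are getting unwieldy, checking their lengths is cheap (what counts as too long for the download API is exactly the open question here):
vapply(polys, nchar, integer(1))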
Loop over the polygons: for each one, request the data download, run a while loop to check the status, then download and import the data once it's ready:
out <- list()
for (i in seq_along(polys)) {
  # request data
  res <- occ_download(paste0("geometry within ", polys[[i]]))
  # while loop to check status
  stat <- "PREPARING"
  while (stat %in% c("PREPARING", "RUNNING")) {
    met <- occ_download_meta(res)
    stat <- met$status
    Sys.sleep(1)
  }
  # once ready, download and import
  out[[i]] <- occ_download_get(res) %>% occ_download_import()
}
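If you do import everything, the per-polygon tables can be stacked into one; a sketch assuming you have dplyr installed:
library(dplyr)
alldat <- bind_rows(out)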
With occ_download_get you can optionally just write the file to disk and not import it, which is probably what you want since you can't predict how large the data will be, and importing reads it all into memory.
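Here's a sketch of that write-to-disk variant; the directory name is just an example:
dir.create("gbif-data", showWarnings = FALSE)
# fetch the zip but don't read it into memory yet
dl <- occ_download_get(res, path = "gbif-data", overwrite = TRUE)
# import later, once you know you can afford the memory
dat <- occ_download_import(dl)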