Queueing GBIF download requests

We have an R client for GBIF (the Global Biodiversity Information Facility, http://www.gbif.org/) called rgbif: https://github.com/ropensci/rgbif

One set of functions in the package interacts with GBIF's download service (http://www.gbif.org/developer/occurrence#download) - see the rgbif functions prefixed with occ_download.

One issue users may run into is that GBIF limits each user to 3 concurrent download requests (that is, no more than 3 running at the same time). You kick off a download request in rgbif with occ_download. One solution is to manually wait for your first 3 requests to finish, then submit the next 3, and so on.
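That manual approach looks roughly like the sketch below (using the same string-style queries as the examples later in this post): submit a request, then poll occ_download_meta until GBIF reports it finished before submitting another.

library(rgbif)

x <- occ_download('taxonKey = 3119195', "year = 1976")
repeat {
  status <- tolower(occ_download_meta(x)$status)
  if (status %in% c("succeeded", "killed")) break
  Sys.sleep(10)  # downloads can take a while, no need to poll aggressively
}
# x is done - a slot is now free for the next request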

BUT, this is programming, so we can do better, right?

I wrote a little function that tries to help users who need to submit lots of download requests do so without having to handle each set of 3 manually.

gbif_queue <- function(...) {
  # capture the occ_download() calls unevaluated so we can run them in batches
  reqs <- lazyeval::lazy_dots(...)
  results <- list()
  # split the requests into groups of 3, GBIF's concurrent request limit
  groups <- split(reqs, ceiling(seq_along(reqs)/3))

  for (i in seq_along(groups)) {
    cat("running group of three: ", i, "\n")
    # submit each request in this group, catching any http errors
    res <- lapply(groups[[i]], function(w) {
      tmp <- tryCatch(lazyeval::lazy_eval(w), error = function(e) e)
      if (inherits(tmp, "error")) {
        "http request error"
      } else {
        tmp
      }
    })

    # keep only successfully submitted requests (occ_download objects)
    res_noerrors <- Filter(function(x) inherits(x, "occ_download"), res)
    # poll GBIF until every request in this group reaches a terminal status
    still_running <- TRUE
    while (still_running) {
      metas <- lapply(res_noerrors, occ_download_meta)
      status <- vapply(metas, "[[", "", "status", USE.NAMES = FALSE)
      still_running <- !all(tolower(status) %in%
        c('succeeded', 'killed', 'failed', 'cancelled'))
      Sys.sleep(2)
    }
    results[[i]] <- res
  }

  # flatten the per-group lists into one list of results
  results <- unlist(results, recursive = FALSE)

  return(results)
}
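A note on the design: lazyeval::lazy_dots() captures the occ_download() calls without evaluating them, and lazy_eval() runs each one only when its group of 3 comes up - that's what keeps us under the concurrent request limit. Here's a tiny standalone sketch of that deferral (nothing GBIF-specific about it):

library(lazyeval)

delayed <- lazy_dots(Sys.time(), Sys.time())  # calls captured, not run yet
Sys.sleep(2)
lazy_eval(delayed[[1]])  # evaluated only now, ~2 seconds after capture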

Let’s use it:

library(rgbif)
library(lazyeval)

output <- gbif_queue(
  occ_download('taxonKey = 3119195', "year = 1976"),
  occ_download('taxonKey = 3119195', "year = 2001"),
  occ_download('taxonKey = 3119195', "year = 2001", "month <= 8"),
  occ_download('taxonKey = 5229208', "year = 2011"),
  occ_download('taxonKey = 2480946', "year = 2015"),
  occ_download("country = NZ", "year = 1999", "month = 3"),
  occ_download("catalogNumber = Bird.27847588", "year = 1998", "month = 2")
)

The object output is a list of occ_download class objects, as returned by the function occ_download. Once gbif_queue returns, all the requests you submitted are ready, and you can download them. You can pass them all through lapply or similar, e.g.,

lapply(output, occ_download_get)
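If you also want the data in R right away, one option (a hedged sketch, using only functions already in rgbif) is to chain occ_download_get with occ_download_import, which reads the downloaded zip into a data.frame:

# skip any submissions that errored, then fetch and import each download
done <- Filter(function(x) inherits(x, "occ_download"), output)
dats <- lapply(done, function(x) occ_download_import(occ_download_get(x)))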

The gbif_queue function is bound to be buggy and may need more features (e.g., optionally downloading the requests to your machine within the function once they are ready).
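That last idea might look something like this - a purely hypothetical sketch (gbif_queue_get and its path argument are made up here, not part of rgbif):

# queue the requests, then fetch each finished one to disk
gbif_queue_get <- function(..., path = ".") {
  out <- gbif_queue(...)
  done <- Filter(function(x) inherits(x, "occ_download"), out)
  lapply(done, occ_download_get, path = path)
}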

Thoughts? Does it work for you?