GBIF downloads via rgbif - a fix and example

When you need A LOT of data from the Global Biodiversity Information Facility (GBIF) you may want to use the GBIF download web service. rgbif has an interface for GBIF downloads in the functions prefixed with occ_download.

Thanks to a recent fix to rgbif, you can now do more complex download queries that used to fail.

In the example below, we search for multiple taxon keys, with multiple basisOfRecord values, and with various other requirements (country, hasCoordinate, hasGeospatialIssue), as well as two range queries.

Note that with occ_data or occ_search you can do a range query by passing both bounds in one comma-separated value, like

library("rgbif")
res <- occ_data(depth='50,100', limit = 20)
res$data$depth
#> [1] 81 81 81 81 81 81 81 81 81 81 81 81 81 81 66 94 65 65 65 65

Here we want records with depth values between 50 and 100. You CAN do range queries with occ_download too, but you have to split each range into separate statements, like depth >= 50 and depth <= 100.
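
If you already have ranges written in the occ_data style, splitting them by hand gets tedious. Here's a minimal sketch of a helper that does the split. Note that range_to_statements is a made-up name for illustration and is not part of rgbif.

```r
# Hypothetical helper (not part of rgbif): turn an occ_data()-style range
# such as "50,100" into the two statements occ_download() expects.
range_to_statements <- function(key, range) {
  # split "50,100" on the comma and drop surrounding whitespace
  bounds <- trimws(strsplit(range, ",", fixed = TRUE)[[1]])
  stopifnot(length(bounds) == 2)
  # lower bound first, then upper bound
  c(paste(key, ">=", bounds[1]), paste(key, "<=", bounds[2]))
}

range_to_statements("depth", "50,100")
#> [1] "depth >= 50"  "depth <= 100"
```

The two strings that come back are in the same form as the year and month statements used in the occ_download example.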

Here’s the occ_download example:

res <- occ_download(
 "taxonKey = 2480946,5229208",
 "basisOfRecord = HUMAN_OBSERVATION,OBSERVATION,MACHINE_OBSERVATION",
 "country = US",
 "hasCoordinate = true",
 "hasGeospatialIssue = false",
 "year >= 1999",
 "year <= 2011",
 "month >= 3",
 "month <= 8"
)

There are a few things to explain here.

First, instead of passing in a vector of taxon keys, we need to pass in a single character string with the taxon keys separated by commas. In the request sent to GBIF, it gets parsed out like:

"type": "or",
"predicates": [
  {
    "type": "equals",
    "key": "TAXON_KEY",
    "value": "2480946"
  },
  {
    "type": "equals",
    "key": "TAXON_KEY",
    "value": "5229208"
  }
]

In addition, there are multiple values of basisOfRecord, and those similarly are parsed:

"type": "or",
"predicates": [
  {
    "type": "equals",
    "key": "BASIS_OF_RECORD",
    "value": "HUMAN_OBSERVATION"
  },
  {
    "type": "equals",
    "key": "BASIS_OF_RECORD",
    "value": "OBSERVATION"
  },
  {
    "type": "equals",
    "key": "BASIS_OF_RECORD",
    "value": "MACHINE_OBSERVATION"
  }
]
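
To make the expansion concrete, here's a rough sketch of how a comma-separated "key = value" statement could be turned into the nested structure above. This is an illustration only: parse_equals is a hypothetical function, it only handles "=" statements, and it is not how rgbif does it internally.

```r
# Illustration only: a hypothetical parser, not rgbif's actual internals.
# Handles "key = v1,v2,..." statements; range operators are rejected.
parse_equals <- function(statement) {
  stopifnot(!grepl("[<>!]", statement))
  parts <- trimws(strsplit(statement, "=", fixed = TRUE)[[1]])
  stopifnot(length(parts) == 2)
  # camelCase -> UPPER_SNAKE_CASE, e.g. taxonKey -> TAXON_KEY
  key <- toupper(gsub("([a-z])([A-Z])", "\\1_\\2", parts[1]))
  # one "equals" predicate per comma-separated value
  values <- trimws(strsplit(parts[2], ",", fixed = TRUE)[[1]])
  preds <- lapply(values, function(v) list(type = "equals", key = key, value = v))
  # a single value needs no "or" wrapper
  if (length(preds) == 1) preds[[1]] else list(type = "or", predicates = preds)
}

p <- parse_equals("taxonKey = 2480946,5229208")
str(p)
```

A statement with a single value, like "country = US", comes back as one plain equals predicate with no "or" around it.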

The remainder of the query terms are individually sent and are more straightforward.

We can check whether the download has succeeded like

occ_download_meta(res)
#> <<gbif download metadata>>
#>   Status: SUCCEEDED
#>   Download key: 0053344-160910150852091
#>   Created: 2017-01-19T23:46:23.536+0000
#>   Modified: 2017-01-19T23:47:03.791+0000
#>   Download link: http://api.gbif.org/v1/occurrence/download/request/0053344-160910150852091.zip
#>   Total records: 21
#>   Request: 
#>     type:  and
#>     predicates: 
#>       > type:  or
#>         predicates: 
#>           - type: equals, key: TAXON_KEY, value: 2480946
#>           - type: equals, key: TAXON_KEY, value: 5229208
#>       > type:  or
#>         predicates: 
#>           - type: equals, key: BASIS_OF_RECORD, value: HUMAN_OBSERVATION
#>           - type: equals, key: BASIS_OF_RECORD, value: OBSERVATION
#>           - type: equals, key: BASIS_OF_RECORD, value: MACHINE_OBSERVATION
#>       > type: equals, key: COUNTRY, value: US
#>       > type: equals, key: HAS_COORDINATE, value: true
#>       > type: equals, key: HAS_GEOSPATIAL_ISSUE, value: false
#>       > type: greaterThanOrEquals, key: YEAR, value: 1999
#>       > type: lessThanOrEquals, key: YEAR, value: 2011
#>       > type: greaterThanOrEquals, key: MONTH, value: 3
#>       > type: lessThanOrEquals, key: MONTH, value: 8

This gives us the status of the request, as well as the query terms broken out into a summary.
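
GBIF prepares downloads in the background, so the status may not be SUCCEEDED the first time you check. Below is a minimal sketch of a polling loop. wait_for_status is a hypothetical helper, not an rgbif function, and the set of final states is my assumption about the statuses GBIF reports. It takes any function that returns the current status as a string, so the demo uses a stand-in rather than a real request.

```r
# Hypothetical helper: call get_status() repeatedly until it reports a
# final state, waiting `interval` seconds between checks.
wait_for_status <- function(get_status,
                            done = c("SUCCEEDED", "KILLED", "FAILED", "CANCELLED"),
                            interval = 30, max_tries = 60) {
  stopifnot(max_tries >= 1)
  for (i in seq_len(max_tries)) {
    status <- get_status()
    if (status %in% done) return(status)
    Sys.sleep(interval)
  }
  stop("gave up waiting; last status: ", status)
}

# Demo with a stand-in status function (no network access needed):
# reports RUNNING twice, then SUCCEEDED.
fake_status <- local({
  n <- 0
  function() {
    n <<- n + 1
    if (n < 3) "RUNNING" else "SUCCEEDED"
  }
})
wait_for_status(fake_status, interval = 0)
```

With a real download you could try something like wait_for_status(function() occ_download_meta(res)$status), assuming the metadata object has a status element matching the Status line printed above.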

Then we can download the data and import it like

library("magrittr") # provides the %>% pipe, if it is not already loaded
occ_download_get(res, overwrite = TRUE) %>% occ_download_import()
#> Download file size: 0.01 MB
#> 
#> # A tibble: 21 × 235
#>       gbifID abstract accessRights accrualMethod accrualPeriodicity accrualPolicy alternative audience available bibliographicCitation
#>        <int>    <lgl>        <lgl>         <lgl>              <lgl>         <lgl>       <lgl>    <lgl>     <lgl>                 <lgl>
#> 1  921520147       NA           NA            NA                 NA            NA          NA       NA        NA                    NA
#> 2  921518772       NA           NA            NA                 NA            NA          NA       NA        NA                    NA
#> 3  921518579       NA           NA            NA                 NA            NA          NA       NA        NA                    NA
#> 4  921514006       NA           NA            NA                 NA            NA          NA       NA        NA                    NA
#> 5  921509738       NA           NA            NA                 NA            NA          NA       NA        NA                    NA
#> 6  921509287       NA           NA            NA                 NA            NA          NA       NA        NA                    NA
#> 7  921508992       NA           NA            NA                 NA            NA          NA       NA        NA                    NA
#> 8  921508553       NA           NA            NA                 NA            NA          NA       NA        NA                    NA
#> 9  921508299       NA           NA            NA                 NA            NA          NA       NA        NA                    NA
#> 10 921508060       NA           NA            NA                 NA            NA          NA       NA        NA                    NA
#> # ... with 11 more rows, and 225 more variables: conformsTo <lgl>, contributor <lgl>, coverage <lgl>, created <lgl>, creator <lgl>, date <lgl>,
#> #   dateAccepted <lgl>, dateCopyrighted <lgl>, dateSubmitted <lgl>, description <lgl>, educationLevel <lgl>, extent <lgl>, format <lgl>,
#> #   hasFormat <lgl>, hasPart <lgl>, hasVersion <lgl>, identifier <lgl>, instructionalMethod <lgl>, isFormatOf <lgl>, isPartOf <lgl>,
#> #   isReferencedBy <lgl>, isReplacedBy <lgl>, isRequiredBy <lgl>, isVersionOf <lgl>, issued <lgl>, language <lgl>, license <chr>,
#> #   mediator <lgl>, medium <lgl>, modified <lgl>, provenance <lgl>, publisher <lgl>, references <lgl>, relation <lgl>, replaces <lgl>,
#> #   requires <lgl>, rights <lgl>, rightsHolder <lgl>, source <lgl>, spatial <lgl>, subject <lgl>, tableOfContents <lgl>, temporal <lgl>,
#> #   title <lgl>, type <lgl>, valid <lgl>, institutionID <lgl>, collectionID <lgl>, datasetID <lgl>, institutionCode <chr>, collectionCode <chr>,
#> #   datasetName <lgl>, ownerInstitutionCode <lgl>, basisOfRecord <chr>, informationWithheld <lgl>, dataGeneralizations <lgl>,
#> #   dynamicProperties <lgl>, occurrenceID <lgl>, catalogNumber <int>, recordNumber <lgl>, recordedBy <int>, individualCount <lgl>,
#> #   organismQuantity <lgl>, organismQuantityType <lgl>, sex <lgl>, lifeStage <lgl>, reproductiveCondition <lgl>, behavior <lgl>,
#> #   establishmentMeans <lgl>, occurrenceStatus <lgl>, preparations <lgl>, disposition <lgl>, associatedReferences <lgl>,
#> #   associatedSequences <lgl>, associatedTaxa <lgl>, otherCatalogNumbers <lgl>, occurrenceRemarks <lgl>, organismID <lgl>, organismName <lgl>,
#> #   organismScope <lgl>, associatedOccurrences <lgl>, associatedOrganisms <lgl>, previousIdentifications <lgl>, organismRemarks <lgl>,
#> #   materialSampleID <lgl>, eventID <lgl>, parentEventID <lgl>, fieldNumber <lgl>, eventDate <chr>, eventTime <lgl>, startDayOfYear <lgl>,
#> #   endDayOfYear <lgl>, year <int>, month <int>, day <int>, verbatimEventDate <lgl>, habitat <lgl>, samplingProtocol <lgl>,
#> #   samplingEffort <lgl>, sampleSizeValue <lgl>, ..

That’s it! :rocket:

Let me know if you run into any problems.

We’ll be sending a new version of rgbif to CRAN soon with the fix mentioned above, along with other fixes.