When you need A LOT of data from the Global Biodiversity Information Facility (GBIF), you may want to use the GBIF download web service. rgbif has an interface for GBIF downloads in the functions prefixed with occ_download.
Thanks to a recent fix to rgbif, you can now run more complex download queries that failed before the fix.
In the example below, we search for multiple taxon keys, with multiple basisOfRecord values, and with various other requirements (country, hasCoordinate, hasGeospatialIssue), as well as two range queries.
Note that you can do range queries with occ_data or occ_search in a single argument, like
library(rgbif)
res <- occ_data(depth = "50,100", limit = 20)
res$data$depth
#> [1] 81 81 81 81 81 81 81 81 81 81 81 81 81 81 66 94 65 65 65 65
Here we want records with depth values between 50 and 100. You CAN do range queries with occ_download too, but you have to split each range into separate statements, like depth >= 50 and depth <= 100.
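As a rough sketch of that splitting step, the base R helper below turns an occ_data-style range string into the two statements occ_download expects. The name range_to_statements is hypothetical and not part of rgbif; it only builds strings and does not contact GBIF.

```r
# Hypothetical helper (not part of rgbif): split a "min,max" range string
# into two separate comparison statements for a download query.
range_to_statements <- function(key, range) {
  bounds <- trimws(strsplit(range, ",", fixed = TRUE)[[1]])
  stopifnot(length(bounds) == 2)
  c(paste(key, ">=", bounds[1]), paste(key, "<=", bounds[2]))
}

range_to_statements("depth", "50,100")
#> [1] "depth >= 50"  "depth <= 100"
```

Each element of the result can then be passed as its own argument in a download query.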
Here’s the occ_download example:
res <- occ_download(
  "taxonKey = 2480946,5229208",
  "basisOfRecord = HUMAN_OBSERVATION,OBSERVATION,MACHINE_OBSERVATION",
  "country = US",
  "hasCoordinate = true",
  "hasGeospatialIssue = false",
  "year >= 1999",
  "year <= 2011",
  "month >= 3",
  "month <= 8"
)
There are a few things to explain here.
First, instead of passing in a vector of taxon keys, we need to pass in a single character string with the taxon keys separated by commas. In the request sent to GBIF, it gets parsed out like:
"type": "or",
"predicates": [
  {
    "type": "equals",
    "key": "TAXON_KEY",
    "value": "2480946"
  },
  {
    "type": "equals",
    "key": "TAXON_KEY",
    "value": "5229208"
  }
]
In addition, there are multiple values of basisOfRecord, and those are parsed similarly:
"type": "or",
"predicates": [
  {
    "type": "equals",
    "key": "BASIS_OF_RECORD",
    "value": "HUMAN_OBSERVATION"
  },
  {
    "type": "equals",
    "key": "BASIS_OF_RECORD",
    "value": "OBSERVATION"
  },
  {
    "type": "equals",
    "key": "BASIS_OF_RECORD",
    "value": "MACHINE_OBSERVATION"
  }
]
The remaining query terms are each sent as a single predicate and are more straightforward.
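To make the parsing concrete, here is a minimal base R sketch that turns one query statement into a nested list shaped like the JSON fragments above. It is an illustration only, not rgbif's internal code: to_predicate is a hypothetical name, and the camelCase to UPPER_SNAKE key conversion is an assumption inferred from the keys shown in this post.

```r
# Hypothetical helper (not rgbif's implementation): parse one
# "key operator value" statement into a predicate list.
to_predicate <- function(statement) {
  # Operators used in this post and their predicate types
  ops <- c("=" = "equals",
           ">=" = "greaterThanOrEquals",
           "<=" = "lessThanOrEquals")
  m <- regmatches(
    statement,
    regexec("^([A-Za-z]+) *(>=|<=|=) *(.+)$", statement)
  )[[1]]
  if (length(m) != 4) stop("could not parse statement: ", statement)
  # Assumed key conversion: taxonKey -> TAXON_KEY
  key <- toupper(gsub("([a-z])([A-Z])", "\\1_\\2", m[2]))
  type <- unname(ops[m[3]])
  values <- trimws(strsplit(m[4], ",", fixed = TRUE)[[1]])
  preds <- lapply(values, function(v) list(type = type, key = key, value = v))
  # One value stays a single predicate; several are wrapped in an "or"
  if (length(preds) == 1) preds[[1]] else list(type = "or", predicates = preds)
}

str(to_predicate("taxonKey = 2480946,5229208"))
str(to_predicate("year >= 1999"))
```

Wrapping the results for all statements in list(type = "and", predicates = ...) would give the overall shape of the request summarized in the download metadata.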
We can check whether the download has succeeded like this:
occ_download_meta(res)
#> <<gbif download metadata>>
#> Status: SUCCEEDED
#> Download key: 0053344-160910150852091
#> Created: 2017-01-19T23:46:23.536+0000
#> Modified: 2017-01-19T23:47:03.791+0000
#> Download link: http://api.gbif.org/v1/occurrence/download/request/0053344-160910150852091.zip
#> Total records: 21
#> Request:
#> type: and
#> predicates:
#> > type: or
#> predicates:
#> - type: equals, key: TAXON_KEY, value: 2480946
#> - type: equals, key: TAXON_KEY, value: 5229208
#> > type: or
#> predicates:
#> - type: equals, key: BASIS_OF_RECORD, value: HUMAN_OBSERVATION
#> - type: equals, key: BASIS_OF_RECORD, value: OBSERVATION
#> - type: equals, key: BASIS_OF_RECORD, value: MACHINE_OBSERVATION
#> > type: equals, key: COUNTRY, value: US
#> > type: equals, key: HAS_COORDINATE, value: true
#> > type: equals, key: HAS_GEOSPATIAL_ISSUE, value: false
#> > type: greaterThanOrEquals, key: YEAR, value: 1999
#> > type: lessThanOrEquals, key: YEAR, value: 2011
#> > type: greaterThanOrEquals, key: MONTH, value: 3
#> > type: lessThanOrEquals, key: MONTH, value: 8
This gives us the status of the request, as well as a summary of the query terms.
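Downloads are prepared asynchronously, so you may need to check the status more than once. The sketch below shows one way to poll until the request reaches a final state. wait_for_status is a hypothetical helper, not an rgbif function; it takes any function that returns the current status string, so it can be tried here with a fake status source. In real use you might pass something like function() occ_download_meta(res)$status, assuming the metadata object exposes a status element.

```r
# Hypothetical polling helper (not part of rgbif).
# get_status: a function returning the current status as a string.
wait_for_status <- function(get_status, interval = 30, max_tries = 10) {
  finished <- c("SUCCEEDED", "FAILED", "KILLED", "CANCELLED")
  for (i in seq_len(max_tries)) {
    status <- get_status()
    if (status %in% finished) return(status)
    Sys.sleep(interval)  # wait before asking again
  }
  stop("download did not finish after ", max_tries, " checks")
}

# Fake status source for illustration: RUNNING twice, then SUCCEEDED
fake_status <- local({
  n <- 0
  function() {
    n <<- n + 1
    if (n < 3) "RUNNING" else "SUCCEEDED"
  }
})

wait_for_status(fake_status, interval = 0)
#> [1] "SUCCEEDED"
```

Newer versions of rgbif may offer a built-in way to wait for a download, so check the package documentation before rolling your own.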
Then we can download the data and import it like this:
occ_download_get(res, overwrite = TRUE) %>% occ_download_import()
#> Download file size: 0.01 MB
#>
#> # A tibble: 21 × 235
#> gbifID abstract accessRights accrualMethod accrualPeriodicity accrualPolicy alternative audience available bibliographicCitation
#> <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 921520147 NA NA NA NA NA NA NA NA NA
#> 2 921518772 NA NA NA NA NA NA NA NA NA
#> 3 921518579 NA NA NA NA NA NA NA NA NA
#> 4 921514006 NA NA NA NA NA NA NA NA NA
#> 5 921509738 NA NA NA NA NA NA NA NA NA
#> 6 921509287 NA NA NA NA NA NA NA NA NA
#> 7 921508992 NA NA NA NA NA NA NA NA NA
#> 8 921508553 NA NA NA NA NA NA NA NA NA
#> 9 921508299 NA NA NA NA NA NA NA NA NA
#> 10 921508060 NA NA NA NA NA NA NA NA NA
#> # ... with 11 more rows, and 225 more variables: conformsTo <lgl>, contributor <lgl>, coverage <lgl>, created <lgl>, creator <lgl>, date <lgl>,
#> # dateAccepted <lgl>, dateCopyrighted <lgl>, dateSubmitted <lgl>, description <lgl>, educationLevel <lgl>, extent <lgl>, format <lgl>,
#> # hasFormat <lgl>, hasPart <lgl>, hasVersion <lgl>, identifier <lgl>, instructionalMethod <lgl>, isFormatOf <lgl>, isPartOf <lgl>,
#> # isReferencedBy <lgl>, isReplacedBy <lgl>, isRequiredBy <lgl>, isVersionOf <lgl>, issued <lgl>, language <lgl>, license <chr>,
#> # mediator <lgl>, medium <lgl>, modified <lgl>, provenance <lgl>, publisher <lgl>, references <lgl>, relation <lgl>, replaces <lgl>,
#> # requires <lgl>, rights <lgl>, rightsHolder <lgl>, source <lgl>, spatial <lgl>, subject <lgl>, tableOfContents <lgl>, temporal <lgl>,
#> # title <lgl>, type <lgl>, valid <lgl>, institutionID <lgl>, collectionID <lgl>, datasetID <lgl>, institutionCode <chr>, collectionCode <chr>,
#> # datasetName <lgl>, ownerInstitutionCode <lgl>, basisOfRecord <chr>, informationWithheld <lgl>, dataGeneralizations <lgl>,
#> # dynamicProperties <lgl>, occurrenceID <lgl>, catalogNumber <int>, recordNumber <lgl>, recordedBy <int>, individualCount <lgl>,
#> # organismQuantity <lgl>, organismQuantityType <lgl>, sex <lgl>, lifeStage <lgl>, reproductiveCondition <lgl>, behavior <lgl>,
#> # establishmentMeans <lgl>, occurrenceStatus <lgl>, preparations <lgl>, disposition <lgl>, associatedReferences <lgl>,
#> # associatedSequences <lgl>, associatedTaxa <lgl>, otherCatalogNumbers <lgl>, occurrenceRemarks <lgl>, organismID <lgl>, organismName <lgl>,
#> # organismScope <lgl>, associatedOccurrences <lgl>, associatedOrganisms <lgl>, previousIdentifications <lgl>, organismRemarks <lgl>,
#> # materialSampleID <lgl>, eventID <lgl>, parentEventID <lgl>, fieldNumber <lgl>, eventDate <chr>, eventTime <lgl>, startDayOfYear <lgl>,
#> # endDayOfYear <lgl>, year <int>, month <int>, day <int>, verbatimEventDate <lgl>, habitat <lgl>, samplingProtocol <lgl>,
#> # samplingEffort <lgl>, sampleSizeValue <lgl>, ..
That’s it!
Let me know if you run into any problems.
We’ll be sending a new version of rgbif to CRAN soon with the fix mentioned above, along with other fixes.