Have just started using rgbif, and noticed that occ_count() and occ_search() return different numbers of records for a particular taxon key, but only when using georeferenced & hasCoordinate:
If I remove the hasCoordinate and georeferenced parameters then all is well, but I get nervous when the numbers don’t match up like this. Would appreciate any info or advice, thanks.
Thanks for your question. I don’t know why they differ. My guess is that GBIF uses slightly different methods on the backend for the /count and /occurrence web services when pruning records down to those that have coordinates. It’s also possible that rgbif is doing something wrong, though I looked through our code and couldn’t find anything. I’ll ask on the GBIF mailing list about this and get back to you here…
Thanks for your replies, Scott. Was wondering if you’d noticed this before, as it doesn’t seem to be the case for all species. No big deal though, I’ll just omit the parameters and get all records, then go from there.
I heard back from GBIF quickly. Here’s their response, with a few annotations:
Two things can cause this [discrepancy]:
1. Eventual consistency
The count service is an insanely high throughput service, while search is lower throughput - they have different backends, and a messaging bus keeps them in sync. Because of this there is often a short period (up to 1 hr but normally < 5 mins) where they can differ during indexing runs. Issues can creep in and they drift and occasionally we rebuild the count service. The search service is always the correct one.
2. Geospatial issues
The isGeoreferenced [parameter] only counts records with coordinates and no known geospatial issues - i.e. records we’d consider suitable for using the coordinates.
In this case it is point 2 that accounts for the difference: to match the count, the search query should also use the hasGeospatialIssue parameter.
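To make GBIF's point concrete, here is a rough sketch in base R of the two requests that should be comparable. The parameter names come from the response above; the base URL, the endpoint paths, the `limit = 0` shortcut and the helper names (`build_url`, `count_url`, `search_url`) are my own assumptions for illustration, and the count endpoint may have changed since this thread.

```r
# Base URL of the public GBIF API (assumed: v1).
gbif_base <- "https://api.gbif.org/v1"

# Build a request URL from an endpoint path and a named list of parameters.
# format(..., scientific = FALSE) keeps large numeric keys out of "1e+05" notation.
build_url <- function(path, params) {
  values <- vapply(
    params,
    function(p) format(p, scientific = FALSE, trim = TRUE),
    character(1)
  )
  paste0(gbif_base, path, "?", paste0(names(params), "=", values, collapse = "&"))
}

# Count service: isGeoreferenced already excludes records with known
# geospatial issues, so a single filter is enough here.
count_url <- function(taxon_key) {
  build_url("/occurrence/count",
            list(taxonKey = taxon_key, isGeoreferenced = "true"))
}

# Search service: hasCoordinate alone keeps records with geospatial issues,
# so hasGeospatialIssue = false is needed to match the count definition.
# limit = 0 asks for the total only, without downloading records.
search_url <- function(taxon_key) {
  build_url("/occurrence/search",
            list(taxonKey = taxon_key, hasCoordinate = "true",
                 hasGeospatialIssue = "false", limit = 0))
}

# Usage (needs network access; 12345 is a placeholder, not a real taxon key):
# key <- 12345
# as.numeric(readLines(count_url(key), warn = FALSE))  # count service
# jsonlite::fromJSON(search_url(key))$count            # search service
```

If the two numbers still differ after matching the filters, the eventual-consistency explanation (point 1) is the next thing to suspect, and the search figure is the one to trust.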
Many thanks for chasing this up @sckott, and for your detailed response. That all makes perfect sense.
So is the occ_search() parameter spatialIssues intended to correspond to the API search parameter hasGeospatialIssue? I tried using spatialIssues=FALSE but it didn’t reduce the number of records returned:
It looks as though GBIF renamed the parameter spatialIssues to hasGeospatialIssue; see GbifTerm (Darwin Core API 1.47-SNAPSHOT API). The API docs still list spatialIssues, though. I’ll see if they can update that. Sorry, sometimes I don’t hear about changes they make right away.
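For anyone landing here later, a minimal sketch of the same query through rgbif, assuming a version of the package in which occ_search() accepts hasGeospatialIssue (older versions only had spatialIssues). The helper names `clean_coord_args` and `search_count` are made up for this example.

```r
# Arguments asking the search service for records that have coordinates
# and no known geospatial issues; limit = 0 returns metadata only.
clean_coord_args <- function(taxon_key) {
  list(
    taxonKey = taxon_key,
    hasCoordinate = TRUE,
    hasGeospatialIssue = FALSE,
    limit = 0
  )
}

# Run the search via rgbif and return the total number of matching records.
# Requires the rgbif package and network access.
search_count <- function(taxon_key) {
  res <- do.call(rgbif::occ_search, clean_coord_args(taxon_key))
  res$meta$count
}

# Usage (12345 is a placeholder, not a real taxon key):
# search_count(12345)
```

If your installed version rejects hasGeospatialIssue as an unknown argument, check ?occ_search for the name it currently uses.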