Ways of quality-checking GBIF occurrence data at the country-level



Context: I’ve downloaded up to 100 geo-referenced occurrences from GBIF for each of about 80,000 plant species. I’m starting to check the quality of the data and I see there are some problems I’ll need to address.

e.g. for Anguloa eburnea GBIF has only one occurrence record with a lat,long and that puts it in the Yorkshire Dales (UK). Source: http://www.gbif.org/occurrence/search?TAXON_KEY=Anguloa+eburnea


{"collectionkind": "Sheet", "recordtype": "Specimen - single", "gbifissue": "COORDINATE_ROUNDED;GEODETIC_DATUM_ASSUMED_WGS84", "created": "2007-09-27", "gbifid": 1056251124, "centroid": true, "subdepartment": "Flowering Plants", "determinations": {"name": ["Anguloa eburnea B.S.Williams"]}, "cultivated": "True"}

I google the species and find it’s native to Colombia, Peru & Ecuador. Source: https://en.wikipedia.org/wiki/Anguloa

What scalable, reproducible strategies can I use to detect this kind of wildly erroneous occurrence?
I’m happy if it’s correct at the country-level. What other types of error should I be looking for?
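
One way to make the country-level check reproducible is to compare each record's country code against a list of countries where the species is known to be native. A minimal sketch in Python (the logic ports directly to R), assuming a hypothetical `NATIVE_RANGE` lookup that you would have to build from some external checklist, and assuming the records carry `species` and `countryCode` fields:

```python
# Hypothetical lookup table: species -> ISO 3166-1 alpha-2 codes of its native range.
# The single entry here comes from the Wikipedia page cited above.
NATIVE_RANGE = {
    "Anguloa eburnea": {"CO", "PE", "EC"},  # Colombia, Peru, Ecuador
}


def outside_native_range(record, native_range=NATIVE_RANGE):
    """True if the record's country is not in the species' native range.

    Returns None when the species is missing from the lookup, so that
    "unknown" is never silently treated as "fine".
    """
    countries = native_range.get(record["species"])
    if countries is None:
        return None
    return record["countryCode"] not in countries


records = [
    # The Yorkshire Dales record from the question.
    {"gbifID": 1056251124, "species": "Anguloa eburnea", "countryCode": "GB"},
    # Made-up record, for contrast only.
    {"gbifID": -1, "species": "Anguloa eburnea", "countryCode": "PE"},
]

flagged = [r["gbifID"] for r in records if outside_native_range(r)]
print(flagged)
```

This only flags records for review. It cannot tell a cultivated specimen from a genuine range extension or an introduced population.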

I’m aware there is probably a lot already written about this so you’re welcome to just point me to some links if the best R-specific answers already exist.

I’ve read this: https://cran.rstudio.com/web/packages/rgbif/vignettes/issues_vignette.html but I’m not sure relying on gbifissues is good enough. “COORDINATE_ROUNDED” “GEODETIC_DATUM_ASSUMED_WGS84” “COUNTRY_DERIVED_FROM_COORDINATES” aren’t even that major, are they? I’ve already removed all the ZERO_COORDINATEs (although not by using gbifissues)

In the case of the dodgy Anguloa eburnea occurrence record the only gbifissues given are: “COORDINATE_ROUNDED” & “GEODETIC_DATUM_ASSUMED_WGS84” which don’t actually in themselves indicate the actual problem (or do they?).
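
To illustrate why the issue flags alone miss this record, here is a small Python sketch that splits the semicolon-separated issue string and drops flags considered minor. Which flags count as "minor" is an assumption taken from the doubts expressed above, not a GBIF recommendation:

```python
# Assumption: these flags are treated as minor, following the doubts raised above.
MINOR_ISSUES = frozenset({
    "COORDINATE_ROUNDED",
    "GEODETIC_DATUM_ASSUMED_WGS84",
    "COUNTRY_DERIVED_FROM_COORDINATES",
})


def parse_issues(issue_string):
    """Split a semicolon-separated issue string into a set of flag names."""
    return {flag.strip() for flag in issue_string.split(";") if flag.strip()}


def serious_issues(issue_string, minor=MINOR_ISSUES):
    """Return the flags that are not on the 'minor' list."""
    return parse_issues(issue_string) - minor


# The issue string from the Anguloa eburnea record.
anguloa = "COORDINATE_ROUNDED;GEODETIC_DATUM_ASSUMED_WGS84"
print(serious_issues(anguloa))
print(serious_issues("ZERO_COORDINATE;COORDINATE_ROUNDED"))
```

The Anguloa record comes back with no serious issues, so a filter built only on these flags would keep it.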

Tracing the occurrence record back to the source (NHMUK), I see it’s from a cultivated specimen.
Is there any way I can filter out occurrences from cultivated specimens? I want wild-only

Thanks in advance,



With a bit of csv grep for the string ‘cultivated’ (full disclosure: crude method!) I estimate about 0.5% (one in two hundred) of the occurrences I have downloaded might be from cultivated specimens: roughly 15,000 out of 3,000,000 records. It would be nice if I could detect and exclude these using R.

Phew. Actually this particular problem isn’t as common as my crude grep makes it out to be. I examined it more closely and found that a grep for ""cultivated"": ""True"" returns all the records from NHMUK that are positively marked as cultivated. There are only 406 of those, not 15,000. But still, it is an example of one of the many different issues that are probably lurking within the data!
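
A slightly less crude alternative to grep is to parse the JSON blob and read the `cultivated` key directly. A minimal Python sketch, assuming the blob arrives as a string (in GBIF downloads it appears to live in the dynamicProperties column) and that the value is the string "True" as in the NHMUK record above:

```python
import json


def is_flagged_cultivated(dynamic_properties):
    """True only if the JSON blob explicitly says cultivated == "True"."""
    if not dynamic_properties:
        return False
    try:
        props = json.loads(dynamic_properties)
    except json.JSONDecodeError:
        # Not every publisher puts valid JSON in this field.
        return False
    if not isinstance(props, dict):
        return False
    return str(props.get("cultivated", "")).strip().lower() == "true"


# The blob from the Anguloa eburnea record quoted earlier.
anguloa_props = (
    '{"collectionkind": "Sheet", "recordtype": "Specimen - single", '
    '"gbifissue": "COORDINATE_ROUNDED;GEODETIC_DATUM_ASSUMED_WGS84", '
    '"created": "2007-09-27", "gbifid": 1056251124, "centroid": true, '
    '"subdepartment": "Flowering Plants", '
    '"determinations": {"name": ["Anguloa eburnea B.S.Williams"]}, '
    '"cultivated": "True"}'
)
print(is_flagged_cultivated(anguloa_props))
```

Records with a missing, empty, or unparseable blob are treated as "not flagged", which is not the same thing as "wild".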


Okay… 15,000 might serve as an upper-bound estimate, since cultivated status is not consistently recorded in NHMUK data.

e.g. http://www.gbif.org/occurrence/1265744535


{"collectionkind": "Sheet", "recordtype": "Specimen - single", "created": "2016-03-15", "centroid": false, "subdepartment": "Flowering Plants", "determinations": {"name": ["Hypericum fosteri N.Robson"]}, "cultivated": "False", "labellocality": "Cultivated: 48 Granville Road, Limpsfield, Surrey"}

A very unfortunate contradictory record that asserts cultivated : False, but also specifies it is in fact cultivated in the labellocality field.
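
Contradictions like this one can be surfaced by checking the explicit flag and the free-text locality separately. A hedged Python sketch, assuming the same JSON layout as the two NHMUK records quoted in this thread; the category names `flagged`, `text_only`, and `none` are made up for illustration:

```python
import json
import re

# Whole-word, case-insensitive match for "cultivated".
CULTIVATED_WORD = re.compile(r"\bcultivated\b", re.IGNORECASE)


def cultivation_evidence(dynamic_properties):
    """Classify a JSON blob as 'flagged', 'text_only', or 'none'."""
    props = json.loads(dynamic_properties)
    flag = str(props.get("cultivated", "")).strip().lower() == "true"
    locality = str(props.get("labellocality", ""))
    if flag:
        return "flagged"
    if CULTIVATED_WORD.search(locality):
        # Flag says no (or is absent) but the label text says otherwise.
        return "text_only"
    return "none"


# The Hypericum fosteri blob quoted above.
hypericum_props = (
    '{"collectionkind": "Sheet", "recordtype": "Specimen - single", '
    '"created": "2016-03-15", "centroid": false, '
    '"subdepartment": "Flowering Plants", '
    '"determinations": {"name": ["Hypericum fosteri N.Robson"]}, '
    '"cultivated": "False", '
    '"labellocality": "Cultivated: 48 Granville Road, Limpsfield, Surrey"}'
)
print(cultivation_evidence(hypericum_props))
```

The text match is still crude: a label reading "not cultivated" would also land in `text_only`, so those records deserve a manual look rather than automatic removal.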


That’s something I want to get sorted, e.g. https://github.com/ropensci/rgbif/issues/110, though hopefully a solution not specific to rgbif could land in https://github.com/ropenscilabs/scrubr. If you have any ideas…

Right, looking for the term cultivated is one way.

One common thread between http://api.gbif.org/v1/occurrence/1265744535 and http://api.gbif.org/v1/occurrence/1056251124 is the dataset: both are from http://api.gbif.org/v1/dataset/7e380070-f762-11e1-a439-00145eb45e9a, which perhaps can help with this problem? I don’t know whether it would help always or only sometimes.

However, we can’t search on the dynamicProperties field with the GBIF API, and that field is not always present/returned in each record - looking over a large sample now to see how common that field is …
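
Measuring how often the field is present is simple once the records are in hand. A small Python sketch, assuming each record is a dict as returned by the occurrence API; the sample records below are made up purely for illustration:

```python
def field_coverage(records, field="dynamicProperties"):
    """Fraction of records with a non-empty value for `field`."""
    if not records:
        return 0.0
    present = sum(1 for r in records if r.get(field))
    return present / len(records)


# Made-up sample: two records carry the field, two do not.
sample = [
    {"key": 1, "dynamicProperties": '{"cultivated": "True"}'},
    {"key": 2, "dynamicProperties": ""},
    {"key": 3},
    {"key": 4, "dynamicProperties": '{"cultivated": "False"}'},
]
print(field_coverage(sample))
```

Running the same function grouped by dataset would show whether the field is a per-publisher habit or scattered more widely.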

One thing I’ve been wondering, given that I’m at WikiCite right now, is whether we can get all GBIF records into Wikidata, or at least a Wikibase, so that users can edit them and submit those edits back to GBIF. I know Rod Page has probably thought a lot about how to do this already …


Interesting… I had a Twitter discussion about user-editing GBIF data with Rod not long ago. I’ll point him to this post. I would enjoy a GitHub / Wiki-style GBIF with moderated edits :slight_smile:


I’ve been told by Wikipedians that there are examples of edits happening on Wikidata and flowing back to where the canonical data is hosted elsewhere, so it seems like it’s possible. Two things at least:

  1. GBIF has to be okay with this, obviously
  2. Wikidata folks need to allow the ~650 million records - not sure if they would or not


Hmm, this seemed promising: LIVING_SPECIMEN in http://gbif.github.io/gbif-api/apidocs/org/gbif/api/vocabulary/BasisOfRecord.html#LIVING_SPECIMEN. As the description says:

An occurrence record describing a living specimen, e.g. managed animals in a zoo or cultivated plants in a garden.





Of two records for your example taxon, one, http://api.gbif.org/v1/occurrence/1056251124, is PRESERVED_SPECIMEN, while the other, http://api.gbif.org/v1/occurrence/614602797, is LIVING_SPECIMEN, but both are cultivated, right?
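
To make the limitation concrete, here is a tiny Python sketch of a basisOfRecord filter applied to those two records. Only the keys and basisOfRecord values are taken from this thread; the dict layout is an assumption:

```python
records = [
    {"key": 1056251124, "basisOfRecord": "PRESERVED_SPECIMEN"},
    {"key": 614602797, "basisOfRecord": "LIVING_SPECIMEN"},
]


def drop_living_specimens(recs):
    """Keep every record whose basisOfRecord is not LIVING_SPECIMEN."""
    return [r for r in recs if r.get("basisOfRecord") != "LIVING_SPECIMEN"]


kept = [r["key"] for r in drop_living_specimens(records)]
print(kept)
```

The preserved specimen survives the filter even though it is cultivated, so basisOfRecord can only be one signal among several.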

In this case it turns out to be the correct country, but if you did want to check if lat/long values matched the stated country in the record, I am working on cleaning tools in scrubr, e.g., scrubr::coord_within()
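
The idea of checking coordinates against the stated country can be sketched with bounding boxes. This is not how scrubr does it, just an illustration in Python; the boxes are rough approximations and the test point is a made-up location in northern England:

```python
# Hypothetical, deliberately rough boxes: (min_lon, min_lat, max_lon, max_lat).
COUNTRY_BBOX = {
    "GB": (-8.7, 49.8, 1.8, 60.9),
    "CO": (-79.1, -4.3, -66.8, 12.5),
}


def coord_in_country_bbox(lat, lon, country_code, bboxes=COUNTRY_BBOX):
    """True/False if the point is inside the country's box, None if no box is known."""
    bbox = bboxes.get(country_code)
    if bbox is None:
        return None
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


# Made-up point in northern England, checked against two stated countries.
print(coord_in_country_bbox(54.2, -2.2, "GB"))
print(coord_in_country_bbox(54.2, -2.2, "CO"))
```

A bounding box will wrongly accept points in neighbouring countries or at sea, so a real check would test against country polygons instead.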


Cultivated in what sense?

Genetically, particularly with the Jardin Botanique de Montréal provenance record, it could be a wild accession for all we know. I’m not entirely clear myself on the distinction between cultivated and wild accessions. Does a wild plant transplanted from wild habitat become a cultivated plant the minute it gets translocated to growing in a botanic garden?

This is particularly thorny for orchid species: records could well be either. Without definitive assertions can we tell?


I think there are two issues with Wikidata, scale and scope. I doubt it will handle 650 million records without any problems, that’s much, much bigger than Wikidata itself. I’ve never been completely clear about the scope, but I think the goal of Wikidata is to have structured data for Wikipedia articles (and related content). I suspect that Wikidata itself isn’t totally sure what the scope is (e.g., in the context of WikiCite, should Wikidata have all citations on Wikipedia pages, or ALL citations, ever?).

I think editing GBIF is a bigger discussion as there are a bunch of issues, and a number of different ways to make the data “fixable”.


Right, I haven’t talked to Wikipedians here, but it does seem like Wikidata’s main purpose is to provide data to Wikipedia and related sites. I am guessing it wouldn’t be a good fit, especially given the huge number of GBIF records.


I don’t know :smile: maybe @rdmpage has some knowledge of what they mean by cultivated?