Ways of quality-checking GBIF occurrence data at the country-level

ross_mounce · May 25, 2016, 1:43pm

Context: I’ve downloaded up to 100 geo-referenced occurrences from GBIF for each of about 80,000 plant species. I’m starting to check the quality of the data and I see there are some problems I’ll need to address.

e.g. for Anguloa eburnea GBIF has only one occurrence record with a lat,long and that puts it in the Yorkshire Dales (UK). Source: http://www.gbif.org/occurrence/search?TAXON_KEY=Anguloa+eburnea

DYNAMIC PROPERTIES

{"collectionkind": "Sheet", "recordtype": "Specimen - single", "gbifissue": "COORDINATE_ROUNDED;GEODETIC_DATUM_ASSUMED_WGS84", "created": "2007-09-27", "gbifid": 1056251124, "centroid": true, "subdepartment": "Flowering Plants", "determinations": {"name": ["Anguloa eburnea B.S.Williams"]}, "cultivated": "True"}

I google the species and find it’s native to Colombia, Peru & Ecuador. Source: Anguloa - Wikipedia

What scalable, reproducible strategies can I use to detect this kind of wildly erroneous occurrence?
I’m happy if it’s correct at the country-level. What other types of error should I be looking for?

I’m aware there is probably a lot already written about this so you’re welcome to just point me to some links if the best R-specific answers already exist.

I’ve read this: https://cran.rstudio.com/web/packages/rgbif/vignettes/issues_vignette.html but I’m not sure relying on gbifissues is good enough. “COORDINATE_ROUNDED” “GEODETIC_DATUM_ASSUMED_WGS84” “COUNTRY_DERIVED_FROM_COORDINATES” aren’t even that major, are they? I’ve already removed all the ZERO_COORDINATEs (although not by using gbifissues)

In the case of the dodgy Anguloa eburnea occurrence record the only gbifissues given are: “COORDINATE_ROUNDED” & “GEODETIC_DATUM_ASSUMED_WGS84” which don’t actually in themselves indicate the actual problem (or do they?).
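To make the point about gbifissues concrete, here is a minimal sketch (in Python rather than R, purely to show the logic) that splits the semicolon-separated issue string and tests it against a set of "serious" codes. The issue codes are real GBIF issue names, but which of them count as serious is an assumption made for illustration, and the function names are hypothetical.

```python
# Which issue codes count as "serious" is an assumption made for this sketch.
SERIOUS_ISSUES = {
    "ZERO_COORDINATE",
    "COORDINATE_OUT_OF_RANGE",
    "COORDINATE_INVALID",
    "COUNTRY_COORDINATE_MISMATCH",
}


def split_issues(gbifissue):
    """Split a semicolon-separated gbifissue string into a set of codes."""
    if not gbifissue:
        return set()
    return {code.strip() for code in gbifissue.split(";") if code.strip()}


def has_serious_issue(gbifissue, serious=SERIOUS_ISSUES):
    """True if any of the record's issue codes is in the 'serious' set."""
    return bool(split_issues(gbifissue) & serious)


# The issue string attached to the Anguloa eburnea record quoted above.
anguloa = "COORDINATE_ROUNDED;GEODETIC_DATUM_ASSUMED_WGS84"
print(has_serious_issue(anguloa))
```

Under these assumptions the Anguloa eburnea record passes the filter, which suggests that issue flags alone would not catch it.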

Tracing the occurrence record back to the source (NHMUK), I see it’s from a cultivated specimen.
http://data.nhm.ac.uk/dataset/collection-specimens/resource/05ff2255-c38a-40c9-b657-4ccb55ab2feb/record/2420086
Is there any way I can filter out occurrences from cultivated specimens? I want wild-only

Thanks in advance,

Ross

ross_mounce · May 25, 2016, 2:32pm

With a bit of csv grep for the string ‘cultivated’ (full disclosure: crude method!) I estimate about 0.5% (one in two hundred) of the occurrences I have downloaded might be from cultivated specimens: roughly 15,000 out of 3,000,000 records. It would be nice if I could exclude/detect these using R.

Phew. Actually this particular problem isn’t as common as my crude grep makes it out to be. I examined it more closely and determined a grep for ""cultivated"": ""True"" would return all the positively cultivated records from NHMUK. There’s only 406 of those. Not 15,000. But still, an example of one of the many different issues that are probably lurking within the data!
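A stricter alternative to grepping the raw CSV is to parse the dynamic properties as JSON and test the flag itself. A minimal sketch in Python (not R, just to show the logic), assuming the field arrives as a JSON string like the NHMUK example above; the function name is hypothetical.

```python
import json


def is_flagged_cultivated(dynamic_properties):
    """Return True only if the JSON carries "cultivated": "True"."""
    if not dynamic_properties:
        return False
    try:
        props = json.loads(dynamic_properties)
    except (TypeError, ValueError):
        # Not valid JSON (or not a string at all): treat as "no flag".
        return False
    if not isinstance(props, dict):
        return False
    return str(props.get("cultivated", "")).strip().lower() == "true"


# Dynamic properties of the Anguloa eburnea record, as quoted earlier.
record = (
    '{"collectionkind": "Sheet", "recordtype": "Specimen - single", '
    '"gbifissue": "COORDINATE_ROUNDED;GEODETIC_DATUM_ASSUMED_WGS84", '
    '"created": "2007-09-27", "gbifid": 1056251124, "centroid": true, '
    '"subdepartment": "Flowering Plants", '
    '"determinations": {"name": ["Anguloa eburnea B.S.Williams"]}, '
    '"cultivated": "True"}'
)
print(is_flagged_cultivated(record))
```

This only finds records where the flag is set explicitly, so it gives a lower bound rather than a complete list.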

ross_mounce · May 25, 2016, 3:30pm

Okay… maybe 15,000 might serve as an upper bound estimate. Cultivated is not consistently recorded in NHMUK data

e.g. http://www.gbif.org/occurrence/1265744535

DYNAMIC PROPERTIES

{"collectionkind": "Sheet", "recordtype": "Specimen - single", "created": "2016-03-15", "centroid": false, "subdepartment": "Flowering Plants", "determinations": {"name": ["Hypericum fosteri N.Robson"]}, "cultivated": "False", "labellocality": "Cultivated: 48 Granville Road, Limpsfield, Surrey"}

A very unfortunate contradictory record that asserts cultivated : False, but also specifies it is in fact cultivated in the labellocality field.
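One way to catch contradictions like this is to compare the explicit flag with a free-text search over the other string fields. A minimal sketch in Python (not R, just to show the logic); the function name and the returned keys are hypothetical, and a plain word match would also fire on phrases such as "not cultivated", so hits would still need checking by eye.

```python
import json
import re

CULTIVATED_WORD = re.compile(r"\bcultivated\b", re.IGNORECASE)


def cultivation_evidence(dynamic_properties):
    """Compare the explicit 'cultivated' flag with free-text mentions."""
    props = json.loads(dynamic_properties)
    flag = str(props.get("cultivated", "")).strip().lower() == "true"
    # Search every other top-level string value for the word "cultivated".
    in_text = any(
        CULTIVATED_WORD.search(value)
        for key, value in props.items()
        if key != "cultivated" and isinstance(value, str)
    )
    return {
        "flag": flag,
        "free_text": in_text,
        "contradictory": in_text and not flag,
    }


# Dynamic properties of the Hypericum fosteri record quoted above.
hypericum = (
    '{"collectionkind": "Sheet", "recordtype": "Specimen - single", '
    '"created": "2016-03-15", "centroid": false, '
    '"subdepartment": "Flowering Plants", '
    '"determinations": {"name": ["Hypericum fosteri N.Robson"]}, '
    '"cultivated": "False", '
    '"labellocality": "Cultivated: 48 Granville Road, Limpsfield, Surrey"}'
)
print(cultivation_evidence(hypericum))
```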

sckott · May 26, 2016, 7:10am

That’s something I want to get sorted, e.g. “Possible to remove occurrences from botanical gardens?” (Issue #110, ropensci/rgbif on GitHub), though hopefully a solution not specific to rgbif could land in scrubr (ropensci-archive/scrubr on GitHub, now archived: “Clean species occurrence records”), if you have any ideas…

Right, looking for the term cultivated is one way.

One common thread between http://api.gbif.org/v1/occurrence/1265744535 and http://api.gbif.org/v1/occurrence/1056251124 is the dataset, both of which are from http://api.gbif.org/v1/dataset/7e380070-f762-11e1-a439-00145eb45e9a - which perhaps can help with this problem? I don’t know if it always or sometimes would help.

However, we can’t search on the dynamicProperties field with the GBIF API, and that field is not always present/returned in each record - looking over a large sample now to see how common that field is …

One thing I’ve been wondering, given that I’m at WikiCite right now, is whether we can get all GBIF records on WikiData, or at least a WikiBase, so that users can edit them and submit those edits back to GBIF - I know Rod Page has probably thought about how to do this a lot already …

rmounce · May 26, 2016, 7:19am

Interesting… I had a Twitter discussion about user-editing GBIF data with Rod not long ago. I’ll point him to this post. I would enjoy a GitHub / Wiki-style GBIF with moderated edits

sckott · May 26, 2016, 7:33am

I’ve been told by wikipedians that there are examples of edits happening on wikidata, and those flowing back to where the canonical data is hosted elsewhere - so it seems like it’s possible. Two things at least:

GBIF has to be okay with this, obviously
Wikidata folks need to allow the ~650 million records - not sure if they would or not

sckott · May 26, 2016, 7:50am

Hmm, the BasisOfRecord value LIVING_SPECIMEN (documented in the GBIF Common API 1.12.11 docs) seemed promising. As its description says:

An occurrence record describing a living specimen, e.g. managed animals in a zoo or cultivated plants in a garden.

but

http://api.gbif.org/v1/occurrence/search?basisOfRecord=LIVING_SPECIMEN&taxonKey=2812559

and

http://api.gbif.org/v1/occurrence/search?basisOfRecord=PRESERVED_SPECIMEN&taxonKey=2812559

Of two records for your example taxon, one (http://api.gbif.org/v1/occurrence/1056251124) is PRESERVED_SPECIMEN, while the other (http://api.gbif.org/v1/occurrence/614602797) is LIVING_SPECIMEN, but both are cultivated, right?
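For anyone wanting to repeat this comparison for other taxa, here is a minimal sketch in Python (not R, just to show the logic) that builds the two search URLs used above and drops LIVING_SPECIMEN records on the client side. The function names are hypothetical, and the sample records are trimmed to the one field used. As the two records above suggest, this filter would miss cultivated plants that were later preserved as specimens.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://api.gbif.org/v1/occurrence/search"


def occurrence_search_url(taxon_key, basis_of_record):
    """Build an occurrence search URL filtered by basisOfRecord."""
    query = urlencode({"basisOfRecord": basis_of_record, "taxonKey": taxon_key})
    return f"{SEARCH_URL}?{query}"


def drop_living_specimens(records):
    """Keep records whose basisOfRecord is anything but LIVING_SPECIMEN."""
    return [r for r in records if r.get("basisOfRecord") != "LIVING_SPECIMEN"]


print(occurrence_search_url(2812559, "LIVING_SPECIMEN"))
print(occurrence_search_url(2812559, "PRESERVED_SPECIMEN"))

# Trimmed stand-ins for the two records discussed above.
records = [
    {"key": 1056251124, "basisOfRecord": "PRESERVED_SPECIMEN"},
    {"key": 614602797, "basisOfRecord": "LIVING_SPECIMEN"},
]
print(drop_living_specimens(records))
```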

In this case it turns out to be the correct country, but if you did want to check if lat/long values matched the stated country in the record, I am working on cleaning tools in scrubr, e.g., scrubr::coord_within()
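The general idea of checking coordinates against a country can be sketched with a crude bounding-box test. This is not how scrubr::coord_within() is implemented; it is a minimal Python illustration. The bounding boxes are rough approximations assumed for this sketch, the function name is hypothetical, and the test point is an assumed location in the Yorkshire Dales, not the record’s actual coordinates. A real check would use country polygons.

```python
# Rough, illustrative bounding boxes: (min_lon, min_lat, max_lon, max_lat).
# These values are approximations assumed for this sketch, not real borders.
ROUGH_BBOX = {
    "GB": (-8.7, 49.8, 1.8, 60.9),
    "CO": (-79.1, -4.3, -66.8, 13.5),
}


def coord_in_country_bbox(lat, lon, country_code, boxes=ROUGH_BBOX):
    """True/False if the point is inside the country's box, None if unknown."""
    box = boxes.get(country_code)
    if box is None:
        return None
    min_lon, min_lat, max_lon, max_lat = box
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


# Assumed point somewhere in the Yorkshire Dales (illustrative only).
lat, lon = 54.2, -2.1
print(coord_in_country_bbox(lat, lon, "GB"))  # stated country of the record
print(coord_in_country_bbox(lat, lon, "CO"))  # one of the native-range countries
```

The same function could be pointed at a species’ native-range countries instead of the stated country, which is closer to the original question about records that are wildly out of range.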

rmounce · May 26, 2016, 8:06am

Cultivated in what sense?

Genetically, particularly with the Jardin Botanique de Montréal provenance record, it could be a wild accession for all we know. I’m not entirely clear myself on the distinction between cultivated and wild accessions. Does a wild plant transplanted from wild habitat become a cultivated plant the minute it gets translocated to growing in a botanic garden?

This is particularly thorny for orchid species: records could well be either. Without definitive assertions can we tell?

rdmpage · May 26, 2016, 8:19am

I think there are two issues with Wikidata, scale and scope. I doubt it will handle 650 million records without any problems, that’s much, much bigger than Wikidata itself. I’ve never been completely clear about the scope, but I think the goal of Wikidata is to have structured data for Wikipedia articles (and related content). I suspect that Wikidata itself isn’t totally sure what the scope is (e.g., in the context of WikiCite, should Wikidata have all citations on Wikipedia pages, or ALL citations, ever?).

I think editing GBIF is a bigger discussion as there are a bunch of issues, and a number of different ways to make the data “fixable”.

sckott · May 26, 2016, 12:19pm

Right, I haven’t talked to wikipedians here, but it does seem like Wikidata’s main purpose is to provide data to Wikipedia and related sites’ pages - I am guessing it wouldn’t be a good fit, especially given the huge number of GBIF data records

sckott · May 26, 2016, 12:21pm

I don’t know; maybe @rdmpage has some knowledge of what they mean by cultivated?
