Context: I’ve downloaded up to 100 geo-referenced occurrences from GBIF for each of roughly 80,000 plant species. I’m starting to check the quality of the data, and I can see some problems I’ll need to address.
For example, for Anguloa eburnea GBIF has only one occurrence record with coordinates, and it places the species in the Yorkshire Dales (UK). Source: http://www.gbif.org/occurrence/search?TAXON_KEY=Anguloa+eburnea
DYNAMIC PROPERTIES
{"collectionkind": "Sheet", "recordtype": "Specimen - single", "gbifissue": "COORDINATE_ROUNDED;GEODETIC_DATUM_ASSUMED_WGS84", "created": "2007-09-27", "gbifid": 1056251124, "centroid": true, "subdepartment": "Flowering Plants", "determinations": {"name": ["Anguloa eburnea B.S.Williams"]}, "cultivated": "True"}
I googled the species and found that it’s native to Colombia, Peru and Ecuador. Source: Anguloa - Wikipedia
What scalable, reproducible strategies can I use to detect this kind of wildly erroneous occurrence?
I’m happy if it’s correct at the country-level. What other types of error should I be looking for?
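To make "correct at the country level" concrete, here is a minimal sketch of the check I have in mind. It is written in Python only to show the logic. The `NATIVE_COUNTRIES` table and the function name are hypothetical: I don't yet have a native-range list for all ~80,000 species, and the single entry below just encodes the three countries given on Wikipedia as ISO 3166-1 alpha-2 codes.

```python
# Hypothetical lookup table: species name -> set of ISO 3166-1 alpha-2 codes
# for the countries where the species is believed to be native.
NATIVE_COUNTRIES = {
    "Anguloa eburnea": {"CO", "PE", "EC"},  # Colombia, Peru, Ecuador
}


def outside_native_range(species, country_code, native=NATIVE_COUNTRIES):
    """Return True if the record's country is not in the species' native list.

    Returns None when the species has no entry in the table, so that
    "unknown" is not silently treated as "fine" or as "wrong".
    """
    known = native.get(species)
    if known is None:
        return None
    return country_code.upper() not in known


# The Yorkshire Dales record has country code GB, so it gets flagged.
print(outside_native_range("Anguloa eburnea", "GB"))
```

The hard part is obviously where a table like this would come from at scale, which is part of what I'm asking.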
I’m aware there is probably a lot already written about this so you’re welcome to just point me to some links if the best R-specific answers already exist.
I’ve read this: https://cran.rstudio.com/web/packages/rgbif/vignettes/issues_vignette.html but I’m not sure that relying on gbifissues is good enough. “COORDINATE_ROUNDED”, “GEODETIC_DATUM_ASSUMED_WGS84” and “COUNTRY_DERIVED_FROM_COORDINATES” aren’t even that major, are they? I’ve already removed all the ZERO_COORDINATE records (although not by using gbifissues).
In the case of the dodgy Anguloa eburnea occurrence record, the only gbifissues given are “COORDINATE_ROUNDED” and “GEODETIC_DATUM_ASSUMED_WGS84”, neither of which in itself points to the real problem (or does it?).
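To illustrate why I doubt an issue-based filter is enough, here is a minimal sketch (Python, logic only) that splits the semicolon-separated gbifissue string and drops the flags I would consider minor. The `MINOR_ISSUES` set is my own assumption, not an official GBIF classification, and the function names are made up for this example.

```python
# Flags I am assuming are too minor to justify discarding a record.
# This grouping is my own guess, not a GBIF-defined severity level.
MINOR_ISSUES = {
    "COORDINATE_ROUNDED",
    "GEODETIC_DATUM_ASSUMED_WGS84",
    "COUNTRY_DERIVED_FROM_COORDINATES",
}


def split_issues(gbifissue):
    """Split a 'FLAG_A;FLAG_B' string into a list of flags (empty if missing)."""
    return [flag.strip() for flag in (gbifissue or "").split(";") if flag.strip()]


def serious_issues(gbifissue):
    """Return only the flags that are not in MINOR_ISSUES."""
    return [flag for flag in split_issues(gbifissue) if flag not in MINOR_ISSUES]


# The Anguloa eburnea record carries only these two flags.
anguloa_flags = "COORDINATE_ROUNDED;GEODETIC_DATUM_ASSUMED_WGS84"
print(serious_issues(anguloa_flags))  # prints []
```

With this grouping the Yorkshire Dales record has no serious flags left, so it would pass straight through the filter.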
Tracing the occurrence record back to the source (NHMUK), I see it’s from a cultivated specimen.
http://data.nhm.ac.uk/dataset/collection-specimens/resource/05ff2255-c38a-40c9-b657-4ccb55ab2feb/record/2420086
Is there any way I can filter out occurrences from cultivated specimens? I want wild-only records.
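For this one record, the dynamic properties JSON shown above does contain "cultivated": "True", so one option might be to parse that field. Below is a minimal sketch (Python, logic only), assuming the field arrives as a JSON string. The function name is hypothetical, and I have no idea whether other publishers use the same key, or any key at all, so this may only work for NHMUK records.

```python
import json


def is_cultivated(dynamic_properties):
    """Return True if the dynamic properties JSON marks the record as cultivated.

    Returns False when the field is missing, is not valid JSON, or has no
    'cultivated' key. False therefore means "no evidence", not "wild".
    """
    try:
        props = json.loads(dynamic_properties)
    except (TypeError, ValueError):
        return False
    if not isinstance(props, dict):
        return False
    # The NHMUK record stores the flag as the string "True", so compare
    # case-insensitively and accept a real JSON boolean as well.
    return str(props.get("cultivated")).strip().lower() == "true"


# Abbreviated copy of the dynamic properties from the Anguloa eburnea record.
record = '{"recordtype": "Specimen - single", "centroid": true, "cultivated": "True"}'
print(is_cultivated(record))  # prints True
```

This catches the Anguloa eburnea record, but I'd still like to know whether there is a more general way to do it.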
Thanks in advance,
Ross