Some thoughts on working with rgbif occurrence data, including mapping

Tags: #<Tag:0x00007fbc5852e8d0> #<Tag:0x00007fbc5852e6a0> #<Tag:0x00007fbc5852e4e8>


An rgbif user wrote with some questions about using the package. The following are the questions:


The user had a genus Myosotis that they were working with and ideally wanted to just get occurrences of the genus across all it’s species. They ended up getting GBIF taxon keys for each species, but wondered if there was a way to get occurrence data without havint to first get GIBF taxon keys for each taxon.

There’s a couple of different ways to go.

First, you can use the higher taxonomic name (e.g., genus) to fetch data rather than fetching data for each lower taxonomic name. Caveat being that I am not positive you’ll get the same results. And using the higher taxonomic name only works if you’re sure you want data for all the lower taxonomic names (e.g,. all the species in a genus). So with this approach you can do something like

# get the key for Myosotis
key <- name_backbone('Myosotis')$usageKey
# let's see how many occurrences GBIF has for that key
occ_count(taxonKey = key)
#> [1] 694310
# search for ocurrence data
(x <- occ_data(taxonKey = key, limit = 3))
#> Records found [694310]
#> Records returned [3]
#> Args [limit=3, offset=0, taxonKey=2925668]
#> # A tibble: 3 x 64
#>    name          key decimalLatitude decimalLongitude issues   datasetKey
#>    <chr>       <int>           <dbl>            <dbl> <chr>    <chr>
#>  1 Myosotis…  1.81e9           -46.5            170.  cdround… 50c9509d-22c7-4…
#>  2 Myosotis…  1.81e9            38.0           -123.  cdround… 50c9509d-22c7-4…
#>  3 Myosotis…  1.81e9           -44.9            170.  cdround… 50c9509d-22c7-4…
#> # ... with 20 more rows, and 58 more variables: publishingOrgKey <chr>,
#> #   publishingCountry <chr>, protocol <chr>, lastCrawled <chr>,
#> #  cutoff for brevity
# all names are species within the genus (don't know if these are all good names or not)
#> [1] "Myosotis rakiura"           "Myosotis latifolia"  "Myosotis australis"

Second, you can get lower taxonomic names using functions in the rgbif package e.g.,

res <- name_lookup(higherTaxonKey=2925668, limit = 300, status = "accepted")
#> # A tibble: 250 x 36
#>        key scientificName  datasetKey  constituentKey  parentKey parent kingdom
#>      <int> <chr>           <chr>       <chr>               <int> <chr>  <chr>
#>  1 7930390 Myosotis rehst… d7dddbf4-2… 7ddf754f-d193-…   2925668 Myoso… Plantae
#>  2 5660749 Myosotis saxos… d7dddbf4-2… 7ddf754f-d193-…   2925668 Myoso… Plantae
#>  3 5660801 Myosotis pusil… d7dddbf4-2… 7ddf754f-d193-…   2925668 Myoso… Plantae
#>  4 5661619 Myosotis ambig… d7dddbf4-2… 7ddf754f-d193-…   2925668 Myoso… Plantae
#>  5 8075125 Myosotis ergak… d7dddbf4-2… 7ddf754f-d193-…   2925668 Myoso… Plantae
#>  6 8166715 Myosotis macra… d7dddbf4-2… 7ddf754f-d193-…   2925668 Myoso… Plantae
#>  7 7626464 Myosotis spelu… d7dddbf4-2… 7ddf754f-d193-…   2925668 Myoso… Plantae
#>  8 5341177 Myosotis disco… d7dddbf4-2… 7ddf754f-d193-…   2925668 Myoso… Plantae
#>  9 5660980 Myosotis marit… d7dddbf4-2… 7ddf754f-d193-…   2925668 Myoso… Plantae
#> 10 5661200 Myosotis heter… d7dddbf4-2… 7ddf754f-d193-…   2925668 Myoso… Plantae
#> # ... with 240 more rows, and 29 more variables: phylum <chr>, order <chr>,
#> #   family <chr>, genus <chr>, species <chr>, kingdomKey <int>,
#> #   phylumKey <int>, classKey <int>, orderKey <int>, familyKey <int>,
#> #   genusKey <int>, speciesKey <int>, canonicalName <chr>, authorship <chr>,
#> #   publishedIn <chr>, nameType <chr>, taxonomicStatus <chr>, rank <chr>,
#> #   origin <chr>, numDescendants <int>, numOccurrences <int>, habitats <chr>,
#> #   nomenclaturalStatus <lgl>, threatStatuses <chr>, synonym <lgl>,
#> #   class <chr>, nubKey <int>, basionymKey <int>, basionym <chr>

and then use those key’s to get occurence data for those specific taxa

Third, you can use taxize package which has some tools to help you get taxonomic names specifically, e.g.,

# upgrade from github cause there's a bug fix i just pushed 
spp <- downstream("Myosotis", db = "gbif", downto = "species", limit = 300)
#>                     name    rank     key     name_type
#> 1    Myosotis abyssinica species 5660280 canonicalname
#> 2       Myosotis acutata species 5660278 canonicalname
#> 3 Myosotis afropalustris species 5661736 canonicalname
#> 4       Myosotis agnetis species 5661737 canonicalname
#> 5 Myosotis albomarginata species 5661641 canonicalname
#> 6   Myosotis albosericea species 5661639 canonicalname

Mapping gbif data

The user asked about usage of rgbif::gbifmap(), and why so many points are removed during the mapping process. Most plotting stuff is beyond my knowledge since I don’t do much of it, so there is that caveat.

There is a function rgbif::gbifmap() that helps make a simple map with the output of both rgbif::occ_search() or rgbif::occ_data(). It does make a plot but I strongly recommend using other tools.

rgbif::gbifmap() makes a point plot, and before plotting removes impossible locations and missing lat/long data. It plots points with different colors for each taxon, so if you have more than maybe 9 or 10 taxa it becomes kinda silly to have a different color for each taxon because it’s hard to discern colors when there’s so many. Plus the color scale we use scale_color_brewer has a max of 9 different colors.

splist <- c('Myosotis scorpioides','Myosotis arvensis','Myosotis laxa',
    'Myosotis ramosissima','Myosotis sylvatica','Myosotis discolor',
    'Myosotis stricta','Myosotis alpestris','Myosotis secunda',
    'Myosotis nemorosa','Myosotis ramosissima','Myosotis decumbens',
    'Myosotis australis','Myosotis lithuanica','Myosotis sparsiflora',
    'Myosotis verna','Myosotis lamottiana','Myosotis baltica',
    'Myosotis sicula','Myosotis latifolia','Myosotis stolonifera',
    'Myosotis forsteri','Myosotis macrosperma','Myosotis pygmaea',
    'Myosotis asiatica','Myosotis incrassata','Myosotis drucei',
    'Myosotis welwitschii','Myosotis debilis','Myosotis pusilla')
keys <- vapply(splist, function(x) name_suggest(x)$key[1], 
    numeric(1), USE.NAMES=FALSE)
dat2 <- occ_data(keys)
dd <- data.table::rbindlist(lapply(dat2, function(z) z$data), 
    fill = TRUE, use.names = TRUE)

gives some warnings

Rendering map...plotting 9187 points
Warning messages:
1: In RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Set1 is 9
Returning the palette you asked for with that many colors

2: Removed 5916 rows containing missing values (geom_point).

So geom_point is removing some points, not sure what is going on there, if anyone can suggest a fix? Although perhaps someday we should remove gbifmap() as mapr is a better thing for mapping

Using the mapr package we can make this work better:


there’s also options for in mapr for leaflet, github gists, and base plots


Thanks for this. I still don’t understand why only a small subset of the records are mapping in the above scripts and would appreciate any input on that. So I tried a different way. I found this website extremely helpful:

Here is a script that worked for me to get a basic map (1 colour only, and up to a sample of 75,000 records)

# First download the data directly as a .csv file from and store it in your working directly
myosotis_gbif <- readr::read_delim("0023599-180131172636756.csv", delim = "\t", escape_double = FALSE, col_names = TRUE, na = c("", "NA"))
#[1] 694310     44

#creating an occurence table
myosotis_occurrence <- myosotis_gbif %>% filter(taxonrank == "SPECIES")
#[1] 606079     44
myosotis_occurrence <- myosotis_occurrence %>% drop_na(decimallatitude) %>% drop_na(decimallongitude)
#[1] 566540     44
myosotis_occurrence <- dplyr::rename(myosotis_occurrence, latitude = decimallatitude, longitude = decimallongitude)

#make a map
#This works, up to 75,000 records
myos_75000 <- sample_n(myosotis_occurrence, 75000)
map <- leaflet(myos_75000) %>% addTiles() %>% addCircleMarkers(~longitude, ~latitude, popup = myos_75000$species, radius = 1, weight = 2, opacity = 0.5, fill = TRUE, fillOpacity = 0.2, ) 


R crashed on my computer when I tried the above with 100,000 records so I didn’t try any more than that.

Another great idea on the same website was to show the data a slightly different way, i.e. cluster the records on the map by area and show the number of records per area:

#cluster markers together regionally, especially good when using large numbers of records
#remove unnecessary columns
myos_slim <- select(myos_75000, gbifid, kingdom, species, latitude, longitude, locality)
#A tibble: 6 x 6
#gbifid kingdom species latitude lontigude locality
#1 1341005999 Plantae Myosotis ramosissima     50.3     10.6  NA                             
#2 1425296791 Plantae Myosotis arvensis        59.7     10.9  Oppsand, vestre gården, Ski, Ak
#3 1280144258 Plantae Myosotis arvensis        51.4      4.27 NA                             
#4 1279455450 Plantae Myosotis scorpioides     51.9      4.73 NA                             
#5 1515594405 Plantae Myosotis laxa            57.7     -2.59 NA                             
#6 1513384263 Plantae Myosotis arvensis        56.4     -5.48 NA   
#[1] 75000     6
map <- leaflet(myos_slim) %>% addTiles()  %>% addCircleMarkers(~longitude, ~latitude, popup = myos_slim$species, radius = 1, weight = 2, opacity = 0.5, fill = TRUE, 
                                                               fillOpacity = 0.2, clusterOptions = markerClusterOptions())

I’ve managed to do what I needed for the present moment, but I’ll come back to this in future and hope others can add advice to this thread. Thanks in advance!


Thanks @heidimeudt for the two posts on this. For large amounts of data I like your approach to cluster data with leaflet - and that way you can zoom in to see details as well. I’d be interested if you think there’s anything we can do in the mapr package to facilitate visualizing GBIF data

We do have a pull request into the rgbif package to interact with the GIBF maps API so that you should be able to get a very quick map of lots and lots of data. That’s something to look out for as well.


Cool, thanks @sckott. Just for clarification it wasn’t really my approach, I just followed :slight_smile:

Getting a “very quick map of lots and lots of data” would be awesome! Please let me know when this is possible, as making my basic map took a lot more time than I thought it would. :wink:

One additional thing that would be nice to do with visualizing GBIF data would be to be able to produce something like the Myos_slim_75000.jpg figure above, but instead of showing numbers of individuals, it would show numbers of species per area. You can do something similar with the package ‘monographaR’ (Reginato 2016) to create a diversity heat map (mapDiversity). This shows number of species per area as shaded grid cells.


:wave: right, will try to ping you once we have a solution using gbif’s tile maps

That’s interesting about the species per area heat map. I’ve opened an issue in mapr to think this over