Feedback on pkg for connecting biodiversity data and literature

As part of a collaboration between the California Academy of Sciences and rOpenSci, we’ve been working on a package to help researchers find relevant information from the literature on their species of interest.

The package is spplit - on GitHub at https://github.com/ropenscilabs/spplit

We want to help connect users to many sources of literature. However, right now we just have connectors for the Biodiversity Heritage Library. We’ll add more in the future, e.g., Wikipedia, journal articles, etc.

If you have any interest in this, please give it a try and let us know what you think. We’d also like to hear about your use cases for connecting the species you study to the literature. We have a tag for use cases in the GitHub repo, so you can see which use cases we are thinking about.

Thanks!

Installation

devtools::install_github(c("ropensci/rgbif", "ropensci/spocc"))
devtools::install_github("ropenscilabs/spplit")
library("spplit")

Example usage

First, get a BHL API key from http://www.biodiversitylibrary.org/getapikey.aspx - Once you have the key, store it as an environment variable, either in your .Renviron file or in another shell environment file, or set it as an R option in your .Rprofile file. spplit will pick up the key from there. Alternatively, you can pass your BHL key directly in calls to the BHL functions in spplit.
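As a rough sketch of the two ways to store the key: the names BHL_KEY (environment variable) and bhl_key (R option) below are assumptions based on the convention used by the rbhl package, and "your-key-here" is a placeholder. Please check the spplit documentation for the names it actually reads.

```r
# Assumption: the key is read from the BHL_KEY environment variable
# or the bhl_key R option (rbhl convention; not confirmed for spplit).

# Option 1: environment variable. To make it permanent, put the line
#   BHL_KEY=your-key-here
# in your .Renviron file. For the current session only:
Sys.setenv(BHL_KEY = "your-key-here")

# Option 2: R option. To make it permanent, put this line in .Rprofile:
options(bhl_key = "your-key-here")

# Check that R can see the key before calling any BHL functions
nzchar(Sys.getenv("BHL_KEY"))
!is.null(getOption("bhl_key"))
```

Values set in .Renviron or .Rprofile only take effect after restarting R.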

Search for occurrences

geom <- 'POLYGON((-124.07 41.48,-119.99 41.48,-119.99 35.57,-124.07 35.57,-124.07 41.48))'
x <- sp_occ_idigbio(geometry = geom, limit = 3)
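The geometry above is a WKT polygon describing a bounding box, with the first corner repeated at the end to close the ring. If you would rather start from the box limits, a small helper along these lines can build the string. Note that bbox_to_wkt is a hypothetical function written for this example, not part of spplit.

```r
# Hypothetical helper (not part of spplit): build a closed WKT polygon
# from bounding box limits given as longitude/latitude
bbox_to_wkt <- function(xmin, ymin, xmax, ymax) {
  sprintf(
    "POLYGON((%s %s,%s %s,%s %s,%s %s,%s %s))",
    xmin, ymax,  # upper left
    xmax, ymax,  # upper right
    xmax, ymin,  # lower right
    xmin, ymin,  # lower left
    xmin, ymax   # back to the first corner to close the ring
  )
}

geom <- bbox_to_wkt(xmin = -124.07, ymin = 35.57, xmax = -119.99, ymax = 41.48)
geom
```

This gives the same polygon string as the one written out by hand above.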

Get species list to search against the Biodiversity Heritage Library (BHL)

x <- sp_list(x)

Search the BHL, which returns metadata

x <- sp_bhl_meta(x)[1:3]

Pass metadata to get OCR’ed pages

res <- sp_bhl_ocr(x)

Save text to disk (or any database, etc.)

sp_bhl_save(res)

Mine the text

library("tm")
src <- VectorSource(unlist(res, use.names = FALSE))
corp <- VCorpus(src)
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, removePunctuation)
tdm <- TermDocumentMatrix(corp)
findFreqTerms(tdm, lowfreq = 200)
#>  [1] "000"         "050"         "agropyron"   "beckmannia"  "bromus"      "carex"       "common"
#>  [8] "disturbance" "draba"       "grass"       "long"        "native"      "plant"       "plants"
#> [15] "sedge"       "species"     "spp"         "syzigachne"  "the"         "var"         "water"
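Note that "the" still shows up in the frequent terms even though English stopwords were removed. One likely reason is that removeWords() matches case-sensitively, so a capitalized "The" survives the stopword step and is only lowercased later when TermDocumentMatrix() builds the matrix. A minimal sketch of one way around this, using two made-up sentences instead of the BHL text, is to lowercase the corpus first:

```r
library("tm")

# Made-up example documents, not BHL output
docs <- c("The sedge grows near the water.", "The grass is native.")
corp <- VCorpus(VectorSource(docs))

# Lowercase before removing stopwords so that "The" matches "the"
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, removePunctuation)

tdm <- TermDocumentMatrix(corp)
Terms(tdm)
```

The same tolower step could be added before the removeWords call in the pipeline above, although the frequent terms would then differ from the output shown.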

Apologies - there was a small bug in the check for the BHL API key, and we forgot to mention that users need an API key. The instructions above are updated, and a fix has been pushed to the package. If you’ve already installed the package, get a BHL key, reinstall, and try again.