As part of a collaboration between the California Academy of Sciences and rOpenSci, we’ve been working on a package to help researchers find relevant information from the literature on their species of interest.
The package is spplit, on GitHub at https://github.com/ropenscilabs/spplit
We want to help connect users to many sources of literature. However, right now we only have connectors for the Biodiversity Heritage Library. We’ll add more in the future, e.g., to Wikipedia, journal articles, etc.
If you have any interest in this, please give it a try and let us know what you think. We'd also like to hear about your use cases for connecting the species you study to the literature. We have a tag for use cases in the GitHub repo, so you can see which use cases we are already thinking about.
Thanks!
Installation
devtools::install_github(c("ropensci/rgbif", "ropensci/spocc"))
devtools::install_github("ropenscilabs/spplit")
library("spplit")
Example usage
First, get a BHL API key from http://www.biodiversitylibrary.org/getapikey.aspx. Once you have the key, store it as an environment variable, either in your .Renviron file or in another shell startup file, or set it as an R option in your .Rprofile file. spplit will use this key. Alternatively, you can pass your BHL key directly in calls to the BHL functions in spplit.
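As a rough sketch of the two storage options, the snippet below sets a key for the current session and reads it back. The variable name `BHL_KEY` and the option name `bhl_key` are assumptions for illustration, not names confirmed by this post; check the spplit documentation for the names it actually looks up.

```r
# Assumed names: BHL_KEY (environment variable) and bhl_key (R option).
# These are illustrative; spplit may expect different names.

# Option 1: environment variable. For a permanent setting, put the line
#   BHL_KEY=your-api-key
# in your .Renviron file instead of calling Sys.setenv() each session.
Sys.setenv(BHL_KEY = "your-api-key")

# Option 2: R option. For a permanent setting, put this call in .Rprofile.
options(bhl_key = "your-api-key")

# Read the key back, preferring the environment variable when it is set
key <- Sys.getenv("BHL_KEY", unset = getOption("bhl_key", default = ""))
if (!nzchar(key)) stop("No BHL key found")
```

Keeping the key in .Renviron or .Rprofile keeps it out of scripts that you share or commit.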
Search for occurrences
geom <- 'POLYGON((-124.07 41.48,-119.99 41.48,-119.99 35.57,-124.07 35.57,-124.07 41.48))'
x <- sp_occ_idigbio(geometry = geom, limit = 3)
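The geometry above is a well-known text (WKT) polygon whose ring starts and ends on the same corner. If you only have bounding box coordinates, a small helper can build the string for you. The function `bbox_to_wkt` below is a hypothetical helper written for this example in base R; it is not part of spplit.

```r
# Hypothetical helper (not part of spplit): build a closed WKT polygon
# from a bounding box given as longitude/latitude limits.
bbox_to_wkt <- function(xmin, ymin, xmax, ymax) {
  # Corners in order: upper left, upper right, lower right, lower left,
  # then the upper left again to close the ring
  corners <- rbind(
    c(xmin, ymax),
    c(xmax, ymax),
    c(xmax, ymin),
    c(xmin, ymin),
    c(xmin, ymax)
  )
  coords <- paste(corners[, 1], corners[, 2], collapse = ",")
  paste0("POLYGON((", coords, "))")
}

geom <- bbox_to_wkt(xmin = -124.07, ymin = 35.57, xmax = -119.99, ymax = 41.48)
```

The resulting `geom` can be passed as the `geometry` argument in the same way as the hand-written string.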
Get species list to search against the Biodiversity Heritage Library (BHL)
x <- sp_list(x)
Search the BHL, which returns metadata
x <- sp_bhl_meta(x)[1:3]
Pass metadata to get OCR’ed pages
res <- sp_bhl_ocr(x)
Save text to disk (or any database, etc.)
res %>% sp_bhl_save()
Mine the text
library("tm")
src <- VectorSource(unlist(res, use.names = FALSE))
corp <- VCorpus(src)
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, removePunctuation)
tdm <- TermDocumentMatrix(corp)
findFreqTerms(tdm, lowfreq = 200)
#> [1] "000" "050" "agropyron" "beckmannia" "bromus" "carex" "common"
#> [8] "disturbance" "draba" "grass" "long" "native" "plant" "plants"
#> [15] "sedge" "species" "spp" "syzigachne" "the" "var" "water"
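Note that "the" still shows up among the frequent terms. A likely reason is that `removeWords` is case sensitive and the stopword list is lowercase, so a capitalized "The" survives and is only lowercased later when the term-document matrix is built. The sketch below uses two made-up sentences (not BHL text) to show one way to order the cleaning steps so that text is lowercased first; treat it as an illustration rather than the pipeline used above.

```r
library("tm")

# Toy documents, invented for this example
docs <- c(
  "The sedge grows near water.",
  "The grass and the sedge are native plants."
)
toy <- VCorpus(VectorSource(docs))

# Lowercase first so that "The" matches the lowercase stopword "the"
toy <- tm_map(toy, content_transformer(tolower))
toy <- tm_map(toy, removePunctuation)
toy <- tm_map(toy, removeWords, stopwords("english"))
toy <- tm_map(toy, stripWhitespace)

toy_tdm <- TermDocumentMatrix(toy)

# Terms that occur at least twice across the toy corpus
terms <- findFreqTerms(toy_tdm, lowfreq = 2)
terms
```

Wrapping `tolower` in `content_transformer()` keeps the documents as tm text objects, which later `tm_map` steps expect.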