As part of a collaboration between the California Academy of Sciences and rOpenSci, we’ve been working on a package to help researchers find relevant information from the literature on their species of interest.
The package is spplit, available on GitHub at https://github.com/ropenscilabs/spplit
We want to help connect users to many sources of literature. Right now, however, we only have connectors for the Biodiversity Heritage Library. We’ll add more in the future, e.g., Wikipedia and journal articles.
If this interests you, please give it a try and let us know what you think. We’d also like to hear about your use cases for connecting the species you study to the literature. We have a tag for use cases in the GitHub repo, so you can see which ones we are already thinking about.
devtools::install_github(c("ropensci/rgbif", "ropensci/spocc"))
devtools::install_github("ropenscilabs/spplit")
library("spplit")
First, get a BHL API key from http://www.biodiversitylibrary.org/getapikey.aspx. Once you have the key, store it as an environment variable, either in your .Renviron file or in another shell environment file, or set it as an R option in your .Rprofile file. We’ll use this key in spplit. Alternatively, you can pass your BHL key directly in calls to the BHL functions.
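As a rough sketch of the two storage options, the base R snippet below looks for a key in an environment variable first and falls back to an R option. The names `BHL_KEY` and `bhl_key`, and the helper `get_bhl_key()`, are assumptions made for illustration; check the spplit documentation for the names the package actually reads.

```r
# Hypothetical names: spplit may read different ones
key_env <- "BHL_KEY"
key_opt <- "bhl_key"

# Hypothetical helper: look for the key in an environment variable first,
# then in an R option, and fail loudly if neither is set
get_bhl_key <- function(env = key_env, opt = key_opt) {
  key <- Sys.getenv(env, unset = "")
  if (!nzchar(key)) key <- getOption(opt, default = "")
  if (!nzchar(key)) {
    stop("No BHL key found; set ", env, " or options(", opt, " = ...)")
  }
  key
}

# Sets the variable for the current session only (placeholder value)
Sys.setenv(BHL_KEY = "<your key>")
get_bhl_key()
```

To persist the key across sessions, a line of the form `BHL_KEY=<your key>` in .Renviron, or `options(bhl_key = "<your key>")` in .Rprofile, would do the same job, again assuming those names.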
Search for occurrences
geom <- 'POLYGON((-124.07 41.48,-119.99 41.48,-119.99 35.57,-124.07 35.57,-124.07 41.48))'
x <- sp_occ_idigbio(geometry = geom, limit = 3)
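The geometry above is a rectangle written as well-known text (WKT): five longitude/latitude pairs, with the first corner repeated at the end to close the ring. If you would rather start from a bounding box, a small base R helper along these lines could build the string for you. `bbox_to_wkt()` is a made-up name for this sketch and is not part of spplit.

```r
# Hypothetical helper (not part of spplit): build a WKT rectangle from
# min/max longitude (x) and latitude (y), starting at the top-left corner
# and repeating it at the end to close the ring
bbox_to_wkt <- function(xmin, ymin, xmax, ymax) {
  sprintf("POLYGON((%s %s,%s %s,%s %s,%s %s,%s %s))",
          xmin, ymax, xmax, ymax, xmax, ymin, xmin, ymin, xmin, ymax)
}

geom <- bbox_to_wkt(xmin = -124.07, ymin = 35.57, xmax = -119.99, ymax = 41.48)
geom
```

With these four values the helper reproduces the polygon string used in the search above.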
Get species list to search against the Biodiversity Heritage Library (BHL)
x <- sp_list(x)
Search the BHL, which returns metadata
x <- sp_bhl_meta(x)[1:3]
Pass metadata to get OCR’ed pages
res <- sp_bhl_ocr(x)
Save text to disk (or any database, etc.)
res %>% sp_bhl_save()
Mine the text
library("tm")
src <- VectorSource(unlist(res, use.names = FALSE))
corp <- VCorpus(src)
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, removePunctuation)
tdm <- TermDocumentMatrix(corp)
findFreqTerms(tdm, lowfreq = 200)
#>  "000" "050" "agropyron" "beckmannia" "bromus" "carex" "common"
#>  "disturbance" "draba" "grass" "long" "native" "plant" "plants"
#>  "sedge" "species" "spp" "syzigachne" "the" "var" "water"
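To make the mining step easier to follow without BHL output at hand, here is a base R sketch of what counting terms and keeping those above a frequency threshold amounts to. The toy pages, the short stop word list, and the helper `frequent_terms()` are all made up for illustration, and tm’s own tokenization differs in detail.

```r
# Toy stand-ins for OCR'ed pages (made-up text, not BHL output)
pages <- c(
  "The sedge Carex grows near water.",
  "Carex species and the grass Bromus are native plants.",
  "Bromus is a grass; Carex is a sedge."
)

# Hypothetical helper: lowercase, replace punctuation with spaces, split on
# whitespace, drop stop words, and keep terms seen at least `lowfreq` times
frequent_terms <- function(docs, lowfreq = 2,
                           stop_words = c("the", "a", "is", "and", "are", "near")) {
  cleaned <- gsub("[[:punct:]]", " ", tolower(docs))
  tokens <- unlist(strsplit(cleaned, "[[:space:]]+"))
  tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]
  counts <- table(tokens)
  sort(names(counts)[counts >= lowfreq])
}

frequent_terms(pages)
#> [1] "bromus" "carex"  "grass"  "sedge"
```

The sketch lowercases the text before dropping stop words. Stop word matching is typically case-sensitive, so a capitalized "The" can slip past a lowercase stop word list, which may be why "the" still appears among the frequent terms above.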