Feedback on text mining in rcrossref package

rcrossref is our R client for the various Crossref APIs. One new feature in rcrossref is support for text mining. The details are a bit complicated, but here’s the simple version: publishers that work with Crossref give Crossref URLs for text mining purposes (e.g., http://plos.org/10.11111/download), and those links are returned when you ask for metadata on a Crossref DOI. We can then use those links to download content in various formats, including XML, PDF, and sometimes plain text.
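
To make the link step concrete, here is a minimal sketch in base R of pulling text mining URLs out of a metadata record. The `record` list below is hand-written, made-up data shaped like the link entries that come with Crossref metadata (a URL plus a content type); the helper name `links_by_type` and the example.org URLs are assumptions for illustration and are not part of rcrossref.

```r
# Hypothetical metadata record; the DOI and URLs are made up for illustration.
record <- list(
  DOI = "10.5555/example",
  link = list(
    list(URL = "http://example.org/10.5555/example.xml",
         `content-type` = "text/xml"),
    list(URL = "http://example.org/10.5555/example.pdf",
         `content-type` = "application/pdf")
  )
)

# Return a named character vector of URLs, keyed by a short format name.
# Link entries with an unrecognised content type are dropped.
links_by_type <- function(record) {
  formats <- c("text/xml" = "xml", "application/xml" = "xml",
               "application/pdf" = "pdf", "text/plain" = "plain")
  types <- vapply(record$link, function(x) x[["content-type"]], character(1))
  urls <- vapply(record$link, function(x) x[["URL"]], character(1))
  keep <- types %in% names(formats)
  stats::setNames(urls[keep], unname(formats[types[keep]]))
}

links_by_type(record)
```

In the package itself this lookup is what cr_ft_links() does for you, as shown in the demo further down.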

There are probably lots of edge cases given all the different publishers out there, so we’d love to get any feedback on this package to help squash as many bugs as possible.

The package is at https://github.com/ropensci/rcrossref and installation instructions are at https://github.com/ropensci/rcrossref#installation

Here’s a quick demo of getting text mining links and then the text itself:

Get links

First search for articles with cr_works(), limiting results to those with full text, and make a vector of DOIs:

out <- cr_works(filter=c(has_full_text = TRUE))
dois <- out$data$DOI

Then use cr_ft_links() to get the links:

cr_ft_links(dois[1], "all")
#> $xml
#> <url> http://api.elsevier.com/content/article/PII:S0362546X97007037?httpAccept=text/xml
#> 
#> $plain
#> <url> http://api.elsevier.com/content/article/PII:S0362546X97007037?httpAccept=text/plain
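
If you want to probe several DOIs at once while hunting for edge cases, one option is to wrap each call so that an error is recorded instead of stopping the loop. The sketch below is only an illustration: `try_each` is a hypothetical helper written in base R, not an rcrossref function, and it takes whatever fetching function you pass in.

```r
# Apply `fetch` to each id, recording errors instead of stopping.
# Returns a list named by id; each element has id, ok, value, message.
try_each <- function(ids, fetch) {
  results <- lapply(ids, function(id) {
    tryCatch(
      list(id = id, ok = TRUE, value = fetch(id), message = NA_character_),
      error = function(e) {
        list(id = id, ok = FALSE, value = NULL, message = conditionMessage(e))
      }
    )
  })
  stats::setNames(results, ids)
}

# Offline demonstration with a stand-in fetcher that fails for one id.
stub <- function(d) if (d == "bad") stop("no links") else paste0("url:", d)
res <- try_each(c("a", "bad"), stub)
failed <- Filter(function(r) !r$ok, res)
names(failed)

# With rcrossref loaded and `dois` from above, it might be called as:
# res <- try_each(dois[1:5], function(d) cr_ft_links(d, "all"))
```

The DOIs and messages collected in `failed` are the kind of detail that makes a bug report easy to reproduce.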

Get full text

Search for articles from the publisher Pensoft and get links:

out <- cr_members(2258, filter=c(has_full_text = TRUE), works = TRUE)
(links <- cr_ft_links(out$data$DOI[1], "all"))
#> $pdf
#> <url> http://phytokeys.pensoft.net/lib/ajax_srv/article_elements_srv.php?action=download_pdf&item_id=4190
#> 
#> $xml
#> <url> http://phytokeys.pensoft.net/lib/ajax_srv/article_elements_srv.php?action=download_xml&item_id=4190

Then get the XML:

cr_ft_text(links, "xml")
#> <?xml version="1.0" encoding="UTF-8"?>
#> <!DOCTYPE article PUBLIC "-//TaxonX//DTD Taxonomic Treatment Publishing DTD v0 20100105//EN" "tax-treatment-NS0.dtd">
#> <article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tp="http://www.plazi.org/taxpub" article-type="research-article" dtd-version="3.0" xml:lang="en">
#>   <front>
#>     <journal-meta>

... cutoff

Or get the PDF:

cr_ft_text(links, "pdf")
#> <document>/Users/sacmac/.crossref/10.3897.phytokeys.42.7604.pdf
#>   Pages: 7
#>   Title: Dorstenia luamensis (Moraceae), a new species from eastern Democratic Republic of Congo
#>   Producer: Adobe PDF Library 10.0.1
#>   Creation date: 2014-10-24
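
Since publishers differ in which formats they provide, it can help to choose a format from whatever links came back rather than hard-coding one. The following is a small sketch under the assumption that `links` is a named list like the one printed above (names such as pdf, xml, plain); `pick_format` and its preference order are hypothetical, not part of rcrossref.

```r
# Pick the first preferred format that is present among the link names.
# Returns NA if none of the preferred formats is available.
pick_format <- function(links, preferred = c("xml", "plain", "pdf")) {
  available <- intersect(preferred, names(links))
  if (length(available) == 0) return(NA_character_)
  available[1]
}

# Offline demonstration with stand-in link lists (the values are placeholders).
pick_format(list(pdf = "pdf-url", xml = "xml-url"))
pick_format(list(pdf = "pdf-url"))

# With rcrossref loaded and `links` from above, it might be used as:
# fmt <- pick_format(links)
# if (!is.na(fmt)) cr_ft_text(links, fmt)
```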