rcrossref is our R client for interacting with the various Crossref APIs. One new feature in rcrossref is support for text mining. This is a bit complicated, but here’s the simple version: publishers that work with Crossref give Crossref URLs for text mining purposes (e.g., http://plos.org/10.11111/download). Those links are returned when you ask for metadata on a Crossref DOI. We can then use those links to download content in various formats, including XML, PDF, and sometimes plain text.
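To make the mechanism concrete, here is a minimal sketch of where those links sit in a Crossref metadata record. It assumes the record has the shape returned by the Crossref REST API (a message object containing a link array) and uses the jsonlite package. The helper name extract_links and the sample record (DOI and URLs) are made up for illustration and are not part of rcrossref.

```r
library(jsonlite)

# Hypothetical helper: pull the full-text links out of a works record.
# `json` can be a JSON string, a file path, or a URL, as accepted by fromJSON().
extract_links <- function(json) {
  record <- fromJSON(json)
  record$message$link
}

# Made-up record, trimmed to the fields used here
sample_record <- '{
  "message": {
    "DOI": "10.1234/example",
    "link": [
      {"URL": "http://example.org/article.xml",
       "content-type": "application/xml",
       "intended-application": "text-mining"},
      {"URL": "http://example.org/article.pdf",
       "content-type": "application/pdf",
       "intended-application": "text-mining"}
    ]
  }
}'

# fromJSON() simplifies the link array into a data frame, one row per link
links <- extract_links(sample_record)
links$URL
links[["content-type"]]
```

Passing a works URL from the Crossref REST API instead of the sample string should return a live record, assuming network access. rcrossref wraps this kind of lookup so you do not have to parse the JSON yourself.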
There are probably lots of edge cases given all the different publishers out there, so we’d love to get any feedback on this package to help squash as many bugs as possible.
The package is at https://github.com/ropensci/rcrossref and installation instructions are at https://github.com/ropensci/rcrossref#installation
Here’s a quick demo of getting text mining links and then the text itself:
Get links
Search for articles first with cr_works(), limiting to only those with full text, and make a vector of DOIs
out <- cr_works(filter=c(has_full_text = TRUE))
dois <- out$data$DOI
Then use cr_ft_links() to get links
cr_ft_links(dois[1], "all")
#> $xml
#> <url> http://api.elsevier.com/content/article/PII:S0362546X97007037?httpAccept=text/xml
#>
#> $plain
#> <url> http://api.elsevier.com/content/article/PII:S0362546X97007037?httpAccept=text/plain
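Not every publisher offers every format, so it can help to choose one using a fallback order. The sketch below assumes the result of cr_ft_links(doi, "all") can be treated as a named list keyed by format, as the printed output suggests. pick_format is a hypothetical helper, not an rcrossref function, and the stand-in list uses placeholder URLs.

```r
# Hypothetical helper: return the name of the first preferred format
# that is present in `links`, or NA if none is available.
pick_format <- function(links, prefer = c("xml", "plain", "pdf")) {
  # intersect() keeps the order of its first argument, so `prefer` sets priority
  available <- intersect(prefer, names(links))
  if (length(available) == 0) NA_character_ else available[1]
}

# Stand-in for the output of cr_ft_links(), with placeholder URLs
example_links <- list(
  pdf = "http://example.org/article.pdf",
  plain = "http://example.org/article.txt"
)

best <- pick_format(example_links)
none <- pick_format(example_links, prefer = "xml")
```

The chosen name could then be passed as the format argument when fetching the text, with the NA case handled separately.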
Get full text
Search for articles from the publisher Pensoft (Crossref member 2258), and get links
out <- cr_members(2258, filter=c(has_full_text = TRUE), works = TRUE)
(links <- cr_ft_links(out$data$DOI[1], "all"))
#> $pdf
#> <url> http://phytokeys.pensoft.net/lib/ajax_srv/article_elements_srv.php?action=download_pdf&item_id=4190
#>
#> $xml
#> <url> http://phytokeys.pensoft.net/lib/ajax_srv/article_elements_srv.php?action=download_xml&item_id=4190
Then get the XML
cr_ft_text(links, "xml")
#> <?xml version="1.0" encoding="UTF-8"?>
#> <!DOCTYPE article PUBLIC "-//TaxonX//DTD Taxonomic Treatment Publishing DTD v0 20100105//EN" "tax-treatment-NS0.dtd">
#> <article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tp="http://www.plazi.org/taxpub" article-type="research-article" dtd-version="3.0" xml:lang="en">
#> <front>
#> <journal-meta>
#> ... cutoff
Or get the PDF
cr_ft_text(links, "pdf")
#> <document>/Users/sacmac/.crossref/10.3897.phytokeys.42.7604.pdf
#> Pages: 7
#> Title: Dorstenia luamensis (Moraceae), a new species from eastern Democratic Republic of Congo
#> Producer: Adobe PDF Library 10.0.1
#> Creation date: 2014-10-24
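Given the publisher edge cases mentioned earlier, a loop over many DOIs is safer if one failure does not stop the rest. This is a minimal sketch: safe_links is a hypothetical wrapper, and it takes the lookup function as an argument (defaulting to rcrossref's cr_ft_links) so the error handling can be tried without network access using a stub. The DOIs and URLs in the stub are made up.

```r
# Hypothetical wrapper: look up links for each DOI, returning NULL
# (with a warning) for any DOI whose lookup fails.
safe_links <- function(dois, fetch = rcrossref::cr_ft_links) {
  results <- lapply(dois, function(doi) {
    tryCatch(
      fetch(doi, "all"),
      error = function(e) {
        warning("No links for ", doi, ": ", conditionMessage(e), call. = FALSE)
        NULL
      }
    )
  })
  names(results) <- dois
  results
}

# Stub standing in for cr_ft_links(); fails for one made-up DOI
stub_fetch <- function(doi, type) {
  if (doi == "10.1234/broken") stop("not found")
  list(xml = paste0("http://example.org/", doi, ".xml"))
}

res <- suppressWarnings(safe_links(c("10.1234/ok", "10.1234/broken"), stub_fetch))

# Keep only the DOIs whose lookup succeeded
succeeded <- names(Filter(Negate(is.null), res))
```

With the package loaded, safe_links(dois) would use the real lookup for the vector of DOIs built earlier.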