Feedback on text mining in rcrossref package

sckott · January 16, 2015, 9:16pm

rcrossref is our R client for the various Crossref APIs. One new feature in rcrossref is support for text mining. This is a bit complicated, but here’s the simple version: publishers that work with Crossref give Crossref URLs for text mining purposes (e.g., http://plos.org/10.11111/download), and those links are returned when you ask for metadata on a Crossref DOI. We can then use those links to download content in various formats, including XML, PDF, and sometimes plain text.
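To make the "links come back with the metadata" idea concrete, here is a minimal offline sketch. It assumes a metadata record carries a `link` field whose entries have `URL`, `content-type`, and `intended-application` keys. The record below is hand-written mock data (the DOI and URLs are made up), and `tm_links()` is a hypothetical helper, not part of rcrossref.

```r
# Mock metadata record (hand-written for illustration, not real API output)
record <- list(
  DOI = "10.5555/example",
  link = list(
    list(URL = "http://example.org/10.5555/example.xml",
         `content-type` = "text/xml",
         `intended-application` = "text-mining"),
    list(URL = "http://example.org/10.5555/example.pdf",
         `content-type` = "application/pdf",
         `intended-application` = "text-mining")
  )
)

# Hypothetical helper: keep the text-mining links and name each URL
# by its content type
tm_links <- function(rec) {
  keep <- Filter(
    function(x) identical(x[["intended-application"]], "text-mining"),
    rec$link
  )
  urls <- vapply(keep, function(x) x[["URL"]], character(1))
  types <- vapply(keep, function(x) x[["content-type"]], character(1))
  stats::setNames(urls, types)
}

tm <- tm_links(record)
tm[["text/xml"]]
```

In the package itself this lookup is what cr_ft_links() does for you, as shown in the demo further down.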

There are probably lots of edge cases given all the different publishers out there, so we’d love to get any feedback on this package to help squash as many bugs as possible.

The package is at https://github.com/ropensci/rcrossref

Installation instructions: https://github.com/ropensci/rcrossref#installation

Here’s a quick demo of getting text mining links and getting text itself:

Get links

Search for articles first with cr_works(), limiting to only those with full text, and make a vector of DOIs

library("rcrossref")
out <- cr_works(filter = c(has_full_text = TRUE))
dois <- out$data$DOI

Then use cr_ft_links() to get links

cr_ft_links(dois[1], "all")
#> $xml
#> <url> http://api.elsevier.com/content/article/PII:S0362546X97007037?httpAccept=text/xml
#> 
#> $plain
#> <url> http://api.elsevier.com/content/article/PII:S0362546X97007037?httpAccept=text/plain
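Since different publishers are likely to behave differently, it can help to run the link lookup over several DOIs and record which ones fail rather than stopping at the first error. The following is only a sketch: `collect_links()` and `failed_dois()` are hypothetical helpers, and `fake_fetch()` is a stand-in for cr_ft_links() with made-up DOIs so the example runs offline.

```r
# Hypothetical helper: apply `fetch` to each DOI and keep the error object
# instead of stopping when a lookup fails
collect_links <- function(dois, fetch) {
  res <- lapply(dois, function(d) tryCatch(fetch(d), error = function(e) e))
  names(res) <- dois
  res
}

# Names of the DOIs whose lookup raised an error
failed_dois <- function(res) {
  names(res)[vapply(res, inherits, logical(1), what = "error")]
}

# Stand-in for cr_ft_links() so the sketch runs offline;
# it fails on one made-up DOI
fake_fetch <- function(doi) {
  if (doi == "10.5555/bad") stop("no full text links found")
  list(xml = paste0("http://example.org/", doi, ".xml"))
}

res <- collect_links(c("10.5555/good", "10.5555/bad"), fake_fetch)
failed_dois(res)
```

With rcrossref loaded, the same helper could be called as collect_links(dois[1:5], function(d) cr_ft_links(d, "all")), and the failing DOIs would make useful bug reports.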

Get full text

Search for articles from the publisher Pensoft, and get links

out <- cr_members(2258, filter=c(has_full_text = TRUE), works = TRUE)
(links <- cr_ft_links(out$data$DOI[1], "all"))
#> $pdf
#> <url> http://phytokeys.pensoft.net/lib/ajax_srv/article_elements_srv.php?action=download_pdf&item_id=4190
#> 
#> $xml
#> <url> http://phytokeys.pensoft.net/lib/ajax_srv/article_elements_srv.php?action=download_xml&item_id=4190

Then get the XML

cr_ft_text(links, "xml")
#> <?xml version="1.0" encoding="UTF-8"?>
#> <!DOCTYPE article PUBLIC "-//TaxonX//DTD Taxonomic Treatment Publishing DTD v0 20100105//EN" "tax-treatment-NS0.dtd">
#> <article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tp="http://www.plazi.org/taxpub" article-type="research-article" dtd-version="3.0" xml:lang="en">
#>   <front>
#>     <journal-meta>
#>
#> ... cutoff

Or get the PDF

cr_ft_text(links, "pdf")
#> <document>/Users/sacmac/.crossref/10.3897.phytokeys.42.7604.pdf
#>   Pages: 7
#>   Title: Dorstenia luamensis (Moraceae), a new species from eastern Democratic Republic of Congo
#>   Producer: Adobe PDF Library 10.0.1
#>   Creation date: 2014-10-24
