fulltext v1: text-mining scholarly works

Authors: Scott Chamberlain

Text-mining - the art of answering questions by extracting patterns, data, etc. out of the published literature - is not easy.

It’s made incredibly difficult by publishers. The vast majority of publicly funded research across the globe is published in paywalled journals. That is, taxpayers pay twice for research: once for the grant to fund the work, then again to be able to read it. These paywalls mean that every potential text-miner will have different access: some through their university, some through their company, and others only to whatever happens to be open access. On top of that, access to paywalled journals often depends on your IP address - something that is not top of mind for most people.

Another hardship with text-mining is the huge number of publishers, combined with the lack of a standardized way to figure out the URL for the full text version of a scholarly work. There is the DOI (Digital Object Identifier) system used by Crossref, DataCite and others, but DOIs generally help you find the location of the scholarly work on a web page - the HTML version. What one probably wants for text-mining is the PDF or XML version, if available. Publishers can choose to include URLs for full text (PDF and/or XML) with Crossref’s metadata (e.g., see this Crossref API call and search for “link” on the page), but the problem is that it’s optional.
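
To make the “link” field concrete, here is a minimal sketch in R of how one might look up the full text URLs deposited for a single DOI. It assumes the jsonlite package is installed and reads the public Crossref REST API directly; the helper names fulltext_links() and crossref_links() are made up for this illustration and are not part of fulltext.

```r
# Sketch only: list the full text links a publisher deposited with Crossref.
# Assumes jsonlite is installed; the helper names below are hypothetical.
library(jsonlite)

# Extract the optional "link" entries from a parsed Crossref works record.
# `message` is the `message` element of a /works/{doi} response.
fulltext_links <- function(message) {
  links <- message$link
  if (is.null(links) || NROW(links) == 0) {
    # The publisher deposited no full text links for this work
    return(data.frame(url = character(), content_type = character(),
                      stringsAsFactors = FALSE))
  }
  data.frame(url = links$URL,
             content_type = links$`content-type`,
             stringsAsFactors = FALSE)
}

# Fetch one record from the Crossref REST API (needs network access).
crossref_links <- function(doi) {
  record <- fromJSON(paste0("https://api.crossref.org/works/", doi))
  fulltext_links(record$message)
}
```

Calling crossref_links() with a DOI returns a data frame with zero rows when the publisher deposited no links, which is exactly the optional-metadata problem described above.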

fulltext is a package to help R users address the above problems and get published literature from the web in its many forms, across all publishers.


Read the rest at https://ropensci.org/technotes/2018/01/17/fulltext-v1/

Hi Scott,
Thank you so much for this powerful tool. Right now it only allows extracting ten records per search. Is there any way to combine search terms, and get all the outputs to a df?


> res <- ft_search(query = 'covid', from = 'bmc', limit = 10)
> res$bmc
Query: [covid] 
Records found, returned: [66677, 10] 
License: [variable, see `openaccess` field in results] 
# A tibble: 10 x 17
   contenttype  identifier  url    title     creators publicationname   openaccess doi    publisher publicationdate
   <chr>        <chr>       <list> <chr>     <list>   <chr>             <chr>      <chr>  <chr>     <chr>          
 1 Chapter      doi:10.100… <df [… Comparat… <df [9 … Tracking and Pre… false      10.10… Springer  2022-01-01     
 2 Chapter      doi:10.100… <df [… COVID-19… <df [1 … Smart Villages    false      10.10… Springer  2022-01-01     
 3 Chapter Con… doi:10.100… <df [… Nonlinea… <df [4 … Expert Clouds an… false      10.10… Springer  2022-01-01     
 4 Chapter Con… doi:10.100… <df [… Public R… <df [3 … Intelligent Comp… false      10.10… Springer  2022-01-01     
 5 Chapter Con… doi:10.100… <df [… Modern T… <df [4 … Applied Informat… false      10.10… Springer  2022-01-01     
 6 Chapter Con… doi:10.100… <df [… Mapping … <df [4 … Intelligent Comp… false      10.10… Springer  2022-01-01     
 7 Chapter Con… doi:10.100… <df [… Towards … <df [3 … WITS 2020         false      10.10… Springer  2022-01-01     
 8 Chapter Con… doi:10.100… <df [… Death Pr… <df [2 … Advanced Computi… false      10.10… Springer  2022-01-01     
 9 Chapter Con… doi:10.100… <df [… Modeling… <df [4 … Artificial Intel… false      10.10… Springer  2022-01-01     
10 Chapter      doi:10.100… <df [… Applicat… <df [1 … Tracking and Pre… false      10.10… Springer  2022-01-01     
# … with 7 more variables: publicationtype <chr>, printisbn <chr>, electronicisbn <chr>, isbn <chr>, genre <chr>,
#   copyright <chr>, abstract <chr>