Authors: Scott Chamberlain
Text-mining - the art of answering questions by extracting patterns, data, etc. out of the published literature - is not easy.
It’s made incredibly difficult because of publishers. It is a fact that the vast majority of publicly funded research across the globe is published in paywall journals. That is, taxpayers pay twice for research: once for the grant to fund the work, then again to be able to read it. These paywalls mean that every potential person text-mining will have different access: some have access through their university, some may have access through their company, and others may only have access to whatever happens to be open access. On top of that, access for paywall journals often depends on your IP address - something not generally on top of mind for most people.
Another hardship with text-mining is the huge number of publishers together with no standardized way to figure out the URL for full text versions of a scholarly work. There is the DOI (Digital Object Identifier) system used by Crossref, Datacite and others, but those generally help you sort out the location of the scholarly work on a web page - the html version. What one probably wants for text-mining is the PDF or XML version if available. Publishers can optionally choose to include URLs for full text (PDF and/or XML) with Crossref’s metadata (e.g., see this Crossref API call and search for “link” on the page), but the problem is that it’s optional.
fulltext is a package to help R users address the above problems, and get published literature from the web in it’s many forms, and across all publishers.
Read the rest at https://ropensci.org/technotes/2018/01/17/fulltext-v1/