I’ve been working on an R package that wraps an API that serves US patent data, and I was considering submitting it to rOpenSci. However, I noticed that you all have a policy that says you don’t support packages dealing with government data (though I did notice a few packages like noaa that suggest otherwise; perhaps those came before the policy). I’m wondering what the impetus was for this policy, and how you all see yourselves in relation to rOpenGov?
Thanks for your reply. The use cases for patent data in R are similar to those for publication data. For example, people who manage intellectual property assets like patent portfolios may want to: know who the top inventors or assignees are in a particular field, find patents similar to a given patent, map out the research landscape in a particular technology area, etc. More generally speaking, patent analysis can include calculating similarity measures (e.g., TF-IDF cosine similarity), simple aggregation/descriptive statistics, topic modeling, predictive analytics, network analysis, etc.
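To make the similarity-measure example concrete, here is a minimal sketch of TF-IDF cosine similarity. It is written in Python with only the standard library so that it is self-contained; the three "abstracts" are invented toy strings rather than real patent text, and the function names `tfidf_vectors` and `cosine_similarity` are hypothetical helpers, not part of any patent package. The weighting used (raw term count times a smoothed inverse document frequency) is one common variant among several.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Weight = raw term count * smoothed IDF, where
    IDF = ln((1 + n_docs) / (1 + doc_freq)) + 1.
    """
    # Naive whitespace tokenization; real text would need proper cleaning.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy stand-ins for patent abstracts (assumed data, for illustration only).
abstracts = [
    "lithium battery electrode with coated anode",
    "coated anode for a lithium ion battery",
    "method for brewing coffee with a paper filter",
]

vecs = tfidf_vectors(abstracts)
sim_batteries = cosine_similarity(vecs[0], vecs[1])  # two battery abstracts
sim_unrelated = cosine_similarity(vecs[0], vecs[2])  # battery vs. coffee
```

With these toy inputs, the two battery abstracts should score noticeably higher than the battery and coffee pair, which is the basic idea behind finding patents similar to a given patent.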
There’s a ton of literature surrounding data-driven IP management and using patent data to inform S&T planning. Maybe these examples will provide more context:
It’s great news, Chris, that you are working on an R package for the USPTO data. In terms of the use cases, there is the issue of access to patent data by patent offices and researchers from universities in developing countries who do not have the resources to pay the high costs of commercial databases. In the patent analytics training for patent offices and researchers in developing countries that WIPO is running (and I teach on), we have been teaching introductory analytics including R and rOpenSci packages (mainly rplos at the moment, as an easy intro). The USPTO has recently moved to an open access service in JSON format, and I assume, Chris, this is what you have been working with?
In addition to the uses by patent offices and researchers in developing countries, the bibliometrics and scientometrics community makes a lot of use of patent data, ranging from basic technology trends to text and technology mining, policy analysis and econometrics. Top journals in that area are Scientometrics and Research Policy. So, in my view, Scott, there is a large research community out there who want to work with patent data but typically can’t afford it except in quite limited form. In my own work at Manchester University with the scientometrics team at the business school, I am working on pushing R as a means for easier access to patent data and for wrangling patent data for a range of analytics purposes.
I would mention that I have also been experimenting with creating some R packages for patent data: opsr for the European Patent Office Open Patent Services API, and lensr for the Lens patent database (which advocates open source and open access). I am presently purrring my way through the eternal nested lists of European Patent data with a view to eventual submission to rOpenSci, and am also some way along with the lensr package. In the case of WIPO training for developing countries, we use an early dev version of the opsr package to introduce the idea and will also shortly start demoing the lensr package. If a USPTO data package becomes available, it is pretty much a racing certainty it would be used in the analytics training around the world.
The work the USPTO has been doing for the API is pretty cool. It seems well formatted (compared with the other data sources) and includes lots of possibilities for things like mapping inventor locations with leaflet, digging into the literature citations in patents and linking across to other APIs such as crossref, or text mining with tidy text mining, and so on. The PatentsView API is the main point of interest at the moment.
Apologies for being long-winded, but the key point is that there are a number of different communities who would, I think, make use of a package to access USPTO data in R. My own view on the rOpenSci or rOpenGov side of things is that patent data is fundamentally about science and technology, and the exciting aspects of patent data are what it has to tell us about trends and developments in science and technology. In my case, that involves linking biology and patent data (e.g. taxize, rgbif, rcrossref, etc.). That, however, is just my take on things. So, I would like to support Chris in this exciting idea.
For context: rOpenGov (as well as rOpenHealth and cloudyr) has overlapping subject areas with us. But rOpenGov and the others, at this stage, don’t do review of the sort we do; they tend to be more package hosts and project incubators. So, in general, we’ve begun to accept packages that fit into our fit guidelines even if they match the topic area of one of the other groups. We need to update our instructions to reflect this. While everything is subject to review at the point of submission, topic-wise, we’d welcome the packages @poldham described above.
Thanks for taking the time to respond. As Paul guessed, I have been writing a wrapper around the PatentsView API called patentsview. From my view, each of the open source patent APIs has its own advantages and disadvantages, which are mostly a reflection of the data it offers and which fields can be searched. I particularly like the Lens, as you can search on the full text of the patent, which is very uncommon. I doubt that the PatentsView API will ever offer that, though they are adding new fields on a regular basis. In fact, I’m waiting on the new version of the API to submit my package to CRAN. It sounds like it also might work for rOpenSci, which would be great, especially if it could be among friends like lensr and opsr. I’ll read over the submission guidelines in more depth and take it from there.