Mining Dataset Usage Metrics from Published Articles


Here I am again with a nebulous question.

Does anyone know of an established method for automatically extracting citations to data sets from published articles (in R or elsewhere)?

e.g. Water Survey of Canada requests that users cite their data with

“Extracted from the Environment and Climate Change Canada Real-time Hydrometric Data web site ( on [DATE]”

Obviously I could do this by searching for part of the above citation (in Google Scholar or wherever), downloading the full text, and then parsing it to see whether the citation phrase occurs in the text or bibliography, but that seems pretty excessive…

I perused rcrossref but I don’t think this functionality exists there (paging @sckott to prove me wrong).


Thanks for your question, @rywhale.

I’m not sure I completely understand the use case. You want to find certain citations in articles? If so, do they have to match an exact citation, or just cite a particular source no matter the citation format?


@sckott Yes, I want to find citations to a specific data set in articles. For the sake of example, let’s take the Hydrometric Data web site I linked above. I suppose the logical starting point would be to look for citations that match the format they specify. Then I might look for any citations that contain the URL (or a portion of it).

An example conclusion might be: “We found that 42 published articles cited the data set.”
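
For what it's worth, the matching step itself could be sketched roughly like this in Python, assuming the full text of each article has already been obtained as a plain string. The `URL_FRAGMENT` value is a hypothetical placeholder (not the real site address), and the function names are made up for illustration.

```python
import re

# The phrase the data provider asks users to cite (taken from the example above).
CITATION_PHRASE = (
    "Extracted from the Environment and Climate Change Canada "
    "Real-time Hydrometric Data web site"
)
# Hypothetical placeholder: substitute the real domain or URL portion here.
URL_FRAGMENT = "example.org/hydrometric"


def normalize(text):
    """Lowercase and collapse whitespace so line breaks from text extraction
    do not prevent a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def cites_dataset(full_text, phrase=CITATION_PHRASE, url_fragment=URL_FRAGMENT):
    """Return True if the text contains the citation phrase or the URL fragment."""
    haystack = normalize(full_text)
    return normalize(phrase) in haystack or url_fragment.lower() in haystack


def find_citing(articles):
    """articles maps an article identifier to its full text.
    Returns the sorted identifiers of the articles that cite the data set."""
    return sorted(key for key, text in articles.items() if cites_dataset(text))
```

`len(find_citing(articles))` would then give the number for a "We found that N published articles cited the data set" statement. Exact substring matching will miss reworded citations, so treat the count as a lower bound.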

Edit: just realized that I could stick the domain from the URL into Google Scholar and get articles that contain that URL in the body.


Okay, got it.

The problem is that citation data is very closely guarded. Publishers do not want to give it up without getting a lot of money in return. There are some options though.

As you said, you can take the Google Scholar approach. I actually do that with dozens of email alerts to find citations of rOpenSci software. It’s a bit manual, but it works quite well, finding mentions of an R package even if there’s no formal citation. There’s no legal way to use Google Scholar programmatically; they don’t want programmatic usage.

You can explore a few sources of citation data to see if any work for you:

With Crossref and DataCite you can search by DOI if your dataset(s) have DOIs. You can also do a general search with them, though it’s generally not a full-text search but rather a search over the public metadata fields such as title, authors, and keywords (sometimes the abstract is available).
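
As a rough illustration of those two kinds of lookup, here is a minimal Python sketch that builds request URLs for the public Crossref and DataCite REST APIs. The helper names and the example DOI are hypothetical, and the sketch makes no assumption about which fields the JSON responses contain, so inspect what comes back before relying on it.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"
DATACITE_DOIS = "https://api.datacite.org/dois"


def crossref_search_url(terms, rows=20):
    """Build a Crossref metadata search URL (a metadata search, not full text)."""
    return f"{CROSSREF_WORKS}?{urlencode({'query': terms, 'rows': rows})}"


def datacite_doi_url(doi):
    """Build the DataCite URL for the metadata record of a single DOI."""
    return f"{DATACITE_DOIS}/{quote(doi, safe='/')}"


def fetch_json(url, timeout=30):
    """Fetch a URL and decode the JSON body (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch_json(crossref_search_url("hydrometric data", rows=5))` would run a metadata search, and `fetch_json(datacite_doi_url("10.1234/example"))` would look up a dataset DOI (that DOI is made up).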

Let me know if you want to explore OCC.

If the datasets have DOIs, you can find citation data more easily. Do many of the datasets you’re interested in have DOIs?

The short answer is that if the datasets DO NOT have DOIs, then Google Scholar alerts are probably easiest, while if they do, you can probably track citations of those DOIs through DataCite.