Using rcrossref to generate a set of sample DOIs

The cr_r function of the rcrossref package is great for generating a list of random DOIs, e.g.

dois <- cr_r(15000, filter=c(from_pub_date='2013-01-01', until_pub_date='2013-12-31', type='journal-article'))
csv <- unique(

What I would like to do is collect additional information (date-issued, title) about these DOIs, not just the DOI itself. How can I best do that?

hi @mfenner - thanks for the question.

In case it wasn’t clear: cr_r() is a wrapper around cr_works(), and any arguments you give it are passed on to cr_works().

Thus, to get all details just use cr_works() with the sample parameter.

cr_works(sample = 2)

#> $meta
#>   total_results search_terms start_index items_per_page
#> 1      71392881           NA           0             20
#> $data
#> Source: local data frame [2 x 21]
#>   issued score                                prefix                   container.title reference.count  deposited
#> 1   1992     1                                  none MR Imaging of the Skull and Brain               0 2011-11-16
#> 2 2000-3     1                 Nuclear Physics A               0   2012-1-9
#> Variables not shown: title (chr), type (chr), DOI (chr), URL (chr), source (chr), publisher (chr), indexed (chr), page
#>   (chr), ISBN (chr), subject (chr), author (chr), issue (chr), ISSN (chr), volume (chr), member (chr)
#> $facets

Alternatively, we could add a parameter to cr_r() to give back all details instead of just DOIs - do you think you’d prefer that? So that would then allow

cr_r(10, details = TRUE)

to give back results as above for cr_works()
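For the fields asked about (DOI, issue date, title), a minimal sketch of trimming the cr_works() data frame could look like the following. The helper pick_fields() is hypothetical and not part of rcrossref, the sample data frame is made up so the example runs without network access, and the column names are taken from the output shown above, so they may differ in other rcrossref versions.

```r
# Hypothetical helper: keep only the requested columns that are present.
pick_fields <- function(df, fields = c("DOI", "issued", "title")) {
  df[, intersect(fields, names(df)), drop = FALSE]
}

# Made-up stand-in for cr_works(sample = 2)$data.
sample_data <- data.frame(
  DOI = c("10.1234/example.1", "10.1234/example.2"),
  issued = c("1992", "2000-3"),
  title = c("First title", "Second title"),
  score = c(1, 1),
  stringsAsFactors = FALSE
)

picked <- pick_fields(sample_data)

# With rcrossref installed and network access, the same helper would apply
# to a live result:
# library(rcrossref)
# picked <- pick_fields(cr_works(sample = 2)$data)
```

Using intersect() means a missing column is skipped silently rather than raising an error, which may or may not be what you want.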

@mfenner did that answer your question? any thoughts?

Scott, thanks for the explanation. Using cr_works is fine; I don’t think there is a need to change cr_r.

@mfenner great, glad that works

The code I ended up using is the following:


library(rcrossref)
library(stringr)

# return the first comma-separated value of a column, converted to UTF-8
splitColumn <- function(df, colname) {
  string <- df[colname]
  str_split(iconv(string, "latin1", "UTF-8"), ",")[[1]][1]
}

# collapse runs of whitespace in a column to single spaces, converted to UTF-8
trimWhitespace <- function(df, colname) {
  string <- df[colname]
  iconv(gsub("\\s+", " ", string), "latin1", "UTF-8")
}

# fetch 20 samples of journal articles published in 2013, one CSV file per sample
for (i in 1:20) {
  result <- cr_works(sample = 1000, filter = c(from_pub_date = '2013-01-01', until_pub_date = '2013-12-31', type = 'journal-article'))
  # drop duplicate records within the sample
  data <- unique(as.data.frame(result$data))
  data$doi <- data$DOI
  data$publication_date <- data$issued
  data$title <- apply(data, 1, trimWhitespace, colname = "title")
  # only fetch first journal title
  data$journal <- apply(data, 1, splitColumn, colname = "container.title")
  # only fetch first ISSN
  data$issn <- apply(data, 1, splitColumn, colname = "ISSN")
  # extract CrossRef member id from URL
  data$member_id <- str_extract(data$member, "\\d+")
  data <- subset(data, select = c("doi", "publication_date", "title", "journal", "issn", "publisher", "member_id"))
  file <- paste("random_crossref_dois", "_", "2013", "_", i, ".csv", sep = "")
  write.csv(data, file, row.names = FALSE, fileEncoding = "UTF-8")
}

I submitted a pull request to format dates with zero padding, e.g. 2013-02-08 instead of 2013-2-8, to be more consistent with ISO 8601. And as you can see, I do some tweaking of ISSN, container.title and title.
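To illustrate the kind of tweaking described here, this is a rough base-R sketch of zero-padding dates and of vectorised versions of the title/ISSN clean-up. The names pad_date(), first_item() and squish() are made up for this example; this is not the code from the pull request, and it assumes the columns hold comma-separated strings as in the output earlier in the thread.

```r
# Zero-pad month and day, keeping the precision of partial dates:
# "2013-2-8" becomes "2013-02-08", "2000-3" becomes "2000-03".
pad_date <- function(x) {
  vapply(strsplit(x, "-", fixed = TRUE), function(parts) {
    if (length(parts) == 0 || anyNA(parts)) return(NA_character_)
    parts[-1] <- sprintf("%02d", as.integer(parts[-1]))
    paste(parts, collapse = "-")
  }, character(1))
}

# Keep only the text before the first comma (first journal title or ISSN).
first_item <- function(x) trimws(sub(",.*$", "", x))

# Collapse runs of whitespace (including line breaks) to single spaces.
squish <- function(x) trimws(gsub("\\s+", " ", x))

padded <- pad_date(c("2013-2-8", "2000-3", "1992"))
issn <- first_item("0375-9474,1873-1554")
title <- squish("MR  Imaging\n of the Skull and Brain")
```

Because these helpers work on whole vectors, they could be applied to a column directly (for example data$issn <- first_item(data$ISSN)) instead of going row by row with apply().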

Thanks for sharing, @mfenner. Given your use case here, is there anything else you think could be improved?

I think the current functionality is fine. If you want to improve the library, I would suggest adding some Citeproc parsing functionality, e.g. for dates and author names. There are other places besides the CrossRef REST API that use Citeproc JSON (e.g. lagotto for dates), so some of the code could be reused.
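As a rough idea of what date parsing for Citeproc JSON might involve: a date there is an object with a "date-parts" array such as [[2013, 2, 8]], where month and day are optional. The sketch below assumes the JSON has already been parsed into an R list; csl_date_to_iso() is a hypothetical name, not an existing rcrossref function, and only the first entry of "date-parts" is used, so date ranges are not handled.

```r
# Hypothetical helper: turn a parsed Citeproc JSON date into an ISO 8601 string.
csl_date_to_iso <- function(csl_date) {
  date_parts <- csl_date[["date-parts"]]
  if (is.null(date_parts) || length(date_parts) == 0) return(NA_character_)
  # Only the first entry is used; a second entry would mark the end of a range.
  parts <- as.integer(unlist(date_parts[[1]]))
  if (length(parts) == 0 || anyNA(parts)) return(NA_character_)
  paste(c(sprintf("%04d", parts[1]), sprintf("%02d", parts[-1])), collapse = "-")
}

full_date  <- csl_date_to_iso(list(`date-parts` = list(c(2013, 2, 8))))
year_month <- csl_date_to_iso(list(`date-parts` = list(c(2000, 3))))
year_only  <- csl_date_to_iso(list(`date-parts` = list(1992)))
```

Keeping the output at the precision of the input (year, year-month or full date) avoids inventing a month or day that the record does not contain.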

@mfenner Good idea on citeproc. Looking at that now. It might make sense to have a separate package for this functionality…

@mfenner okay, started to play around with CSL parsing separately

Cool. You should also look at Citeproc JSON, the underlying data model, e.g. here:

Thanks for the link Martin. Checking it out