Using rcrossref to generate set of sample DOIs

mfenner · January 20, 2015, 7:20pm

The cr_r function of the rcrossref package is great for generating a list of random DOIs, e.g.

dois <- cr_r(15000, filter=c(from_pub_date='2013-01-01', until_pub_date='2013-12-31', type='journal-article'))
csv <- unique(as.data.frame(dois))

What I would like to do is collect additional information (date-issued, title) about these DOIs, not just the DOI itself. How can I best do that?

sckott · January 20, 2015, 8:13pm

hi @mfenner - thanks for the question.

In case it wasn’t clear, cr_r() wraps cr_works(), so any extra arguments you pass are handed on to cr_works().

Thus, to get all details just use cr_works() with the sample parameter.

cr_works(sample = 2)

#> $meta
#>   total_results search_terms start_index items_per_page
#> 1      71392881           NA           0             20
#> 
#> $data
#> Source: local data frame [2 x 21]
#> 
#>   issued score                                prefix                   container.title reference.count  deposited
#> 1   1992     1                                  none MR Imaging of the Skull and Brain               0 2011-11-16
#> 2 2000-3     1 http://id.crossref.org/prefix/10.1016                 Nuclear Physics A               0   2012-1-9
#> Variables not shown: title (chr), type (chr), DOI (chr), URL (chr), source (chr), publisher (chr), indexed (chr), page
#>   (chr), ISBN (chr), subject (chr), author (chr), issue (chr), ISSN (chr), volume (chr), member (chr)
#> 
#> $facets
#> NULL

Alternatively, we could add a parameter to cr_r() to give back all details instead of just DOIs - do you think you’d prefer that? So that would then allow

cr_r(10, details = TRUE)

to give back results as above for cr_works()

sckott · January 22, 2015, 8:04pm

@mfenner did that answer your question? any thoughts?

mfenner · January 23, 2015, 8:23pm

Scott, thanks for the explanation. Using cr_works is fine, I don’t think there is a need to change cr_r.

sckott · January 23, 2015, 8:42pm

@mfenner great, glad that works

mfenner · January 24, 2015, 5:21pm

The code I ended up using is the following:

library('rcrossref')
library('stringr')

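# Called once per row via apply(); `df` is a single row coerced to a
# character vector. Returns the first value of a comma-separated field.
splitColumn <- function(df, colname) {
  string <- df[colname]
  str_split(iconv(string, "latin1", "UTF-8"), ",")[[1]][1]
}

# Called once per row via apply(); collapses runs of whitespace
# (including line breaks) in the field to a single space.
trimWhitespace <- function(df, colname) {
  string <- df[colname]
  iconv(gsub("\\s+"," ",string), "latin1", "UTF-8")
}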

for (i in 1:20) {
  result <- cr_works(sample = 1000, filter=c(from_pub_date='2013-01-01', until_pub_date='2013-12-31', type='journal-article'))
  data <- unique(as.data.frame(result$data))
  data$doi <- data$DOI
  data$publication_date <- data$issued
  data$title <- apply(data, 1, trimWhitespace, colname = "title")
  # only fetch first journal title
  data$journal <- apply(data, 1, splitColumn, colname = "container.title")
  # only fetch first ISSN
  data$issn <- apply(data, 1, splitColumn, colname = "ISSN")
  # extract CrossRef member id from URL
  data$member_id <- str_extract(data$member, "\\d+")
  data <- subset(data, select=c("doi", "publication_date", "title", "journal", "issn", "publisher", "member_id"))
  file <- paste("random_crossref_dois", "_", "2013", "_", i, ".csv", sep = "")
  write.csv(data, file, row.names=FALSE, fileEncoding="UTF-8")
}

I did a pull request to format dates with zero padding, e.g. 2013-02-08 instead of 2013-2-8, to be more consistent with ISO 8601. And as you can see, I do some tweaking with ISSN, container.title and title.
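For illustration, the same zero padding can also be applied on the client side. This is a minimal sketch in base R, not the code from the pull request: `pad_partial_date` is a hypothetical helper name, and it assumes the dates arrive as `year`, `year-month` or `year-month-day` strings, as in the `issued` and `deposited` columns shown earlier in the thread.

```r
# Hypothetical helper (not part of rcrossref): zero-pad the month and day
# of a partial date string such as "2013-2-8" or "2000-3".
pad_partial_date <- function(x) {
  pieces <- strsplit(as.character(x), "-", fixed = TRUE)
  vapply(pieces, function(parts) {
    # Missing or empty input stays missing.
    if (length(parts) == 0 || anyNA(parts)) return(NA_character_)
    # Keep the year untouched; pad month and day to two digits.
    if (length(parts) > 1) {
      parts[-1] <- sprintf("%02d", as.integer(parts[-1]))
    }
    paste(parts, collapse = "-")
  }, character(1))
}

pad_partial_date(c("1992", "2000-3", "2012-1-9", "2013-02-08"))
#> [1] "1992"       "2000-03"    "2012-01-09" "2013-02-08"
```

Year-only and year-month values are left at their original precision rather than being extended to a full date, so no information is invented.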

sckott · January 26, 2015, 5:35pm

thanks for sharing @mfenner - Given your use case here, anything else you think could be improved?

mfenner · January 27, 2015, 1:12pm

I think the current functionality is fine. If you want to improve the library, I would suggest adding some Citeproc parsing functionality, e.g. for dates and author names. There are other places besides the CrossRef REST API that use Citeproc JSON (e.g. lagotto for dates), so some of the code could be reused.
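To make the suggestion concrete, here is a rough sketch of what such parsing could look like for dates and author names. It is not an existing rcrossref API: `csl_date_to_iso` and `csl_author_names` are hypothetical names, and the input is assumed to be Citeproc JSON already parsed into nested R lists, with dates in a `date-parts` array and authors carrying `given`/`family` or `literal` fields.

```r
# Hypothetical helpers for Citeproc JSON fragments already parsed into
# nested R lists; neither function exists in rcrossref.

# Convert a date such as list(`date-parts` = list(list(2013, 2, 8)))
# into a zero-padded string, keeping the precision of the input.
csl_date_to_iso <- function(date) {
  parts <- unlist(date[["date-parts"]][[1]])
  if (length(parts) == 0) return(NA_character_)
  parts <- as.integer(head(parts, 3))        # year, month, day at most
  fmt <- c("%04d", "%02d", "%02d")[seq_along(parts)]
  paste(sprintf(fmt, parts), collapse = "-")
}

# Convert a list of author objects into display names.
# A `literal` name (e.g. an organisation) wins over given/family.
csl_author_names <- function(authors) {
  vapply(authors, function(a) {
    if (!is.null(a[["literal"]])) return(a[["literal"]])
    paste(c(a[["given"]], a[["family"]]), collapse = " ")
  }, character(1))
}

# Made-up example records, for illustration only.
issued <- list(`date-parts` = list(list(2013, 2, 8)))
authors <- list(
  list(given = "Jane", family = "Doe"),
  list(literal = "Example Consortium")
)

csl_date_to_iso(issued)
#> [1] "2013-02-08"
csl_author_names(authors)
#> [1] "Jane Doe"           "Example Consortium"
```

Because the helpers only depend on the shape of the parsed JSON, the same code would apply to any source that emits this format, which is the reuse argument made above.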

sckott · January 27, 2015, 5:22pm

@mfenner Good idea on citeproc. Looking at that now. It might make sense to have a separate package for this functionality…

sckott · January 27, 2015, 7:14pm

@mfenner okay, started to play around with CSL parsing separately https://github.com/sckott/csl

mfenner · January 27, 2015, 8:27pm

Cool. You should also look at Citeproc JSON, the underlying data model, e.g. here: https://github.com/citation-style-language/schema/blob/master/csl-data.json

sckott · January 27, 2015, 10:11pm

Thanks for the link Martin. Checking it out
