Possible new pkg idea: publication bias

A tedious task for many of the meta-analyses I do is retrieving an article count for every data point, usually species names, from different publication databases like Google Scholar, Web of Science, or PubMed. I was hoping there could be a package that would let me input my search terms, automatically loop them through these systems, and generate the article counts. I imagine this would be helpful for a wide range of uses, providing an easy way to assess publication counts for a large database or how publication counts change over time.

Disclosure - I am unfamiliar with package building, so if you need clarification or this doesn’t make sense, please let me know!


It’s not exactly clear what you mean by ‘article count for every species name’ or how this would help estimate publication bias, but if it is helpful, here is a script that queries Google Scholar for records by species+trait.

It is useful on its own; it could also be wrapped or reimplemented in a package.

Thanks for your question @arw36

Thanks for sharing @dlebauer

One solution is rcrossref:

library(rcrossref)
library(data.table)

Define a function

species_cr_search <- function(x, ...) {
  data.frame(
    species = x, 
    matches = cr_works(query = x, limit = 0, ...)$meta$total_results,
    stringsAsFactors = FALSE
  )
}

A species list

spp <- c("Poa annua", "Helianthus annuus", "Abies magnifica")

Apply function across species

rbindlist(lapply(spp, species_cr_search))
#>              species matches
#> 1:         Poa annua    2425
#> 2: Helianthus annuus    3446
#> 3:   Abies magnifica    4752

A cool thing about using Crossref is that you can set lots of different filters, field queries, etc. Here, constrain to publications whose container title (usually the journal name) contains “ecology”:

rbindlist(lapply(spp, species_cr_search, flq = c(`query.container-title` = 'ecology')))
#>              species matches
#> 1:         Poa annua      46
#> 2: Helianthus annuus      48
#> 3:   Abies magnifica     356

A caveat about Crossref is that it only searches the metadata exposed through its web services, which is authors, title, and in some cases abstract (http://api.crossref.org/works?filter=has-abstract:true&rows=0 shows about 824K papers). That is, it is not searching the full text of the papers.
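Crossref can also restrict counts by publication date, which is one way to look at how counts change over time. Below is a minimal sketch, assuming the `from_pub_date` and `until_pub_date` filters of `cr_works()`; the helper names `crossref_count` and `counts_by_year` and the `count_fun` argument are made up for illustration, the latter so the looping logic can be tried without network access.

```r
# Count Crossref records for one query within a publication-date window.
# Assumes rcrossref is installed; it is only needed when this function is called.
crossref_count <- function(query, from, until) {
  rcrossref::cr_works(
    query = query,
    filter = c(from_pub_date = from, until_pub_date = until),
    limit = 0
  )$meta$total_results
}

# Build one row per species/year combination and fill in the match count.
# `count_fun` is a hypothetical hook; by default it queries Crossref.
counts_by_year <- function(species, years, count_fun = crossref_count) {
  grid <- expand.grid(species = species, year = years, stringsAsFactors = FALSE)
  grid$matches <- mapply(
    function(sp, yr) count_fun(sp, paste0(yr, "-01-01"), paste0(yr, "-12-31")),
    grid$species, grid$year,
    USE.NAMES = FALSE
  )
  grid
}
```

Something like `counts_by_year(spp, c(1980, 2010))` would then send one request per species and year, so for long species lists it may be worth pausing between requests.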


You could use Google Scholar but you have to jump through more hoops as they don’t want people to programmatically scrape their data.

You could also use Scopus (e.g., “Wrapping Elsevier’s Sciencedirect/Scopus API?”), but I don’t know much about that.


One approach would be to create a pkg that can interface to many different sources and the user can choose.
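As a rough sketch of what such an interface might look like (not an existing package): each source is a function that takes a search term and returns one count, and the user picks a source by name. The names `backends` and `article_counts` are hypothetical, and the two backends shown assume `rcrossref::cr_works()` and `rentrez::entrez_search()` are available.

```r
# Hypothetical backends: each takes a search term and returns a single count.
# The packages are only needed when the corresponding backend is called.
backends <- list(
  crossref = function(term) {
    rcrossref::cr_works(query = term, limit = 0)$meta$total_results
  },
  pubmed = function(term) {
    rentrez::entrez_search(db = "pubmed", term = term, retmax = 0)$count
  }
)

# Look up the chosen backend and apply it to every term.
article_counts <- function(terms, source = "crossref", sources = backends) {
  if (!source %in% names(sources)) stop("Unknown source: ", source)
  fetch <- sources[[source]]
  data.frame(
    term = terms,
    source = source,
    matches = vapply(terms, function(t) as.numeric(fetch(t)), numeric(1),
                     USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}
```

Adding another database would then mean adding one more function to the list, e.g. `article_counts(spp, source = "pubmed")`.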


Any thoughts, @arw36, on the above comments?

Thanks @sckott and @dlebauer! The rcrossref package was a quick and easy solution. I ran it for my 280 species and had the matches a minute later. I am excited to see what modifications I can make, like adding search terms as suggested or extracting article counts at different time intervals (publications in 1980 vs. 2010, for instance).

One question about using Crossref: are matches text matches or individual papers? For instance, if “Artibeus fimbriatus” appears in the title and then multiple times in a paper’s abstract, would it be counted multiple times even though it is only one publication?

Also, I am a bit concerned about the specificity of Crossref. For instance, searches for Poa annua and “Poa annua” return the same counts and pick up matches for just Poa or just annua, rather than only items with both Poa and annua.

can you give an example?

which of these do you really want?

See https://github.com/CrossRef/rest-api-doc/blob/master/rest_api.md#parameters which for the query parameter lists

limited DisMax query terms

Which links to https://wiki.apache.org/solr/DisMax and therein to https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser

I don’t know what “limited DisMax” means specifically, but I’d just go with what’s documented at https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser

see also https://github.com/CrossRef/rest-api-doc/blob/master/rest_api.md#queries

Ideally I want the intersection of the genus AND the species to get the publication count. I think the issue is how Crossref handles queries with that limited DisMax method. To work around it, I modified the function you provided to run separately for “Genus species”, “Genus”, and “species”:

# `host` is my data frame of host species, one row per species
hostname <- as.vector(host$hHostNameIUCN)  # full binomial, "Genus species"
genus <- as.vector(host$hGenus)
species <- as.vector(host$hSpecies)

# Query Crossref three times per host: full name, genus only, species epithet only
species_cr_search <- function(x, y, z, ...) {
  data.frame(
    hostname = x,
    hmatches = cr_works(query = x, limit = 0, ...)$meta$total_results,
    genus = y,
    gmatches = cr_works(query = y, limit = 0, ...)$meta$total_results,
    species = z,
    smatches = cr_works(query = z, limit = 0, ...)$meta$total_results,
    stringsAsFactors = FALSE
  )
}

Then, from the output matrix, I combined the three counts to discount publications that match only one of the search terms rather than both.

researcheffort <- mapply(species_cr_search, x=hostname, y=genus, z=species)
researcheffort <- t(researcheffort) #flip matrix
rownames(researcheffort) <- c() # get rid of species rownames
researcheffort <- as.data.frame(researcheffort)
researcheffort$publications <- as.numeric(researcheffort$gmatches) + as.numeric(researcheffort$smatches) - as.numeric(researcheffort$hmatches)

What I end up with is a data frame with the counts for the separate search terms, plus what I believe is the actual number of publications.

Maybe this is a bit roundabout, but I could not query multiple times within the cr_works function.
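The arithmetic in that last step is inclusion-exclusion: if the “Genus species” query counts records matching either word, then records matching both words = genus matches + species matches - either matches. Whether Crossref's total really behaves like a strict "either" count is an assumption here, so treat the result as an estimate. A toy check with invented record IDs:

```r
# Toy record IDs, invented for illustration only
genus_hits   <- c("a", "b", "c", "d")  # records matching the genus
species_hits <- c("c", "d", "e")       # records matching the species epithet

either <- union(genus_hits, species_hits)      # what an OR-style query would count
both   <- intersect(genus_hits, species_hits)  # what we actually want

# Inclusion-exclusion recovers the size of the intersection from the three counts
estimated_both <- length(genus_hits) + length(species_hits) - length(either)
stopifnot(estimated_both == length(both))
```

If the full-name query does not behave like a pure OR (for example, because of relevance cut-offs), the subtraction can over- or under-count, and negative values would be a sign of that.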

You can also try fulltext - there are some defaults set for crossref in particular (that you can change) - so keep that in mind

install from dev version

devtools::install_github("ropensci/fulltext")

Taxa list

library(fulltext)
taxa <- c("Acerodon celebensis", "Anoura caudifer", 
          "Anoura geoffroyi", "Antrozous pallidus")

With Entrez

(taxa1_entrez <- ft_search(query = taxa[1], from = 'entrez', limit = 100))
#> Query:
#>   [Acerodon celebensis] 
#> Found:
#>   [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 7; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0] 
#> Returned:
#>   [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 7; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0] 

(taxa3_entrez <- ft_search(query = taxa[3], from = 'entrez', limit = 100))
#> Query:
#>   [Anoura geoffroyi] 
#> Found:
#>   [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 43; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0] 
#> Returned:
#>   [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 43; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0] 

With Scopus

(taxa1_scopus <- ft_search(query = taxa[1], from = 'scopus', limit = 100,
    scopusopts = list(key = Sys.getenv('ELSEVIER_SCOPUS_KEY'))))
#> Query:
#>   [Acerodon celebensis] 
#> Found:
#>   [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 10; Microsoft: 0] 
#> Returned:
#>   [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 10; Microsoft: 0] 

(taxa3_scopus <- ft_search(query = taxa[3], from = 'scopus', limit = 100,
    scopusopts = list(key = Sys.getenv('ELSEVIER_SCOPUS_KEY'))))
#> Query:
#>   [Anoura geoffroyi] 
#> Found:
#>   [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 143; Microsoft: 0] 
#> Returned:
#>   [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 100; Microsoft: 0]
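To get a table of counts for the whole taxa list rather than one search at a time, the “Found” number can be pulled out of each result. This is only a sketch: it assumes the object returned by `ft_search()` stores that number under `$<source>$found` (e.g. `res$entrez$found`), and the helper `found_counts` and its `search_fun` argument are made up for illustration.

```r
# Collect the "Found" count for each taxon from a single source.
# `search_fun` defaults to fulltext::ft_search (only needed when actually used);
# extra arguments such as scopusopts are passed through via `...`.
found_counts <- function(taxa, source = "entrez",
                         search_fun = fulltext::ft_search, ...) {
  data.frame(
    taxon = taxa,
    found = vapply(taxa, function(x) {
      res <- search_fun(query = x, from = source, limit = 1, ...)
      as.numeric(res[[source]]$found)
    }, numeric(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}
```

For example, `found_counts(taxa, source = "entrez")` would run one search per taxon and return a two-column data frame.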

Thanks @sckott. This is definitely the package I was looking for, especially because it uses rentrez to get PubMed results.

library(rentrez)
taxa <- c("Acerodon celebensis", "Anoura caudifer", 
          "Anoura geoffroyi", "Antrozous pallidus")

# Return the number of PubMed records matching a search term;
# retmax = 0 skips downloading record IDs since only the count is needed
species_pubmed_search <- function(x, ...) {
  as.numeric(entrez_search(db = "pubmed", term = x, retmax = 0, ...)$count)
}

# One row per taxon with its PubMed hit count
taxa_df <- data.frame(
  taxa = taxa,
  pubmed_hits = vapply(taxa, species_pubmed_search, numeric(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

pubmed_hits = PubMed hits

@arw36 Great! glad it works for you.

I’d love any feedback on that package.