Possible new pkg idea: publication bias

arw36 · February 6, 2017, 3:48pm

A tedious task for many of the meta-analyses I do is retrieving an article count for every data point, usually a species name, from publication databases like Google Scholar, Web of Science, or PubMed. I was hoping there could be a package that would let me input my search terms and automatically loop them through these systems to generate the article counts. I imagine this would be helpful for a wide range of uses, providing an easy way to assess publication counts for a large database or how publication counts change over time.

Disclosure - I am unfamiliar with package building, so if you need clarification or this doesn’t make sense, please let me know!

dlebauer · February 7, 2017, 2:48am

It’s not exactly clear what you mean by ‘article count for every species name’ or how this would help estimate publication bias, but if it is helpful, here is a script that queries Google Scholar for records by species + trait:

github.com

PecanProject/pecan/blob/master/modules/meta.analysis/inst/citation_search.py

# An automated web crawler was implemented to conduct a systematic survey of
# published trait values for plants common to tundra ecosystems. Given a list
# of species names, species, and parameter search terms, params, the crawl
# function searches the Google Scholar search engine for all possible
# combinations of species and search terms. Results are compiled and ordered
# by the frequency at which they occur throughout searches. The top most
# frequent results are automatically opened in the user's default web browser.
# The limit parameter indicates the maximum number of tabs to open, which was
# set to 50 for the purpose of our survey. Code extends the xgoogle library
# through the GoogleScholarSearch and PubSearchResult classes, which retrieve
# search pages and represent search results, respectively.

from xgoogle import BeautifulSoup , SearchResult
import webbrowser

def crawl(species, params, limit=-1): 
    occurences = {} 
    for sp in species: 
        for param in params: 
            searcher = GoogleScholarSearch(['"'+sp+'" ' + param]) 
            results = searcher.search() 
            if len(results) > 1: 
                for result in results: 
                    if result not in occurences: 
                        occurences[result] = 0 
                    occurences[result]+=1 
                break 
    results = sorted(occurences, key=lambda article: occurences[article]) 
    for result in results[:limit]: 
        if result.url:

This file has been truncated.

useful on its own; could also be wrapped or reimplemented in a package

sckott · February 7, 2017, 11:12am

Thanks for your question @arw36

Thanks for sharing @dlebauer

One solution is rcrossref:

library(rcrossref)
library(data.table)

Define a function

species_cr_search <- function(x, ...) {
  data.frame(
    species = x, 
    matches = cr_works(query = x, limit = 0, ...)$meta$total_results,
    stringsAsFactors = FALSE
  )
}

A species list

spp <- c("Poa annua", "Helianthus annuus", "Abies magnifica")

Apply function across species

rbindlist(lapply(spp, species_cr_search))
#>              species matches
#> 1:         Poa annua    2425
#> 2: Helianthus annuus    3446
#> 3:   Abies magnifica    4752

A cool thing about using Crossref is that you can set lots of different filters, etc. Here, constrain to publications whose container title (i.e. the journal name) contains “ecology”:

rbindlist(lapply(spp, species_cr_search, flq = c(`query.container-title` = 'ecology')))
#>              species matches
#> 1:         Poa annua      46
#> 2: Helianthus annuus      48
#> 3:   Abies magnifica     356

A caveat about Crossref is that they only search the text they provide in their web services, which is authors, title, and in some cases the abstract (http://api.crossref.org/works?filter=has-abstract:true&rows=0 shows about 824K papers with abstracts). That is, they’re not searching the full text of the papers.
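For anyone curious what sits underneath `cr_works(query = x, limit = 0)`, here is a minimal Python sketch of the same idea against the Crossref REST API directly. It assumes the `query` and `rows` parameters and the `message.total-results` field of the JSON response; the helper names (`crossref_count_url`, `total_results`) are made up for illustration, and the sample payload is hand-written and just reuses the Poa annua count from above.

```python
import json
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"

def crossref_count_url(query, **extra):
    """Build a Crossref /works URL that asks for zero rows (counts only)."""
    params = {"query": query, "rows": 0}
    params.update(extra)  # e.g. {"query.container-title": "ecology"}
    return CROSSREF_WORKS + "?" + urlencode(params)

def total_results(payload):
    """Pull the match count out of a decoded Crossref JSON response."""
    return int(payload["message"]["total-results"])

url = crossref_count_url("Poa annua", **{"query.container-title": "ecology"})
print(url)

# Hand-written payload in the assumed shape, not a live response.
sample = json.loads('{"status": "ok", "message": {"total-results": 2425, "items": []}}')
print(total_results(sample))
```

Fetching the URL and decoding the body as JSON would give a payload of this shape. Because `rows` is 0, no records are transferred, only the count.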

You could use Google Scholar but you have to jump through more hoops as they don’t want people to programmatically scrape their data.

You could also use Scopus (see e.g. the thread “Wrapping Elsevier’s Sciencedirect/Scopus API?”), but I don’t know much about that.

One approach would be to create a pkg that can interface to many different sources and the user can choose.
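As a rough illustration of that design, here is a small Python sketch of a registry where each source is a function from search term to hit count and the user picks the source by name. Everything in it is hypothetical: the registry, the decorator, and the two placeholder backends (`fake_a`, `fake_b`) return dummy numbers and would be replaced by real calls to Crossref, PubMed, Scopus, and so on.

```python
from typing import Callable, Dict, List

# Registry mapping a source name to a function that returns a hit count.
SOURCES: Dict[str, Callable[[str], int]] = {}

def register(name):
    """Decorator that adds a count function to the registry under `name`."""
    def wrap(fn):
        SOURCES[name] = fn
        return fn
    return wrap

@register("fake_a")
def fake_a(term: str) -> int:
    # Placeholder backend returning a dummy number (length of the term).
    return len(term)

@register("fake_b")
def fake_b(term: str) -> int:
    # Placeholder backend returning a dummy number (word count of the term).
    return term.count(" ") + 1

def article_counts(terms: List[str], source: str) -> Dict[str, int]:
    """Look up every term in the chosen source and return term -> count."""
    if source not in SOURCES:
        raise ValueError(f"unknown source {source!r}; choose from {sorted(SOURCES)}")
    return {term: SOURCES[source](term) for term in terms}

print(article_counts(["Poa annua", "Abies magnifica"], source="fake_a"))
```

The point of the registry is that adding a new database means adding one function, while the user-facing call stays the same.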

sckott · March 8, 2017, 12:39am

Any thoughts @arw36 on the above comments ?

arw36 · March 9, 2017, 6:30pm

Thanks @sckott and @dlebauer! Using the rcrossref package was the easiest and quickest solution for this. I ran it for my 280 species, and a minute later had the matches. I am excited to see the modifications I can make to this, like adding search terms as suggested or extracting article counts at different time intervals (publications in 1980 vs. 2010, for instance).
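For the time-interval idea, a minimal Python sketch of how a per-year count could be requested from the Crossref REST API. It assumes the `from-pub-date` and `until-pub-date` filters, combined with a comma in the `filter` parameter; the function names are hypothetical and the code only builds the URLs, it does not fetch them.

```python
from urllib.parse import urlencode

def year_filter(year):
    """Crossref filter string covering one calendar year of publication dates."""
    return f"from-pub-date:{year}-01-01,until-pub-date:{year}-12-31"

def count_url_for_year(species, year):
    """URL for a count-only (rows=0) Crossref query restricted to one year."""
    params = {"query": species, "rows": 0, "filter": year_filter(year)}
    return "https://api.crossref.org/works?" + urlencode(params)

for year in (1980, 2010):
    print(count_url_for_year("Artibeus fimbriatus", year))
```

Looping over a range of years instead of two would give a publication count per year for each species.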

One question about using Crossref: are matches text matches or individual papers? For instance, if “Artibeus fimbriatus” is in the title and then appears multiple times in a paper’s abstract, would this be counted multiple times even though it is only one publication?

arw36 · March 9, 2017, 6:48pm

Also, I am a bit concerned about the specificity of Crossref. For instance, searches for Poa annua and “Poa annua” output the same counts, and they pick up matches for just Poa or just annua, rather than only items with both Poa and annua.

sckott · March 10, 2017, 2:27am

can you give an example?

which of these do you really want?

See https://github.com/CrossRef/rest-api-doc/blob/master/rest_api.md#parameters which for the query parameter lists

limited DisMax query terms

Which links to DisMax - Solr - Apache Software Foundation and therein to The DisMax Query Parser | Apache Solr Reference Guide 6.6

I don’t know what specifically “limited DisMax” means, but I’d just go with what’s documented in The DisMax Query Parser | Apache Solr Reference Guide 6.6

see also https://github.com/CrossRef/rest-api-doc/blob/master/rest_api.md#queries

arw36 · March 10, 2017, 6:00pm

Ideally I want the intersection of the genus AND the species to get the publication count. I think it is an issue with how Crossref does queries, using that limited DisMax method. To work around it, I modified the function you provided to run separately for “Genus species”, “Genus”, and “species”:

hostname <- as.vector(host$hHostNameIUCN)
genus <- as.vector(host$hGenus)
species <- as.vector(host$hSpecies)
species_cr_search <- function(x,y,z, ...) {
  data.frame(
    hostname = x, 
    hmatches = cr_works(query = x, limit = 0, ...)$meta$total_results,
    genus = y, 
    gmatches = cr_works(query = y, limit = 0, ...)$meta$total_results,
    species = z, 
    smatches = cr_works(query = z, limit = 0, ...)$meta$total_results,
    stringsAsFactors = FALSE
  )
}

Then I modified the output matrix to remove duplicates and publications that match only one search term rather than both.

researcheffort <- mapply(species_cr_search, x=hostname, y=genus, z=species)
researcheffort <- t(researcheffort) #flip matrix
rownames(researcheffort) <- c() # get rid of species rownames
researcheffort <- as.data.frame(researcheffort)
researcheffort$publications <- as.numeric(researcheffort$gmatches) + as.numeric(researcheffort$smatches) - as.numeric(researcheffort$hmatches)

What I end up with is a data frame with the counts for the separate search terms, plus what I believe is the actual publication count.

Maybe this is a bit roundabout, but I could not query multiple times within the cr_works function.
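The subtraction in the `publications` column is inclusion-exclusion: if the “Genus species” query returns everything matching the genus OR the species, then the number matching both is genus hits plus species hits minus that combined count. A small Python sketch of the arithmetic follows. The function name and the numbers are made up, and it rests on the assumption that the combined query is an exact OR, which may not hold for a ranked search like Crossref’s.

```python
def intersection_count(genus_hits, species_hits, either_hits):
    """Inclusion-exclusion: |G and S| = |G| + |S| - |G or S|."""
    both = genus_hits + species_hits - either_hits
    if both < 0:
        raise ValueError("counts are inconsistent with a pure OR query")
    return both

# Made-up numbers, only to show the arithmetic.
print(intersection_count(genus_hits=120, species_hits=45, either_hits=150))  # 15
```

A negative result is a sign that the assumption is broken for that species, so the sketch raises an error rather than returning it.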

sckott · March 10, 2017, 10:51pm

You can also try fulltext - there are some defaults set for crossref in particular (that you can change) - so keep that in mind

install from dev version

devtools::install_github("ropensci/fulltext")

Taxa list

library(fulltext)
taxa <- c("Acerodon celebensis", "Anoura caudifer", 
          "Anoura geoffroyi", "Antrozous pallidus")

With Entrez

(taxa1_entrez <- ft_search(query = taxa[1], from = 'entrez', limit = 100))
#> Query:
#>   [Acerodon celebensis] 
#> Found:
#>   [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 7; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0] 
#> Returned:
#>   [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 7; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0] 

(taxa3_entrez <- ft_search(query = taxa[3], from = 'entrez', limit = 100))
#> Query:
#>   [Anoura geoffroyi] 
#> Found:
#>   [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 43; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0] 
#> Returned:
#>   [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 43; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0]

With Scopus

(taxa1_scopus <- ft_search(query = taxa[1], from = 'scopus', limit = 100,
    scopusopts = list(key = Sys.getenv('ELSEVIER_SCOPUS_KEY'))))
#> Query:
#>   [Acerodon celebensis] 
#> Found:
#>   [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 10; Microsoft: 0] 
#> Returned:
#>   [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 10; Microsoft: 0] 

(taxa3_scopus <- ft_search(query = taxa[3], from = 'scopus', limit = 100,
    scopusopts = list(key = Sys.getenv('ELSEVIER_SCOPUS_KEY'))))
#> Query:
#>   [Anoura geoffroyi] 
#> Found:
#>   [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 143; Microsoft: 0] 
#> Returned:
#>   [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 100; Microsoft: 0]

arw36 · March 15, 2017, 8:45pm

Thanks @sckott. This is definitely the package I was looking for, especially because it uses rentrez to get PubMed results.

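library(rentrez)
library(dplyr)  # provides %>%, select() and bind_cols()

taxa <- c("Acerodon celebensis", "Anoura caudifer", 
          "Anoura geoffroyi", "Antrozous pallidus")
# Search PubMed for a single species name
species_pubmed_search <- function(x,...) {
  entrez_search(db= "pubmed", term = x)
}
taxa_ft <- lapply(taxa, species_pubmed_search)
# Flatten the search results into a matrix; column V2 holds the hit count
ft_effort <- stringi::stri_list2matrix(taxa_ft, byrow=TRUE) %>% as.data.frame()
ft_effort <- select(ft_effort, V2)
taxa_df <- as.data.frame(taxa)
taxa_df <- bind_cols(taxa_df, ft_effort)
# Convert via character so this works whether V2 is a factor or a character column
taxa_df$V2 <- as.numeric(as.character(taxa_df$V2))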

V2 = pubmed hits
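For comparison, a minimal Python sketch of getting the same PubMed hit count straight from the NCBI E-utilities ESearch endpoint, which is what rentrez wraps. It assumes the `db`, `term`, `retmode` and `retmax` parameters and a JSON response with the count stored as a string under `esearchresult.count`; the helper names are hypothetical and the sample payload is hand-written.

```python
import json
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_count_url(term):
    """ESearch URL that asks for JSON and no IDs (retmax=0), just the count."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": 0}
    return ESEARCH + "?" + urlencode(params)

def pubmed_count(payload):
    """The count is assumed to arrive as a string, so convert it to int."""
    return int(payload["esearchresult"]["count"])

taxa = ["Acerodon celebensis", "Anoura caudifer"]
for url in map(pubmed_count_url, taxa):
    print(url)

# Hand-written payload in the assumed shape; 7 is the Entrez hit count
# reported for Acerodon celebensis earlier in the thread.
sample = json.loads('{"esearchresult": {"count": "7", "idlist": []}}')
print(pubmed_count(sample))
```

Reading the count field directly avoids the matrix reshaping step.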

sckott · March 15, 2017, 8:54pm

@arw36 Great! glad it works for you.

I’d love any feedback on that package.
