data sources
Here’s a breakdown of what I know about the data sources you’re using:
- PubMed/Medline: as far as I know, Medline is the same as PubMed (see the heading on this page: PubMed). These are available in the fulltext package under the name entrez (NCBI’s name for their web service that allows access to PubMed/Medline)
- CINAHL: there’s currently no R package that gives access to this. I’ve asked my university librarians about this.
- Scopus: Available in the fulltext package
authentication
PubMed is open, no authentication required.
Scopus, on the other hand, requires jumping through some hoops. From the ?fulltext-package manual page:
Scopus requires two things: an API key and your institution must have access. For the API key, go to Elsevier Developer Portal, register for an account, then when you’re in your account, create an API key. Pass in as variable key to scopusopts, or store your key under the name ELSEVIER_SCOPUS_KEY as an environment variable in .Renviron, and we’ll read it in for you. See ?Startup in R for help. For the institution access go to a browser and see if you have access to the journal(s) you want. If you don’t have access in a browser you probably won’t have access via this package. If you aren’t physically at your institution you will likely need to be on a VPN or similar so that your IP address is in the range that the two publishers are accepting for that institution.
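As a rough sketch of the two options the manual page describes, the snippet below reads the key from the environment and passes it explicitly through scopusopts. It assumes you have already added a line like ELSEVIER_SCOPUS_KEY=<your key> to .Renviron and restarted R; the query term and the res_scopus name are just placeholders. If the key is stored in the environment, the explicit scopusopts argument should not be needed at all.

```r
library(fulltext)

# Read the key stored in .Renviron; returns "" if it is not set
scopus_key <- Sys.getenv("ELSEVIER_SCOPUS_KEY")

if (nzchar(scopus_key)) {
  # Pass the key explicitly via scopusopts (the manual page calls the variable `key`)
  res_scopus <- ft_search(
    query = "ecology",
    from = "scopus",
    scopusopts = list(key = scopus_key)
  )
  print(res_scopus)
} else {
  message("ELSEVIER_SCOPUS_KEY is not set; add it to .Renviron and restart R")
}
```

This still only works if your IP address is one that Elsevier associates with your institution, so run it on campus or over the VPN.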
searching
It’s best to start with searching. The examples here use entrez, but the same applies to Scopus (which additionally requires the authentication above):
res <- ft_search(query='ecology', from='entrez')
res
#> Query:
#> [ecology]
#> Found:
#> [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 180481; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0]
#> Returned:
#> [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 10; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0]
You can index into the Entrez results; the underlying data.frame is in res$entrez$data:
res$entrez
#> Query: [ecology]
#> Records found, returned: [180481, 10]
#>
#> uid pubdate epubdate printpubdate source volume issue pages
#> 1 6783310 2019 Jul 9 2019 Jul 9 2019 Oct 1 Ecology 100 10 e02794
#> 2 6783302 2019 Apr 3 2019 Apr 3 Sci Transl Med 11 486 eaav0537
#> 3 6781247 2018 2018 Environ Model Softw 109 93-103
#> 4 6781240 2017 2017 Estuaries Coast 41 2 404-420
#> 5 6781235 2018 2018 Hydrobiologia 818 1 71-86
#> 6 6781228 2018 2018 J Am Water Works Assoc 110 11 64-68
#> 7 6773173 2000 Nov 15 2000 Nov 15 J Neurosci 20 22 8533-8541
#> 8 6779586 2015 Dec 24 2015 Dec 24 2016 Feb J Exp Zool A Ecol Genet Physiol 325 2 106-115
#> 9 6778798 2019 Aug 8 2019 Aug 8 G3 (Bethesda) 9 10 3181-3199
#> 10 6778791 2019 Aug 7 2019 Aug 7 G3 (Bethesda) 9 10 3249-3262
#> Variables not shown: fulljournalname (chr), sortdate (chr), pmclivedate (chr), pmid (chr), doi (chr), pmcid (chr), mid
#> (chr), title (chr), authors (chr)
Then you can go to ft_get:
# get articles, writes the XML files to your computer
out <- ft_get(res)
# ft_collect gathers and parses the XML and puts it in the output
out <- ft_collect(out)
# then access the XML full text, e.g., for one article
out$entrez$data$data$`6783310`
You can use another package, pubchunks, to help pull out the parts of the articles you want from the XML, unless you are comfortable dealing with XML yourself.
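A minimal sketch of the pubchunks step, assuming the pub_chunks() and pub_tabularize() functions behave as in the pubchunks README and that the Entrez XML is one of the formats pubchunks recognises. The choice of sections here is only an illustration.

```r
library(fulltext)
library(pubchunks)

# Same search and download steps as above
res <- ft_search(query = "ecology", from = "entrez")
out <- ft_collect(ft_get(res))

# Pull selected sections out of each article's XML
chunks <- pub_chunks(out, sections = c("title", "abstract"))

# Flatten the extracted sections into data.frames, one row per article
pub_tabularize(chunks)
```

If a section comes back empty for some articles, that usually means the XML for that article doesn't mark up that section in a way pubchunks can find.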
There’s also a fulltext function specifically for abstracts, ft_abstract, but it doesn’t support Entrez; see ?ft_abstract
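One possible workaround, sketched here under the assumption that "crossref" is among the sources ft_abstract accepts in your installed version (check ?ft_abstract), is to take the DOIs from the Entrez search and ask a different source for the abstracts. The abs_res name is a placeholder.

```r
library(fulltext)

res <- ft_search(query = "ecology", from = "entrez")

# Keep only records that actually have a DOI
dois <- res$entrez$data$doi
dois <- dois[!is.na(dois) & nzchar(dois)]

# Request abstracts by DOI from Crossref instead of Entrez
abs_res <- ft_abstract(x = dois, from = "crossref")
abs_res
```

Not every publisher deposits abstracts with Crossref, so expect gaps in what comes back.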
citations
For citations you can use rcrossref:
# pass in the DOIs from the previous search output
# you can request various citation formats, including bibtex
z <- cr_cn(res$entrez$data$doi, format = "bibtex")
z[[1]]
#> [1] "@article{Loreau_2019,\n\tdoi = {10.1002/ecy.2794},\n\turl = {https://doi.org/10.1002%2Fecy.2794},\n\tyear = 2019,\n\tmonth = {jul},\n\tpublisher = {Wiley},\n\tvolume = {100},\n\tnumber = {10},\n\tauthor = {Michel Loreau and Andy Hector},\n\ttitle = {Not even wrong: Comment by Loreau and Hector},\n\tjournal = {Ecology}\n}"