[fulltext] Is there a way to restrict output by year of publication?


#1

I’m interested in limiting search results by year of publication. This could happen either after searching or by limiting the output of ft_get by year. I don’t see an option to extract the year using chunks or ft_chunk, except via the “history” marker, which is sometimes NA.

This isn’t completely necessary, but it would help narrow down searches to periods of interest (say, post-2010). Any thoughts or suggestions would be greatly appreciated!


#2

Thanks for your question.

I think the best way to approach that is on the search side of things, with ft_search.

Unfortunately, each data source has a different search interface, so you have to look into each one separately. (We could try to make a harmonized programmatic interface for common search parameters like dates.)
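As a rough idea of what such a harmonized interface might look like, here's a minimal sketch. `date_opts` is a hypothetical helper, not part of fulltext; it only builds the option lists for PLOS, Entrez and Crossref from one from/to pair, using the same option names as the examples in this post. It doesn't run any search itself.

```r
# Hypothetical helper (not part of fulltext): turn one date range into the
# per-source option lists used in the examples in this post.
date_opts <- function(from, to) {
  from <- as.Date(from)
  to <- as.Date(to)
  stopifnot(!is.na(from), !is.na(to), from <= to)
  list(
    # PLOS: Solr-style range filter on publication_date
    plosopts = list(
      fq = list(sprintf(
        "publication_date:[%sT00:00:00Z TO %sT00:00:00Z]",
        format(from, "%Y-%m-%d"), format(to, "%Y-%m-%d")
      ))
    ),
    # Entrez: slash-separated dates
    entrezopts = list(
      mindate = format(from, "%Y/%m/%d"),
      maxdate = format(to, "%Y/%m/%d")
    ),
    # Crossref: created-date filters
    crossrefopts = list(
      filter = list(
        from_created_date = format(from, "%Y-%m-%d"),
        until_created_date = format(to, "%Y-%m-%d")
      )
    )
  )
}

opts <- date_opts("2010-01-01", "2012-01-01")
str(opts)
```

You could then pass, for example, `entrezopts = opts$entrezopts` to ft_search. For PLOS you'd still want to add `fl` yourself if you need the publication_date column in the output.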

PLOS

res1 <- ft_search(query='climate change', from='plos', limit=500, 
  plosopts = list(
    fl = c('id','publication_date'),
    fq = list('publication_date:[2010-01-01T00:00:00Z TO 2012-01-01T00:00:00Z]')
  )
)
res1
res1$plos
summary(as.Date(res1$plos$data$publication_date))
#>        Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
#> "2010-01-05" "2010-10-25" "2011-04-28" "2011-03-23" "2011-09-17" "2011-12-29"

Entrez

res2 <- ft_search(query='climate change', from='entrez', limit=500, 
  entrezopts = list(mindate = "2010/01/01", maxdate = "2012/01/01")
)
res2
res2$entrez
summary(as.numeric(stringr::str_extract(res2$entrez$data$pubdate, "[0-9]{4}")))
#>     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    2011    2011    2011    2011    2011    2012

Crossref

res3 <- ft_search(query='climate change', from='crossref', limit=500, 
  crossrefopts = list(filter = 
    list(from_created_date = "2010-01-01", until_created_date = "2012-01-01"))
)
res3
res3$crossref
summary(as.Date(res3$crossref$data$created))
#>         Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
#> "2010-01-05" "2010-07-17" "2010-12-08" "2010-12-24" "2011-06-13" "2011-12-29"

arXiv

res4 <- ft_search(query='climate change AND submittedDate:[201001010000 TO 201201010000]', 
    from='arxiv', limit=500)
res4
res4$arxiv
summary(as.Date(res4$arxiv$data$submitted))
#>         Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
#> "2010-01-01" "2010-01-11" "2010-01-20" "2010-01-19" "2010-01-28" "2010-02-05"

bioRxiv

I’m not convinced date searching with bioRxiv actually works, though it might. Note that the earliest date in the output below (2016-10-20) falls before the requested date_from of 2017-01-01.

res5 <- ft_search(query='climate change', from='biorxiv', limit=10, 
    biorxivopts = list(date_from = "2017-01-01", date_to = "2018-01-01", verbose = TRUE)
)
res5
res5$biorxiv
summary(as.Date(res5$biorxiv$data$created))
#>         Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
#> "2016-10-20" "2017-07-02" "2017-07-22" "2017-07-03" "2017-08-08" "2017-10-10"

Microsoft Academic

See https://docs.microsoft.com/en-us/azure/cognitive-services/academic-knowledge/queryexpressionsyntax for the query expression syntax.

## you'll need the dev version of fulltext for the Microsoft source to work
## devtools::install_github("ropensci/fulltext")
res6 <- ft_search(query='Y=[2010, 2012)', from='microsoft', limit=500)
res6
res6$ma
summary(res6$ma$data$Y)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>   2010    2010    2010    2010    2011    2011

Europe PMC

res7 <- ft_search(query='climate change (FIRST_PDATE:[2010-01-01+TO+2012-01-01])', from='europmc', limit=10)
res7
res7$europmc
summary(as.numeric(res7$europmc$data$pubYear))
#>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>   2011    2011    2012    2012    2012    2012
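
On the "after searching" part of the question: once a search returns a data frame with a date or year column, you can also filter it locally. Here's a small sketch in base R. The `hits` data frame is a made-up stand-in for something like res1$plos$data, so the column name and values are assumptions for illustration only.

```r
# Toy stand-in for a search result data frame; values are invented.
hits <- data.frame(
  id = c("a", "b", "c", "d"),
  publication_date = c("2008-03-14T00:00:00Z", "2010-01-05T00:00:00Z",
                       "2011-12-29T00:00:00Z", NA),
  stringsAsFactors = FALSE
)

# Take the four-digit year from the start of the date string; NA stays NA.
hits$year <- as.integer(substr(hits$publication_date, 1, 4))

# Keep rows from 2010 onward and drop rows with a missing year.
recent <- hits[!is.na(hits$year) & hits$year >= 2010, ]
recent
```

This only helps when the source returns a usable date field, so filtering in the query itself is usually the more reliable route.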

Others