[fulltext] Can't reproduce example from chapter 9 of the manual


#1

Hi, I originally wrote a script against fulltext v0.18, but the same code breaks under v1.0.0.

For example, I can't reproduce the chunks example from the fulltext manual (https://ropensci.github.io/fulltext-book/chunks.html):

x <- ft_get('10.1371/journal.pone.0086169', from='plos')

^ this works, but when I run the next line:

x %>% ft_collect %>% ft_chunks(what="authors")

I get the following error:

Error in UseMethod("read_xml") : no applicable method for 'read_xml' applied to an object of class "NULL"

Any thoughts? I have reverted to v0.18 for now so I can keep using my old code, but it would be nice to use the most recent package version.


#2

traceback() gives the following information:

12: xml2::read_xml(q)
11: FUN(X[[i]], ...)
10: lapply(x[[i]]$data$data, function(q) {
        qparsed <- if (inherits(q, "xml_document"))
            q
        else xml2::read_xml(q)
        get_what(data = qparsed, what, names(x[i]))
    })
9: ft_chunks(., what = "authors")
8: function_list[[k]](value)
7: withVisible(function_list[[k]](value))
6: freduce(value, `_function_list`)
5: `_fseq`(`_lhs`)
4: eval(quote(`_fseq`(`_lhs`)), env, env)
3: eval(quote(`_fseq`(`_lhs`)), env, env)
2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
1: x %>% ft_collect %>% ft_chunks(what = "authors")
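Reading the trace: frame 10 hands every element of x[[i]]$data$data to xml2::read_xml(), so a NULL element fails at S3 dispatch in frame 12. Below is a minimal base-R sketch of that failure mode. parse_doc() is a made-up generic standing in for read_xml(), and the `collected` list is invented for illustration; nothing here is taken from xml2 or fulltext internals.

```r
# parse_doc() is a hypothetical generic standing in for xml2::read_xml().
parse_doc <- function(x, ...) UseMethod("parse_doc")

# Only character input has a method; there is no method for NULL.
parse_doc.character <- function(x, ...) paste("parsed:", x)

# Invented data: one element holds text, one was never filled in.
collected <- list(ok = "<article/>", broken = NULL)

# Same shape as frame 10: apply the parser to every element,
# capturing the error message instead of stopping.
results <- lapply(collected, function(q) {
  tryCatch(parse_doc(q), error = function(e) conditionMessage(e))
})

print(results$ok)
print(results$broken)
```

The second element reproduces the same kind of "no applicable method" error, which suggests the parsing step itself is fine and the input was never populated.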


#3

Thanks @maxfarrell

Can you share your sessionInfo()?


#4

It's possible ft_collect() isn't working. Can you check that the function works? For example:

x <- ft_get('10.1371/journal.pone.0086169', from='plos')
x$plos$data$data # this should be NULL
x <- ft_collect(x) 
x$plos$data$data # this should have the text of the article

Do you get the same?
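If it's easier, the same check can be scripted. This is only a rough sketch: missing_text() is a made-up helper, not part of fulltext, and the list below is a hand-built mock of the x$plos$data$data shape rather than real ft_get() output.

```r
# Hand-built mock of the list shape discussed above (not real output).
doi <- "10.1371/journal.pone.0086169"
x <- list(plos = list(data = list(
  backend = "ext",
  data = setNames(list(NULL), doi)
)))

# Hypothetical helper: names of entries whose collected text is still NULL.
missing_text <- function(src) {
  d <- src$data$data
  names(d)[vapply(d, is.null, logical(1))]
}

# Before collection (or if collection silently failed) the DOI is listed.
before <- missing_text(x$plos)
print(before)

# Simulate what a working ft_collect() would leave behind: some text.
x$plos$data$data[[doi]] <- "<article>...</article>"
after <- missing_text(x$plos)
print(after)
```

An empty result after ft_collect() would mean every article has text attached and is safe to pass on to ft_chunks().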


#5
x$plos$data$data

returns NULL when run after ft_get(), as it is supposed to, but after passing x through ft_collect() it returns

$`10.1371/journal.pone.0086169`
NULL

Here is my sessionInfo():

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] fulltext_1.0.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 pillar_1.1.0 compiler_3.4.3 plyr_1.8.4
[5] bindr_0.1 tools_3.4.3 digest_0.6.14 lubridate_1.7.1
[9] gtable_0.2.0 jsonlite_1.5 tibble_1.4.1 rcrossref_0.8.0
[13] aRxiv_0.5.16 pkgconfig_2.0.1 rlang_0.1.6 bibtex_0.4.2
[17] shiny_1.0.5 crul_0.5.0 curl_3.1 bindrcpp_0.2
[21] storr_1.1.3 dplyr_0.7.4 httr_1.3.1 stringr_1.2.0
[25] xml2_1.2.0 rappdirs_0.3.1 grid_3.4.3 glue_1.2.0
[29] R6_2.2.2 rentrez_1.1.0 XML_3.98-1.9 solrium_1.0.0
[33] hoardr_0.2.0 whisker_0.3-2 reshape2_1.4.3 ggplot2_2.2.1
[37] magrittr_1.5 scales_0.5.0 rplos_0.8.0 htmltools_0.3.6
[41] microdemic_0.2.0 assertthat_0.2.0 colorspace_1.3-2 mime_0.5
[45] xtable_1.8-2 httpuv_1.3.5 stringi_1.1.6 miniUI_0.1.1
[49] lazyeval_0.2.1 munsell_0.4.3


#6

Thanks. Can you paste in the output of x$plos after running x <- ft_get('10.1371/journal.pone.0086169', from='plos')? It should show a file path to the file on your machine.

(and can you try to put code in code blocks? See https://superuser.com/editing-help for help)


#7
x$plos
$found
[1] 1

$dois
[1] "10.1371/journal.pone.0086169"

$data
$data$backend
[1] "ext"

$data$cache_path
[1] "/home/max/.cache/R/fulltext"

$data$path
$data$path$`10.1371/journal.pone.0086169`
$data$path$`10.1371/journal.pone.0086169`$path
[1] "/home/max/.cache/R/fulltext/10_1371_journal_pone_0086169.xml"

$data$path$`10.1371/journal.pone.0086169`$id
[1] "10.1371/journal.pone.0086169"

$data$path$`10.1371/journal.pone.0086169`$type
[1] "xml"

$data$path$`10.1371/journal.pone.0086169`$error
NULL



$data$data
$data$data$`10.1371/journal.pone.0086169`
NULL



$opts
$opts$doi
[1] "10.1371/journal.pone.0086169"

$opts$type
[1] "xml"

#8

Thanks. That worked as expected. Okay, now run exactly this and paste in the output of x$plos:

x <- ft_get('10.1371/journal.pone.0086169', from='plos')
x <- ft_collect(x) 
x$plos 

#9
x <- ft_get('10.1371/journal.pone.0086169', from='plos')
x <- ft_collect(x) 
x$plos 

$found
[1] 1

$dois
[1] "10.1371/journal.pone.0086169"

$data
$data$backend
[1] "ext"

$data$cache_path
[1] "/home/max/.cache/R/fulltext"

$data$path
$data$path$`10.1371/journal.pone.0086169`
$data$path$`10.1371/journal.pone.0086169`$path
[1] "/home/max/.cache/R/fulltext/10_1371_journal_pone_0086169.xml"

$data$path$`10.1371/journal.pone.0086169`$id
[1] "10.1371/journal.pone.0086169"

$data$path$`10.1371/journal.pone.0086169`$type
[1] "xml"

$data$path$`10.1371/journal.pone.0086169`$error
NULL

$data$data
$data$data$`10.1371/journal.pone.0086169`
NULL

$opts
$opts$doi
[1] "10.1371/journal.pone.0086169"

$opts$type
[1] "xml"


#10

I think I've figured it out. It was a tiny bug, but it had a big effect.

Reinstall with devtools::install_github("ropensci/fulltext"), remember to restart the R session, then try again and let me know whether it works.


#11

Seems to have worked!

$data is no longer NULL. Thanks for the quick work on this! I'll update my code to work with v1.0 and let you know how it goes.


#12

Great, glad it worked. Will push this to CRAN soon so everyone has the fix.