Transform article XML into ft_data

Hello

Let’s say I have a bunch of XML files returned from the Scopus API and I want to harness the capabilities of the fulltext R package, namely the ft_chunks function.

Is there a way I can transform my XML into an ft_data object so as to use ft_collect() %>% ft_chunks()?

Disclaimer: the base assumption is that ft_get is not able to retrieve the XML files in question :)

Kind regards

good question!

do you specifically want the articles in the ft_data object, or is that only so you can use ft_chunks?

I’ve been working on https://github.com/ropensci/pubchunks - it pulls ft_chunks and the related tools out of fulltext so they can be used on their own (fulltext will use them as well)

it’s not on CRAN yet, but please do try it. here’s an example:

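# install the development version from GitHub (needs the remotes package)
remotes::install_github("ropensci/pubchunks")
library(pubchunks)

# paths to two example XML articles bundled with the package
x <- system.file("examples/10_1016_0021_8928_59_90156_x.xml", 
  package = "pubchunks")
y <- system.file("examples/10_1016_s1569_1993_15_30039_4.xml", 
  package = "pubchunks")
z <- list(x, y)

# extract one section, or several at once, from every file in the list
pub_chunks(z, "abstract")
pub_chunks(z, "title")
pub_chunks(z, "authors")
pub_chunks(z, c("abstract", "title", "refs"))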

the general idea follows from what ft_chunks was doing, but the aim is to make it more general and to add more parsers for more publishers and section types
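
for your own files, something like the following might be a starting point. it is only a rough sketch: the directory name scopus_xml is hypothetical, and it assumes pub_chunks accepts a list of file paths in the same way as z in the example above and can work out the publisher from your XML

```r
# Hypothetical directory holding the XML files saved from the Scopus API
xml_dir <- "scopus_xml"

# Collect the full paths of all .xml files in that directory
# (returns an empty character vector if nothing is found)
xml_files <- list.files(xml_dir, pattern = "\\.xml$", full.names = TRUE)

# Only call pubchunks if there are files and the package is installed
if (length(xml_files) > 0 && requireNamespace("pubchunks", quietly = TRUE)) {
  # A list of file paths, the same shape as `z` in the example above
  chunks <- pubchunks::pub_chunks(as.list(xml_files), c("title", "abstract"))
  print(chunks)
}
```

this skips the ft_data object entirely, which should be fine if ft_chunks was the only reason you wanted one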

let me know if it works