tabulizer for parsing block-text from .pdf

leungi · February 1, 2020, 7:41pm

rOpenSci package or resource used*

URL or code snippet for your use case*

Goal: extract specific blocks of data from different sections of a .pdf

Strategy: use a split-apply-combine approach: locate_areas() to get the box coordinates of the sections of interest, then extract_tables() to get the data within each section.
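A minimal sketch of this strategy, assuming each section is described by one box in c(top, left, bottom, right) order, which is the order locate_areas() reports. The `areas` list and its `body` box are hypothetical; only the `header` coordinates appear in this post. The extraction runs only if {tabulizer} and the file are available.

```r
# Section boxes in c(top, left, bottom, right) order (PDF points).
# "header" is taken from the post; "body" is a made-up box for illustration.
areas <- list(
  header = c(74, 88, 128, 522),
  body   = c(140, 88, 400, 522)
)

file <- "./ropensci_white.pdf"

# Guarded so the sketch does nothing when the package or the file is missing.
if (requireNamespace("tabulizer", quietly = TRUE) && file.exists(file)) {
  # apply: one extract_tables() call per section box
  sections <- lapply(areas, function(box) {
    tabulizer::extract_tables(file, pages = 1, area = list(box), guess = FALSE)
  })
  # combine: a named list with one extraction result per section
  str(sections, max.level = 2)
}
```

Keeping the boxes in a named list means each extraction result carries its section name, which makes the later munging step easier to write per section.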

locate_areas() demo
[Image: tabulizer-demo]

Code snippet

# |- lib ----
library(tabulizer) # locate_areas(), extract_tables()
library(magrittr)  # %>%

# |- fxn ----
# data munge function for map() later
CleanHeader <- function(tbl) {
  tbl %>%
    tidyr::pivot_longer(
      cols = -idx
    ) %>%
    dplyr::group_by(name) %>%
    dplyr::filter(value != "") %>%
    dplyr::summarise(value = paste0(value, collapse = ":")) %>%
    tidyr::separate(value, c("cat", "data"), sep = ":", extra = "merge") %>%
    # remove the first remaining ":" from data
    dplyr::mutate(data = stringr::str_remove(data, ":")) %>%
    dplyr::select(cat, data)
}

# |- data ----
file <- "./ropensci_white.pdf"

# get the box coordinates via interactive selection; this info is then used in extract_tables() area args
locate_areas(file, pages = 1)

# extract data
header_raw <- extract_tables(file, pages = 1,
                             area = list(c(74, 88, 128, 522)),
                             guess = FALSE)

# data munge; data/application specific
header_raw[[1]] %>% 
  tibble::as_tibble() %>% 
  janitor::remove_empty("cols") %>% 
  dplyr::mutate(idx = cumsum(stringr::str_detect(V1, "Section Marker"))) %>% 
  dplyr::group_split(idx) %>%
  purrr::map_dfr(CleanHeader)
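To make the munging step concrete, here is a base-R sketch of the same split-apply-combine idea on a made-up matrix. It is a simplification, not the pipeline above: it assumes every column in a section holds a label ending in ":" followed by its value, and the `mock` matrix, `clean_section()`, and all cell values are invented for illustration.

```r
# Made-up stand-in for header_raw[[1]] (a character matrix); all values invented.
mock <- matrix(
  c("Section Marker 1:", "Operator:",
    "A-01",              "Acme",
    "Section Marker 2:", "Operator:",
    "B-07",              "Bolt"),
  ncol = 2, byrow = TRUE
)

# split: a new section starts at every row whose first column holds the marker
idx <- cumsum(grepl("Section Marker", mock[, 1], fixed = TRUE))
sections <- split(as.data.frame(mock, stringsAsFactors = FALSE), idx)

# apply: per column, the first non-empty cell is the label, the rest is the value
clean_section <- function(sec) {
  rows <- lapply(sec, function(col) {
    cells <- col[col != ""]
    data.frame(
      cat  = sub(":$", "", cells[1]),            # label without trailing ":"
      data = paste(cells[-1], collapse = ":"),   # remaining cells joined
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, unname(rows))
}

# combine: stack the cleaned sections into one data frame
result <- do.call(rbind, unname(lapply(sections, clean_section)))
rownames(result) <- NULL
print(result)
```

Each section contributes one row per column, so this mock input yields four label/value pairs.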

Image

Sample .pdf where coloured boxes represent sections of interest to extract data from.

See the follow-up post below for the picture (new users may only post one image per post).

Sector

Energy.

Field(s) of application

Energy.

What did you do?

Used {tabulizer} from rOpenSci to extract data of interest from a .pdf, which saves time and avoids the data quality issues that manual extraction may introduce.

Comments

I {tabulizer}

Twitter handle

@urganmax

leungi · February 1, 2020, 7:42pm
