Using pdftools, tabulizer, and writexl to simplify business information handling workflow

thoughtfulnz · August 27, 2019, 11:20pm

Providing a script for a workflow to liberate the contents of PDFs

Packages used

Code Snippet

They have a script which, when they hit the source button, opens a select file dialog box and they pick the pdf, then it saves a word document with the full text, and an excel file of one sheet per table, in the same directory. they also have the option of coming on some R courses I regularly run if they want to start working with the information inside of R.

library(officer)
library(pdftools)
library(tabulizer) #needs JDK installed
library(writexl)

target <- file.choose()
target_no_suffix <- gsub("\\.pdf$","", target, ignore.case = TRUE)
PDFtext <- pdf_text(target) #pdf_tools
#a character array for each page, want paragraphs for officer
PDFparas <- unlist(strsplit(PDFtext, "\n"))
PDFextracted <-  extract_tables(target) #tabulizer
# 1 character matrix per table, want dataframes for writexl
PDFtables <- lapply(extracted,as.data.frame, stringsAsFactors=FALSE)
write_xlsx(PDFtables, paste0(target_no_suffix, "_excel.xlsx"), col_names=FALSE) #writexl
#officer word making code
my_doc <- read_docx()
for (scannedpage in PDFparas){
  my_doc <- body_add_par(my_doc, scannedpage)
}
print(my_doc, target = paste0(target_no_suffix,"_word.docx"))

Sector

industry

Field(s) of application

workflows, compliance, document handling

Twitter handle

thoughtfulnz

What did you do?

The finance people where I work told me they have a workflow bottleneck with receiving statements as pdfs and being unable to free up the information in them to make checking easier as Acrobat’s save as Excel can’t really cope with the kind they are receiving and, at a bare minimum, being able to sort the information on the statements into a convenient order would save them at hundreds of hours a year (at the moment they are printing them out and then cross checking and ticking off the paper copy).

This is an example of introducing R to a previously unused area, not as a replacement for Excel, but to solve a specific workflow bottleneck. R is being their helpful friend within their present needs.

sckott · August 29, 2019, 5:43pm

Thanks for sharing this @thoughtfulnz

How often do you run into pdfs that cause errors/problems?

thoughtfulnz · August 30, 2019, 3:36am

In this particular case no significant number- pretty much everything comes through the same fax received as PDF and OCRed pathway, and because spots in the fax can be interpreted as decimal points, the output is not automatically trusted, it is for freeing up the data to make manual checking easier. If the on-page arrangement is a bit much for tabulizer, then the pdftools text extraction is a fall-back, and they still have the original PDF.

sckott · August 30, 2019, 4:57pm

That’s good to hear that there’s no big failures - sounds like a really nice solution

Topic		Replies	Views
PDF Extraction in R Blog r	3	604	June 25, 2018
Extraindo tabelas de documentos pdf em R com Tabulizer UseCases tabulizer , portuguese , português	0	1488	July 20, 2020
pdftools for extracting complex (e.g. text-wrapped/multiline) tables from pdfs UseCases r , pdftools , tidyverse	0	2178	January 26, 2021
pdftools for parsing tables from many .pdfs UseCases package , pdftools	1	1967	March 31, 2020
tabulizer for parsing block-text from .pdf UseCases package , tabulizer	1	1413	February 1, 2020