pdftools for extracting complex (e.g. text-wrapped/multiline) tables from PDFs

lizlaw · January 26, 2021, 4:54pm

Extracting a complex table from a PDF using pdftools::pdf_data(). The example uses a table that is spread over multiple pages and contains multiple (text-wrapped) lines per cell, with both left- and centre-justified cell entries.

rOpenSci package or resource used*

pdftools

What did you do?

I extracted complex tables from a PDF document, working from the word-level data returned by pdftools::pdf_data(). The example has a table spread over multiple pages, with text wrapping across multiple lines per cell.

The process is semi-automated: the user must supply some inputs, such as rules for clipping the text to the table and reference points for identifying columns and rows. The code is currently an R script (including functions). I intend to develop it into a package when I have time (or integrate it into someone else's); see the readme file in the linked repository for updates.
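To give a rough idea of the kind of workflow described above, here is a minimal sketch in base R. It is not the code from the linked repository: the function name `extract_table()` and its arguments (`y_range`, `col_breaks`, `anchor_col`) are hypothetical names chosen for illustration. It assumes the word-level data frame has the `x`, `y` and `text` columns that pdftools::pdf_data() returns for each page, and that every table row starts with an entry in one "anchor" column, so wrapped lines can be attached to the row above them. A small synthetic data frame stands in for a real page so the example runs without a PDF.

```r
# Synthetic stand-in for one page of pdftools::pdf_data() output.
# With a real file this would be something like:
#   words <- pdftools::pdf_data("my_file.pdf")[[1]]
words <- data.frame(
  x    = c(50, 90, 50, 200, 240, 200, 50, 200, 50, 200),
  y    = c(20, 20, 100, 100, 100, 112, 130, 130, 300, 300),
  text = c("Table", "1", "Sp_A", "found", "in", "forests",
           "Sp_B", "grassland", "Page", "3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all argument names are illustrative assumptions.
#   y_range    user-supplied vertical limits used to clip text to the table
#   col_breaks user-supplied x positions where each column starts
#   anchor_col column that has an entry on the first line of every row
extract_table <- function(words, y_range, col_breaks, anchor_col = 1) {
  # 1. Clip: keep only words that fall inside the table area.
  w <- words[words$y >= y_range[1] & words$y <= y_range[2], ]

  # 2. Columns: assign each word to a column from its x position.
  w$col <- findInterval(w$x, col_breaks)

  # Sort into reading order so words are pasted in the right sequence.
  w <- w[order(w$y, w$x), ]

  # 3. Rows: a new row starts at each y position where the anchor column
  #    has text; wrapped lines below it get the same row number.
  anchor_y <- sort(unique(w$y[w$col == anchor_col]))
  w$row <- findInterval(w$y, anchor_y)

  # 4. Cells: paste together all words sharing a row and column.
  cells <- tapply(w$text, list(w$row, w$col), paste, collapse = " ")
  out <- as.data.frame(cells, stringsAsFactors = FALSE)
  names(out) <- paste0("col", colnames(cells))
  out
}

tab <- extract_table(words, y_range = c(90, 200), col_breaks = c(0, 150))
print(tab)
```

For a table that continues over several pages, one option would be to run the same steps on each page's data frame and row-bind the results. Centre-justified cells whose text starts above the anchor line would need an extra rule, which this sketch does not attempt.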

URL or code snippet for your use case*

pdf2complextable

Sector

academic / industry / government / non-profit / other

Field(s) of application

Evidence synthesis, meta-analysis, data gathering in any discipline. Example is from ecology.

Comments

Feel free to suggest features via the linked GitHub repository.
