pdftools for extracting complex (e.g. text-wrapped/multiline) tables from pdfs

Extracting a complex table from pdf using pdftools::pdf_data. Example uses a table spread over multiple pages, and containing multiple(text-wrapped) lines per cell, and left and centre justified cell entries.

rOpenSci package or resource used*

pdftools

What did you do?

Extraction of complex tables from a pdf document, based on data extracted by pdftools::pdf_data(). Example has a table spread over multiple pages, and text wrapping across multiple lines per cell.

Process is semi-automated: requires user to input, e.g. rules to clip text to the table, and identify useful points for identification of columns and rows. Code currently developed as an R script (including functions) with the intention to develop further into a package when I have time (or integrated into someone else’s), see the readme file in the linked repository for updates.

URL or code snippet for your use case*

pdf2complextable

Sector

academic / industry / government / non-profit / other

Field(s) of application

Evidence synthesis, meta-analysis, data gathering in any discipline. Example is from ecology.

Comments

Feel free to suggest features on the github link.

3 Likes