Pdftools 2.0: powerful pdf text extraction tools

jeroenooms · December 14, 2018, 9:54am

A new version of pdftools has been released to CRAN. Go get it while it’s hot:

install.packages("pdftools")

This version has two major improvements: low-level text extraction and better handling of text encodings. Read more: https://ropensci.org/technotes/2018/12/14/pdftools-20/

michael.mastroianni · March 8, 2019, 2:38pm

This is a wonderful R package, thank you!

One question: can anyone tell me precisely what a “space” column value of TRUE or FALSE means in the output of the pdf_data function? I haven’t been able to find any information on this by searching for poppler, pdftools, etc. …

jeroenooms · March 8, 2019, 4:33pm

Actually, I’m not entirely sure. From Google I found:

hasSpaceAfter() will tell you the end of line when returning False.

So that may be it.

michael.mastroianni · March 8, 2019, 5:03pm

Thank you. I thought that was the case: for a set of words sharing the same “y” coordinate and forming a line, the word with the maximum x value (the rightmost word) would have space == FALSE. But I do get exceptions where words with a common y value have more than one FALSE in “space”. That leads me to think either that a shared y coordinate cannot strictly be treated as a “line”, or that the “space” logical value signifies something more subtle.

I’ll search for “hasSpaceAfter” for more information, thank you.
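
The check described above can be sketched in base R. This is only an illustration on invented data: the data frame below mimics the columns pdf_data() returns for one page (width, height, x, y, space, text), and the values are made up so that one y value carries two words with space == FALSE. It does not settle what poppler means by the flag; it just shows one way to inspect it.

```r
# Toy stand-in for one page of pdf_data() output (all values invented).
words <- data.frame(
  width  = c(30, 12, 40, 45, 25, 35),
  height = 11,
  x      = c(72, 106, 122, 300, 72, 101),
  y      = c(100, 100, 100, 100, 114, 114),
  space  = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  text   = c("This", "is", "line", "Right", "second", "line"),
  stringsAsFactors = FALSE
)

# Read the page top-to-bottom, then left-to-right.
words <- words[order(words$y, words$x), ]

# Assumption: a word with space == FALSE ends a run of text,
# so a new chunk starts right after it.
chunk_id <- cumsum(c(TRUE, !head(words$space, -1)))
chunks <- tapply(words$text, chunk_id, paste, collapse = " ")

# Count FALSE flags per y value; more than one means that a shared y
# holds several separate runs (for example, side-by-side columns).
false_per_y <- tapply(!words$space, words$y, sum)
```

On this toy data, chunks holds "This is line", "Right" and "second line", and false_per_y reports 2 for y = 100. That is the kind of exception described above, and it is consistent with a y value grouping several separate runs rather than a single line.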

Eric · June 13, 2019, 12:55pm

Hi, any tips on how to transform the pdf_data() output back into the original “table-like” structure?

erica-grabowski · July 29, 2020, 3:43pm

Error with pdf_data()

item_dt <- pdf_data(pdf)[[7]]
Error in normalizePath(pdf, mustWork = TRUE) : 
  path[1]="                     Federal, State, and Local Governments
                           2017 State and Local Government Finances
                                        Technical Documentation
Individual Unit Data File (Public Use Format)
This is an ASCII fixed length text file. It contains amount for each finance item code within each
government unit for all respondents and non-respondents in the sample. This large file can be useful
for programming and database applications.
For 2017, the file name is 2017FinEstDAT_02202020modp_pu.txt and contains a standard 34-
character public-use format record layout. It is about 59 megabytes. Below is a detailed record
layout for the file.

This happens with every page. What does it mean? Thanks!

sckott · July 29, 2020, 4:29pm

Sorry for the delay. Can you be more specific? What do you mean by the original table-like structure? Can you give an example?

sckott · July 29, 2020, 4:31pm

pdf_data() expects a file path or a raw vector. It looks like you passed in the text of the document instead; that is, your variable pdf probably holds extracted page text rather than a path to the file, correct? Try passing a file path instead.
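
A quick way to catch this mistake before calling pdf_data() is sketched below. The helper is_pdf_input() is hypothetical (it is not part of pdftools), it only covers local file paths and raw vectors, and "report.pdf" in the commented usage is a placeholder file name.

```r
# Hypothetical helper, not part of pdftools: TRUE if `x` is a raw vector
# or a single string naming an existing local file.
is_pdf_input <- function(x) {
  if (is.raw(x)) {
    return(TRUE)  # raw vector assumed to hold the PDF bytes
  }
  is.character(x) && length(x) == 1 && !is.na(x) && file.exists(x)
}

# Extracted page text is a character string, but it is not a file path.
page_text <- "Federal, State, and Local Governments\n2017 State and Local Government Finances"
text_ok <- is_pdf_input(page_text)

# Assumed usage (needs pdftools and a real file; the name is a placeholder):
# path <- "report.pdf"
# if (is_pdf_input(path)) item_dt <- pdftools::pdf_data(path)[[7]]
```

Here text_ok is FALSE, which matches the normalizePath() error in the post above: the page text was being treated as a path.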

lizlaw · January 25, 2021, 7:47am

Using pdftools::pdf_data, I’ve written a short script with a bunch of functions to help semi-automate extraction of complex tables (in my case tables with multiple lines per cell, spread over multiple pdf pages). The same process should work for any table. It is currently publicly available on my GitHub: lizlaw/pdf2complextable (script to extract a complex table, here containing multiple lines per cell, from a pdf).
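
For readers wondering what such a reconstruction can look like, here is a minimal base R sketch. It is not the code from the repository above. It assumes that words on the same table row share an exact y value and that the column boundaries (col_breaks) are known in advance. The data is invented but uses the x, y and text columns that pdf_data() returns. Cells that wrap over several lines would need an extra step to merge neighbouring y values.

```r
# Toy stand-in for pdf_data() output of a two-column table (values invented).
words <- data.frame(
  x    = c(50, 200, 50, 80, 200, 50, 200),
  y    = c(100, 100, 120, 120, 120, 140, 140),
  text = c("Item", "Amount", "Property", "tax", "120", "Fees", "45"),
  stringsAsFactors = FALSE
)

# Assumed column boundaries in page units: column 1 is x <= 150, column 2 the rest.
col_breaks <- c(0, 150, Inf)
words$col <- cut(words$x, breaks = col_breaks, labels = FALSE)

# Sort so that words inside a cell are pasted in reading order.
words <- words[order(words$y, words$x), ]

# One cell per (row, column) pair; words in the same cell are joined by spaces.
tbl <- tapply(words$text, list(row = words$y, col = words$col),
              paste, collapse = " ")
```

The result tbl is a character matrix with one row per y value and one column per x bin. Any cell without words comes out as NA.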

steffilazerte · January 25, 2021, 6:30pm

Hi @lizlaw!

(rOpenSci Community Assistant here) Cool use of the pdftools package by Jeroen Ooms!

Would you consider adding this use case (description and code snippet or link to code/post) to the use case forum?

discuss.ropensci.org/c/usecases/

There’s a template to help & we tweet to help share applications of rOpenSci pkgs!

lizlaw · January 26, 2021, 4:55pm

Done - thanks for the suggestion!

luceyda · May 7, 2021, 3:15pm

When I first saw pdftools I loved it, but had no idea how to use it. I should have posted this solution earlier; it uses {pdftools} to locate exact coordinates for extracting tables with {tabulizer}. Better late than never: Tabulizer and pdftools Together as Super-powers - Part 2 - Redwall Analytics

SivuyileNzimeni · December 3, 2021, 8:22am

This is not an issue or a suggestion. I just wanted to say this package has saved me hours of work. Thank you for all the effort. It really makes a difference.

stefanie · December 5, 2021, 11:37pm

@SivuyileNzimeni Thank you so much for taking the time to share your appreciation!
