Pdftools 2.0: powerful pdf text extraction tools - Blog

March 2019

michael.mastroianni

This is a wonderful R package, thank you!

One question, can anyone tell me what the “space” column value of TRUE or FALSE means precisely, when using the pdf_data function? I haven’t been able to locate any information on this searching on poppler, pdftools, etc. …

March 2019

jeroenooms

Actually I’m not entirely sure. From google I found:

hasSpaceAfter() will tell you the end of line when returning False.

So that may be it.

March 2019 ▶ jeroenooms

michael.mastroianni

Thank you. I thought that this was the case (namely for a set of common “y” coordinate-valued rows forming a line, the maximum x value (rightmost word) would have space == FALSE). But I do get exceptions where common y-values have more than one FALSE value for “space”. Which leads me to think that the y-coordinate value cannot be thought of as a “line” strictly – or the “space” logical value signifies something more subtle?

I’ll search “hasSpaceAfter” for more information, thank you
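One way to look into those exceptions is to count, for each y value, how many words have space == FALSE. The sketch below uses a small hand-made data frame as a hypothetical stand-in for one page of pdf_data() output (the x, y, space and text columns match the real output; the values are invented). It only lists the y values worth inspecting and does not settle what "space" means.

```r
# Hypothetical stand-in for one page of pdf_data() output. The real function
# returns one table per page with columns width, height, x, y, space, text.
words <- data.frame(
  x     = c(72, 110, 150, 300, 340, 72, 120),
  y     = c(100, 100, 100, 100, 100, 115, 115),
  space = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  text  = c("Total", "revenue", "2017", "Local", "units", "Item", "code"),
  stringsAsFactors = FALSE
)

# Count the words with space == FALSE on each y value.
no_space_per_y <- tapply(!words$space, words$y, sum)

# y values where more than one word lacks a trailing space.
multi <- names(no_space_per_y)[no_space_per_y > 1]

# Show those words left to right, so gaps in x can be checked by eye.
suspects <- words[words$y %in% as.numeric(multi), ]
print(suspects[order(suspects$y, suspects$x), ])
```

On a real page, replacing `words` with `pdf_data(path)[[page]]` should give the same kind of listing; whether the extra FALSE values line up with column gaps or something else is left to inspection.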

June 2019

Eric

Hi, any tips on how to transform the pdf_data() output into the original “table-like” structure?

July 2020

erica-grabowski

Error with pdf_data()

item_dt <- pdf_data(pdf)[[7]]
Error in normalizePath(pdf, mustWork = TRUE) : 
  path[1]="                     Federal, State, and Local Governments
                           2017 State and Local Government Finances
                                        Technical Documentation
Individual Unit Data File (Public Use Format)
This is an ASCII fixed length text file. It contains amount for each finance item code within each
government unit for all respondents and non-respondents in the sample. This large file can be useful
for programming and database applications.
For 2017, the file name is 2017FinEstDAT_02202020modp_pu.txt and contains a standard 34-
character public-use format record layout. It is about 59 megabytes. Below is a detailed record
layout for the file.

This happens with every page, what does it mean? Thanks!

July 2020 ▶ Eric

sckott Leader

Sorry for the delay. Can you be more specific? What do you mean by the original table like structure? Can you give an example?
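If "table-like" simply means one row of text per printed line, a minimal sketch could group the words by their y value and order them by x. The data frame below is a hypothetical stand-in for one page of pdf_data() output, and the sketch assumes that words on the same line share exactly the same y, which may not hold for every PDF (rounding y might be needed).

```r
# Hypothetical stand-in for one page of pdf_data() output (values invented).
words <- data.frame(
  x    = c(150, 72, 72, 150, 230, 230),
  y    = c(100, 100, 115, 115, 100, 115),
  text = c("Code", "Item", "Taxes", "T01", "Amount", "1200"),
  stringsAsFactors = FALSE
)

# Sort top to bottom, then left to right.
words <- words[order(words$y, words$x), ]

# Collapse the words sharing a y value into one tab-separated row.
rows <- tapply(words$text, words$y, paste, collapse = "\t")
rows <- unname(as.vector(rows))

cat(rows, sep = "\n")
```

Splitting each row into proper columns would additionally need cut points on x, which depend on the layout of the specific table.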

July 2020 ▶ erica-grabowski

sckott Leader

pdf_data expects a file path or raw vector. It looks like you probably passed in the text of the document instead, that is, your variable pdf is probably a character string holding the page contents rather than a path, correct? Try passing a file path instead.
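A small sketch of what is probably going on, assuming `pdf` holds page text (for example from pdf_text()) rather than a path. The error in the post comes from base R's normalizePath(), which can be reproduced without pdftools. The file name `technical_documentation.pdf` is hypothetical.

```r
# `pdf` here holds page text, not a file path (shortened, for illustration).
pdf <- "Federal, State, and Local Governments\n2017 State and Local Government Finances"

# normalizePath(..., mustWork = TRUE) fails because no file has that name;
# this is the call named in the error message.
res <- tryCatch(
  normalizePath(pdf, mustWork = TRUE),
  error = function(e) "not a path"
)
print(res)

# Pass the path of the PDF itself instead. Hypothetical file name.
pdf_file <- "technical_documentation.pdf"
if (file.exists(pdf_file)) {
  item_dt <- pdftools::pdf_data(pdf_file)[[7]]  # word table for page 7
}
```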

January 2021

lizlaw

Using pdftools::pdf_data, I’ve written a short script with a bunch of functions to help semi-automate extraction of complex tables (in my case tables with multiple lines per cell, spread over multiple pdf pages). The same process should work for any table. It is currently publicly available on my GitHub: lizlaw/pdf2complextable, a script to extract a complex table (here containing multiple lines per cell) from a pdf.

January 2021 ▶ lizlaw

steffilazerte

Hi @lizlaw!

(rOpenSci Community Assistant here) Cool use of the pdftools package by Jeroen Ooms!

Would you consider adding this use case (description and code snippet or link to code/post) to the use case forum?

discuss.ropensci.org/c/usecases/

There’s a template to help & we tweet to help share applications of rOpenSci pkgs!

January 2021 ▶ steffilazerte

lizlaw

Done - thanks for the suggestion!

May 2021

luceyda

I first saw pdftools and loved it, but had no idea how to use it. I should have posted this solution earlier; it uses {pdftools} to locate exact coordinates for extracting tables with {tabulizer}. Better late than never. Tabulizer and pdftools Together as Super-powers - Part 2 - Redwall Analytics

December 2021

SivuyileNzimeni

This is not an issue or a suggestion. I just wanted to say this package has saved me hours of work. Thank you for all the effort. It really makes a difference.

December 2021

stefanie

@SivuyileNzimeni Thank you so much for taking the time to share your appreciation!