Pdftools 2.0: powerful pdf text extraction tools

A new version of pdftools has been released to CRAN. Go get it while it’s hot:

install.packages("pdftools")

This version has two major improvements: low-level text extraction and better encoding handling. Read more: https://ropensci.org/technotes/2018/12/14/pdftools-20/

This is a wonderful R package, thank you!

One question, can anyone tell me what the “space” column value of TRUE or FALSE means precisely, when using the pdf_data function? I haven’t been able to locate any information on this searching on poppler, pdftools, etc. …

Actually I’m not entirely sure. From Google I found:

hasSpaceAfter() will tell you the end of line when returning False.

So that may be it.

Thank you. I thought that this was the case: namely, that for a set of rows sharing the same “y” coordinate and so forming a line, the row with the maximum x value (the rightmost word) would have space == FALSE. But I do get exceptions where rows with a common y value have more than one FALSE value for “space”. That leads me to think either that the y-coordinate value cannot strictly be thought of as a “line”, or that the “space” logical value signifies something more subtle.

I’ll search “hasSpaceAfter” for more information, thank you!
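For anyone wanting to test the reading discussed above, here is a minimal sketch in base R. It assumes that “space” means “this word is followed by a space” and that rows sharing a y value form one line; neither is confirmed in this thread. The data frame is invented mock data with the same columns as one page of pdf_data() output, so it runs without a PDF.

```r
# Mock data shaped like one page of pdf_data() output (all values invented).
words <- data.frame(
  width  = c(30, 20, 40, 25, 35),
  height = c(11, 11, 11, 11, 11),
  x      = c(72, 106, 130, 72, 101),
  y      = c(100, 100, 100, 114, 114),
  space  = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  text   = c("Federal,", "State,", "Local", "Technical", "Documentation"),
  stringsAsFactors = FALSE
)

# Order words top-to-bottom, then left-to-right.
words <- words[order(words$y, words$x), ]

# Assumption: append a space after a word only where space is TRUE.
pieces <- paste0(words$text, ifelse(words$space, " ", ""))

# Treat each distinct y value as one line and glue its pieces together.
lines <- unname(as.vector(tapply(pieces, words$y, paste, collapse = "")))
print(lines)

# Count the FALSE values per y; more than one would mark the exceptions
# described above (for example, several columns sharing a baseline).
false_per_line <- tapply(!words$space, words$y, sum)
print(false_per_line)
```

On a real page, replacing the mock data frame with pdf_data(path)[[1]] and inspecting the rows where false_per_line is greater than 1 should show which words end a run of text without ending the visual line.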

Hi, any tips on how to transform the pdf_data() output into the original “table-like” structure?

Error with pdf_data()

item_dt <- pdf_data(pdf)[[7]]
Error in normalizePath(pdf, mustWork = TRUE) : 
  path[1]="                     Federal, State, and Local Governments
                           2017 State and Local Government Finances
                                        Technical Documentation
Individual Unit Data File (Public Use Format)
This is an ASCII fixed length text file. It contains amount for each finance item code within each
government unit for all respondents and non-respondents in the sample. This large file can be useful
for programming and database applications.
For 2017, the file name is 2017FinEstDAT_02202020modp_pu.txt and contains a standard 34-
character public-use format record layout. It is about 59 megabytes. Below is a detailed record
layout for the file.

This happens with every page, what does it mean? Thanks!

Sorry for the delay. Can you be more specific? What do you mean by the original table like structure? Can you give an example?

pdf_data expects a file path or raw vector. It looks like you passed in the text of the document instead; that is, your variable pdf is probably a character string holding page text, correct? Try passing a file path instead.
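A small sketch of the difference, under stated assumptions: is_pdf_input is a hypothetical helper written for this post (not part of pdftools), the file name is made up, and the guess that the variable held page text comes only from the error message above. The helper checks only the two input types named in this reply.

```r
# Hypothetical helper (not part of pdftools): TRUE if x is a raw vector
# or a single character string naming a file that exists on disk.
is_pdf_input <- function(x) {
  is.raw(x) || (is.character(x) && length(x) == 1 && file.exists(x))
}

pdf_file  <- "technical_documentation.pdf"  # hypothetical path
page_text <- "Federal, State, and Local Governments\nTechnical Documentation"

# Page text is a character string, but it is not a path to a file.
is_pdf_input(page_text)  # FALSE

# Call pdf_data() only when pdftools is installed and the path is valid.
if (requireNamespace("pdftools", quietly = TRUE) && is_pdf_input(pdf_file)) {
  item_dt <- pdftools::pdf_data(pdf_file)[[7]]  # seventh page, as in the question
}
```

If the check returns FALSE for your variable, look at where it was assigned; the path to the PDF is what should be passed to pdf_data().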

Using pdftools::pdf_data, I’ve written a short script with a bunch of functions to help semi-automate extraction of complex tables (in my case tables with multiple lines per cell, spread over multiple pdf pages). The same process should work for any table. It is currently publicly available on my GitHub: lizlaw/pdf2complextable (script to extract a complex table, here containing multiple lines per cell, from a pdf).

Hi @lizlaw!

(rOpenSci Community Assistant here) Cool use of the pdftools package by Jeroen Ooms!

Would you consider adding this use case (description and code snippet or link to code/post) to the use case forum?

discuss.ropensci.org/c/usecases/

There’s a template to help & we tweet to help share applications of rOpenSci pkgs!

Done - thanks for the suggestion!

When I first saw pdftools I loved it, but had no idea how to use it. I should have posted this solution earlier, which uses {pdftools} to locate exact coordinates and then extracts tables with {tabulizer}. Better late than never: Tabulizer and pdftools Together as Super-powers - Part 2 - Redwall Analytics

This is not an issue or a suggestion. I just wanted to say this package has saved me hours of work. Thank you for all the effort. It really makes a difference.

@SivuyileNzimeni Thank you so much for taking the time to share your appreciation!