pdftools 2.0: powerful PDF text extraction tools

A new version of pdftools has been released to CRAN. Go get it while it’s hot:
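The install command seems to have dropped out of the post at this point. For a CRAN package it would presumably be the standard call below (an assumption based on the announcement, not recovered text):

```r
# Standard CRAN install; assumed, since the original snippet is missing here.
install.packages("pdftools")
```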


This version has two major improvements: low level text extraction and encoding improvements. Read more: https://ropensci.org/technotes/2018/12/14/pdftools-20/


This is a wonderful R package, thank you!

One question: can anyone tell me precisely what a value of TRUE or FALSE in the “space” column means when using the pdf_data function? I haven’t been able to find any information on this by searching for poppler, pdftools, etc. …

Actually I’m not entirely sure. From google I found:

hasSpaceAfter() will tell you the end of line when returning False.

So that may be it.

Thank you. I thought this was the case: for a set of rows sharing a common “y” coordinate and so forming a line, the row with the maximum x value (the rightmost word) would have space == FALSE. But I do get exceptions where rows with a common y value have more than one FALSE value for “space”. This leads me to think either that a y-coordinate value cannot strictly be treated as a “line”, or that the “space” logical value signifies something more subtle.
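For anyone who wants to check this on their own data, here is a minimal sketch. It uses a hand-built data frame as a stand-in for one page of pdf_data() output: only the column names (width, height, x, y, space, text) follow the real function, and the values are made up. It counts how many words per y value have space == FALSE, then splits the words into runs that end at each FALSE. Treating FALSE as "end of a run of text" is just one plausible reading, not confirmed semantics.

```r
# Hypothetical stand-in for one page of pdf_data() output (values invented).
words <- data.frame(
  width  = c(30L, 20L, 40L, 25L, 35L),
  height = rep(11L, 5),
  x      = c(50L, 85L, 300L, 50L, 80L),
  y      = c(100L, 100L, 100L, 115L, 115L),
  space  = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  text   = c("Total", "cost", "42.00", "Net", "amount"),
  stringsAsFactors = FALSE
)

# Put the words into reading order: top to bottom, then left to right.
words <- words[order(words$y, words$x), ]

# Count the words per y value that are NOT followed by a space.
# More than one per y value would match the exceptions described above.
breaks_per_y <- tapply(!words$space, words$y, sum)

# Assumption: a word with space == FALSE ends a run of text, so a new run
# starts at the first word and after every word whose space is FALSE.
run_id <- cumsum(c(TRUE, !head(words$space, -1)))
runs <- as.vector(tapply(words$text, run_id, paste, collapse = " "))

print(breaks_per_y)
print(runs)
```

In this made-up page the first y value holds two runs, which is what you would see if a line contains separate cells or columns rather than one continuous sentence.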

I’ll search “hasSpaceAfter” for more information, thank you :)

Hi, any tips on how to transform the pdf_data() output back into the original “table-like” structure?
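One rough approach, sketched here under stated assumptions rather than as a pdftools feature: group words into rows by their y coordinate (with a small tolerance), assign columns from x cut points, and fill a character matrix. The data frame is a made-up stand-in for pdf_data() output, and `tol` and `col_breaks` are hypothetical values you would have to pick by inspecting your own PDF.

```r
# Hypothetical stand-in for pdf_data() output (only x, y and text are needed).
words <- data.frame(
  x    = c(50L, 200L, 350L, 50L, 200L, 350L),
  y    = c(100L, 100L, 101L, 115L, 115L, 115L),
  text = c("Item", "Qty", "Price", "Widget", "3", "9.99"),
  stringsAsFactors = FALSE
)

# Assumption: consecutive words whose y values differ by at most `tol`
# belong to the same table row.
tol <- 2
words <- words[order(words$y, words$x), ]
row_id <- cumsum(c(TRUE, diff(words$y) > tol))

# Assumption: column boundaries are known x positions chosen by hand.
col_breaks <- c(0, 150, 300, Inf)
col_id <- findInterval(words$x, col_breaks)

# Fill a character matrix; words landing in the same cell are joined
# with a space in reading order.
tab <- matrix("", nrow = max(row_id), ncol = length(col_breaks) - 1)
for (i in seq_len(nrow(words))) {
  cell <- tab[row_id[i], col_id[i]]
  tab[row_id[i], col_id[i]] <- trimws(paste(cell, words$text[i]))
}

print(tab)
```

Fixed cut points are the simplest choice. For PDFs where the column positions vary from page to page, you would need to derive the breaks from the data instead, for example from gaps in the sorted x values.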