PDF Extraction in R

I want to extract the name, age, and academic qualifications from a set of PDF résumés into a spreadsheet using R. Can this be done with the pdftools package or a similar package?

You can use pdftools (see https://github.com/ropensci/pdftools#limitations), but it only gives you the raw text, so you'll then have to parse the tables yourself.

Another rOpenSci tool is tabulizer, though it depends on Java, so it can be a pain to install depending on the system.


Hello sckott

Could you provide any source code for reading PDF files and parsing the data?

Curious if you've tried anything yet? Have you seen the rOpenSci blog post on pdftools (https://ropensci.org/blog/2016/03/01/pdftools-and-jeroen/) or the reference manual (https://cran.rstudio.com/web/packages/pdftools/pdftools.pdf)? Here's an example of getting tables out of PDFs with pdftools: http://www.brodrigues.co/blog/2018-06-10-scraping_pdfs/
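As a starting point, here is a rough sketch of the approach: read each PDF with pdftools::pdf_text(), pull labelled fields out of the text with regular expressions, and write the result to a CSV that any spreadsheet program can open. It assumes each résumé contains lines such as "Name: ...", "Age: ..." and "Qualification: ..."; real résumés vary a lot, so the patterns will almost certainly need adjusting. The helper names (extract_field, parse_resume_text, resumes_to_csv) and the example paths are made up for illustration.

```r
# Sketch only. The label patterns are assumptions about the resume layout,
# not something pdftools provides; adapt them to your files.

# Return the value following "<label>:" or "<label> -" on its own line,
# or NA if the label is not found. `label` is treated as a regex.
extract_field <- function(text, label) {
  pattern <- paste0("(?im)^[ \\t]*", label, "[ \\t]*[-:][ \\t]*(.+?)[ \\t]*$")
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  if (length(m) >= 2) m[2] else NA_character_
}

# Turn the full text of one resume into a one-row data frame.
parse_resume_text <- function(text) {
  age_raw <- extract_field(text, "Age")
  # Keep the first run of digits, so "29 years" becomes 29.
  age <- suppressWarnings(
    as.integer(sub("^\\D*(\\d+).*$", "\\1", age_raw, perl = TRUE))
  )
  data.frame(
    name          = extract_field(text, "Name"),
    age           = age,
    qualification = extract_field(text, "Qualifications?"),
    stringsAsFactors = FALSE
  )
}

# Read every PDF in a folder and write one row per file to a CSV.
resumes_to_csv <- function(pdf_dir, out_file) {
  files <- list.files(pdf_dir, pattern = "\\.pdf$",
                      full.names = TRUE, ignore.case = TRUE)
  if (length(files) == 0) stop("No PDF files found in ", pdf_dir)
  rows <- lapply(files, function(f) {
    # pdf_text() returns one string per page; join them into one document.
    text <- paste(pdftools::pdf_text(f), collapse = "\n")
    data.frame(file = basename(f), parse_resume_text(text),
               stringsAsFactors = FALSE)
  })
  result <- do.call(rbind, rows)
  write.csv(result, out_file, row.names = FALSE)
  invisible(result)
}

# Example call (hypothetical folder and file names):
# resumes_to_csv("resumes", "resumes.csv")
```

Fields that cannot be found come back as NA, which makes it easy to spot the résumés that need a different pattern or manual entry. Scanned PDFs contain images rather than text, so pdf_text() will return little or nothing for those.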