Filtering PDFs using RegEx in their body

gus-pendleton-R · June 25, 2020, 4:14pm

rOpenSci package or resource used*

pdftools

What did you do?

I used pdftools to read the text of a large number of PDFs in a folder, and then stringr::str_which() to search the body of each PDF for a regular expression of your choice. PDFs whose text matches the pattern are copied into a new folder of your choosing.

URL or code snippet for your use case*

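pdf_selector <- function(file_folder, new_folder, search_pattern) {
  library(pdftools)
  library(stringr)
  # Create the output folder (a relative path resolves against the current working directory)
  dir.create(new_folder, showWarnings = FALSE)
  # list.files() takes a regular expression, not a glob, so match on the file extension
  filenames <- list.files(file_folder, pattern = "\\.pdf$", full.names = TRUE)
  # Read each PDF and collapse its pages into one string per file
  texts <- vapply(filenames, function(f) paste(pdf_text(f), collapse = " "), character(1))
  # str_which() returns the positions of the texts that match search_pattern
  hits <- str_which(texts, search_pattern)
  # Copy the matching PDFs into new_folder
  file.copy(filenames[hits], to = new_folder)
}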

#file_folder is the path to the folder containing the PDFs you want to search, as a string
#new_folder is the name of the new folder to create for the filtered PDFs, as a string
#search_pattern is the regular expression you want to search for, as a string
#Make sure your working directory contains the PDF folder, or pass the full directory path as your file_folder argument
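
The matching step can be tried without any PDFs on hand. The sketch below is only an illustration: the file names and text strings are made-up stand-ins for what pdf_text() would return once pages are collapsed, and the search pattern is a hypothetical example. It shows that str_which() returns the positions of matching texts, while str_detect() returns one TRUE/FALSE per text; either result can be used to subset the vector of file names.

```r
library(stringr)

# Hypothetical stand-ins for the collapsed text of three PDFs
filenames <- c("a.pdf", "b.pdf", "c.pdf")
texts <- c(
  "A meta-analysis of survey data",
  "Field notes, 2019",
  "Metaanalysis methods revisited"
)

# Example pattern: optional hyphen, case-insensitive via stringr's regex() helper
search_pattern <- regex("meta-?analysis", ignore_case = TRUE)

# str_which() returns the integer positions of the matching strings
hits_index <- str_which(texts, search_pattern)

# str_detect() returns one logical value per string
hits_logical <- str_detect(texts, search_pattern)

# Both forms select the same file names
selected <- filenames[hits_logical]
print(selected)
```

Testing a pattern this way on a few known strings before running it over a whole folder is a cheap check that the regular expression matches what you intend.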

Sector

academic

Field(s) of application

social science, qualitative science, meta-analysis

Comments

Twitter handle

@AugustusPendle1
