Filtering PDFs by regular expression matches in their body text

rOpenSci package or resource used*

pdftools

What did you do?

I used pdftools to read the text of every PDF in a folder, and then stringr::str_which() to search the body of each PDF for a regular expression of your choice. PDFs whose text matches the pattern are copied to a new folder of your choosing.

URL or code snippet for your use case*

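pdf_selector <- function(file_folder, new_folder, search_pattern) {
  library(pdftools)
  library(stringr)
  # Create the destination folder (no warning if it already exists)
  dir.create(new_folder, showWarnings = FALSE)
  # list.files() expects a regex, not a glob, so match a literal ".pdf" suffix
  filenames <- list.files(file_folder, pattern = "\\.pdf$", full.names = TRUE, ignore.case = TRUE)
  # pdf_text() returns one string per page; collapse the pages into one string per PDF
  pdfs <- lapply(filenames, function(f) paste(pdf_text(f), collapse = " "))
  # str_which() returns integer(0) when nothing matches, so test its length
  matched <- vapply(pdfs, function(txt) length(str_which(txt, search_pattern)) > 0, logical(1))
  # Copy every matching PDF into new_folder; returns TRUE/FALSE per file
  file.copy(filenames[matched], new_folder)
}

# file_folder: path to the folder containing the PDFs you want to search, as a string
# new_folder: path of the folder to copy the matching PDFs into, as a string (created if it does not exist)
# search_pattern: the regular expression you want to search for, as a string
# Relative paths are resolved against your working directory, so either set it to the folder that contains file_folder or pass full paths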
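
To try out a pattern before running it over a whole folder, the matching step can be sketched on plain strings. This is only an illustration: the file names and page text below are made up and stand in for what pdf_text() would return, and the pattern is a hypothetical example, not part of the original use case.

```r
library(stringr)

# Hypothetical page text standing in for pdf_text() output:
# one character vector per PDF, one element per page.
pages_by_file <- list(
  "a.pdf" = c("Methods: semi-structured interviews", "Results and discussion"),
  "b.pdf" = c("A survey of 120 households"),
  "c.pdf" = c("We ran focus groups", "Interview guide in the appendix")
)

# Collapse each PDF's pages into a single string, as pdf_selector does.
bodies <- vapply(pages_by_file, paste, character(1), collapse = " ")

# Case-insensitive pattern: "interview" or "focus group", with an optional plural "s".
search_pattern <- regex("interviews?|focus groups?", ignore_case = TRUE)

# str_which() returns the positions of the elements that match the pattern.
hits <- str_which(bodies, search_pattern)
matched_files <- names(bodies)[hits]
print(matched_files)
```

Here only a.pdf and c.pdf contain a match. Wrapping the pattern in stringr's regex() is one way to get case-insensitive matching; a pattern built this way should also be usable as the search_pattern argument, since it is passed straight to str_which().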

Sector

academic

Field(s) of application

social science, qualitative science, meta-analysis

Comments

Twitter handle

@AugustusPendle1
