pdftools for parsing .pdf from a URL - public data mining


rOpenSci package or resource used*

pdftools

URL or code snippet for your use case*

library(httr)      # used to make HTTP GET and POST requests
library(rvest)     # used to parse HTML
library(tidyverse) # data munging
library(pdftools)  # parse PDFs

# |- data ----
url <- "https://www.bsee.gov/guidance-and-regulations/guidance/safety-alerts-programs"

r <- GET(url)

raw_tbl <- r %>% 
  content() %>% 
  html_node("table") %>%
  html_table() %>% 
  as_tibble() %>% 
  janitor::clean_names()

raw_tbl

# |- munge ----
# Convert an alert title to the slug used in its PDF file name:
# lowercase, hyphenate single spaces between words, drop remaining whitespace
normalize_title <- function(title) {
  title %>% 
    tolower() %>% 
    str_replace_all("\\b\\s\\b", "-") %>% 
    str_remove_all("\\s")
}

clean_tbl <- raw_tbl %>% 
  mutate(norm_url = map_chr(title, normalize_title))

clean_tbl$norm_url[[1]]

base_url <- "https://www.bsee.gov/sites/bsee.gov/files/safety-alerts//"

ReadPDF <- function(base_url, pdf_url) {
  pdf_url <- glue::glue("{base_url}{pdf_url}.pdf")
  
  print(pdf_url)
  
  # pdftools time!
  pdf_text(pdf_url)
}

SafeReadPDF <- possibly(ReadPDF, NA)

SafeReadPDF(base_url, clean_tbl$norm_url[[1]])
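
To scrape every alert rather than just the first one, the same safe reader can be mapped over all normalized URLs (a sketch building on the `clean_tbl` and `SafeReadPDF` above; the `all_pdfs` / `ok` names are illustrative):

```r
# Fetch all alerts; failures become NA instead of stopping the loop
all_pdfs <- clean_tbl %>% 
  mutate(pdf_text = map(norm_url, ~ SafeReadPDF(base_url, .x)))

# Keep only the alerts whose PDF was retrieved successfully
ok <- all_pdfs %>% 
  filter(!map_lgl(pdf_text, ~ all(is.na(.x))))
```

Since `possibly()` returns `NA` for any URL that fails (a renamed or missing PDF), the `filter()` step separates clean extractions from the rows that need manual follow-up.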


Sector

Energy.

Field(s) of application

Energy.

What did you do?

Goal: improve safety performance by enabling quicker access to public safety alerts and extracting insights for decision-makers.

Text extracted with pdftools can then be analyzed downstream (e.g., with Hugging Face NLP models) and served via a web app or API (e.g., Shiny, plumber).
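
As a sketch of that serving step (the endpoint path and `slug` parameter are illustrative, not from the original workflow), a minimal plumber API could expose the extracted text:

```r
# plumber.R -- minimal API returning the text of one safety alert PDF
library(plumber)
library(pdftools)

base_url <- "https://www.bsee.gov/sites/bsee.gov/files/safety-alerts//"

#* Return the extracted text of one safety alert
#* @param slug normalized alert title (a norm_url value from clean_tbl)
#* @get /alert
function(slug) {
  pdf_text(paste0(base_url, slug, ".pdf"))
}

# Run with: plumber::pr("plumber.R") %>% plumber::pr_run(port = 8000)
```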

Comments

Having rOpenSci is a blessing :pray:

Twitter handle

@urganmax
