How to process text email data?

I’d like to analyze the R-package-devel Archives. To do that I’ll need to parse the plain-text emails into some sort of data.frame with sender, date, subject, and body.

I’m assuming I’m not the first person ever to do such a thing but I was only able to find the REmail package that interfaces Python email utilities via RPython. Does anyone know of other tools, potentially only using R? Thanks!

Maybe tm.plugin.mail (https://cran.rstudio.com/web/packages/tm.plugin.mail/)? I haven’t tried it though.

Thanks, I’ll try it and report back. :slightly_smiling_face:

So it worked very well, thanks @sckott! I’ll try to remember to update this thread when I get a blog post out but in the meantime here are some notes.

How I downloaded all archives

This uses the polite package for the scraping.


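library("magrittr")

# Introduce yourself to the host before scraping
session <- polite::bow("https://stat.ethz.ch/pipermail/r-package-devel/",
            user_agent = "Your own identity do not copy-paste ;o)")

# Collect the links to the monthly .txt.gz archives
filenames <- polite::scrape(session) %>%
  rvest::html_elements("a") %>%  # xml_nodes() is deprecated; needs rvest >= 1.0.0
  xml2::xml_attr("href") %>%
  .[grepl("\\.txt\\.gz$", .)]

fs::dir_create("archives")

download_one <- function(filename){
  message(filename)
  # pause between requests to go easy on the server
  Sys.sleep(5)
  download.file(glue::glue("https://stat.ethz.ch/pipermail/r-package-devel/{filename}"),
                file.path("archives", filename),
                mode = "wb")  # explicit binary mode for the compressed files
}

purrr::walk(filenames, download_one)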
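
To make the filtering step concrete without hitting the server, here is a minimal base R sketch. The `hrefs` vector is made up for illustration (modeled on what a pipermail index page might link to), and `urls` and `destfiles` are hypothetical names, not part of the code above.

```r
# Made-up hrefs, standing in for what the scraped index page returns
hrefs <- c("2015-May/thread.html",
           "2015-May/date.html",
           "2015-May.txt.gz",
           "2015-June.txt.gz",
           "https://stat.ethz.ch/mailman/listinfo/r-package-devel")

# Keep only the links that end in .txt.gz
filenames <- hrefs[grepl("\\.txt\\.gz$", hrefs)]

# Build the download URL and the local destination for each archive
urls <- paste0("https://stat.ethz.ch/pipermail/r-package-devel/", filenames)
destfiles <- file.path("archives", filenames)

print(data.frame(url = urls, destfile = destfiles))
```

Anchoring the pattern with `$` avoids matching a link that merely contains `.txt.gz` somewhere in the middle.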

How I rectangled all of them

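# archives holds all the txt.gz files
filenames <- fs::dir_ls("archives")
# one output folder per archive, named after the file
folders <- gsub("archives\\/", "", filenames)

# split each mbox archive into one .eml file per message;
# walk2() because this is only called for its side effect
purrr::walk2(filenames, folders,
             tm.plugin.mail::convert_mbox_eml)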

rectangle_email <- function(email){
  # strip quoted replies, non-text parts and signatures
  email <- tm.plugin.mail::removeCitation(email, removeQuoteHeader = TRUE)
  email <- tm.plugin.mail::removeMultipart(email)
  email <- tm.plugin.mail::removeSignature(email)

  # skip messages without a subject
  if(is.null(email$meta$heading)){
    return(NULL)
  }

  # one row per email, body lines pasted into a single string
  tibble::tibble(author = email$meta$author,
                 datetime = as.POSIXct(email$meta$datetimestamp),
                 subject = email$meta$heading,
                 content = as.character(
                   glue::glue_collapse(email$content, "\n")))
}

rectangle_folder <- function(folder){
  emails <- tm::VCorpus(tm::DirSource(folder),
                        readerControl = list(reader = tm.plugin.mail::readMail))
  purrr::map_df(as.list(emails), rectangle_email)
}

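emails <- purrr::map_df(folders, rectangle_folder)
# make sure the output directory exists before writing
fs::dir_create("data")
readr::write_csv(emails, file.path("data", "emails.csv"))
# the per-archive folders of .eml files are no longer needed
fs::dir_delete(folders)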
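
To show what the rectangling step produces, here is a small base R sketch that runs without tm or tm.plugin.mail. `fake_email` is a plain list I made up with the same `content` and `meta` fields the function above reads; it is not a real tm document, and `rectangle_sketch()` is a hypothetical stand-in that skips the cleaning steps.

```r
# Stand-in for a parsed message (NOT a real tm document, just the same fields)
fake_email <- list(
  content = c("Hi all,", "", "Is there a tool for this?"),
  meta = list(author = "Jane Doe <jane@example.org>",
              datetimestamp = as.POSIXct("2018-11-06 10:15:00", tz = "UTC"),
              heading = "[R-pkg-devel] Parsing emails")
)

# Same message but without a subject, to exercise the NULL branch
no_subject <- fake_email
no_subject$meta$heading <- NULL

rectangle_sketch <- function(email) {
  # messages without a subject are dropped
  if (is.null(email$meta$heading)) return(NULL)
  data.frame(author = email$meta$author,
             datetime = email$meta$datetimestamp,
             subject = email$meta$heading,
             content = paste(email$content, collapse = "\n"),
             stringsAsFactors = FALSE)
}

rows <- lapply(list(fake_email, no_subject), rectangle_sketch)
# drop the NULLs, then stack the remaining one-row data frames
emails_sketch <- do.call(rbind, Filter(Negate(is.null), rows))
print(emails_sketch)
```

Only the message with a subject survives, as a single row whose `content` column holds the whole body as one string.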

Now I have a big data.frame and will be able to play with it. :tada:
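
For anyone curious what splitting an mbox archive into individual messages boils down to, here is a naive base R sketch. It assumes each message starts with a line beginning with `From ` (the usual mbox separator); I have not checked that this is how tm.plugin.mail does it internally, and the sample lines are invented.

```r
# Invented mbox content with two messages
mbox_lines <- c(
  "From jane at example.org  Tue May  5 10:00:00 2015",
  "From: jane at example.org (Jane Doe)",
  "Subject: [R-pkg-devel] First question",
  "",
  "Hello!",
  "",
  "From joe at example.org  Tue May  5 11:00:00 2015",
  "From: joe at example.org (Joe Bloggs)",
  "Subject: [R-pkg-devel] First question",
  "",
  "Hi Jane."
)

# A separator line starts with "From " (note the space, so "From:" headers don't match)
starts <- grepl("^From ", mbox_lines)

# Number the messages by counting separators seen so far, then split
message_id <- cumsum(starts)
messages <- split(mbox_lines, message_id)

print(length(messages))
```

This is only the idea; a body line that itself starts with `From ` would be split wrongly, which is one reason to rely on an existing package for real archives.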

Blog post for which I needed to process text email data

Thanks again for the help!
