How to process text email data?

text
text-mining
email

#1

I’d like to analyze the R-package-devel archives. To do that I’ll need to parse the “text emails” into some sort of data.frame with sender, date, subject, and body.

I’m assuming I’m not the first person ever to do such a thing but I was only able to find the REmail package that interfaces Python email utilities via RPython. Does anyone know of other tools, potentially only using R? Thanks!


#2

Maybe tm.plugin.mail (https://cran.rstudio.com/web/packages/tm.plugin.mail/)? I haven’t tried it though.


#3

Thanks, I’ll try it and report back. :slightly_smiling_face:


#4

So it worked very well, thanks @sckott! I’ll try to remember to update this thread when I get a blog post out but in the meantime here are some notes.

How I downloaded all archives

With the polite package:


session <- polite::bow("https://stat.ethz.ch/pipermail/r-package-devel/",
            user_agent = "Your own identity do not copy-paste ;o)")

library("magrittr")

# collect the links to the monthly .txt.gz archives
filenames <- polite::scrape(session) %>%
  rvest::html_elements("a") %>%  # html_nodes() in rvest < 1.0.0
  xml2::xml_attr("href") %>%
  .[grepl("\\.txt\\.gz", .)]

fs::dir_create("archives")

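download_one <- function(filename){
  message(filename)
  # pause between requests to go easy on the server
  Sys.sleep(5)
  download.file(glue::glue("https://stat.ethz.ch/pipermail/r-package-devel/{filename}"),
                file.path("archives", filename),
                mode = "wb")  # binary mode, otherwise the .gz files can get corrupted on Windows
}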

purrr::walk(filenames, download_one)
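
If you want to check the link filter without hitting the server, here is a small offline sketch in base R. The hrefs below are made up to stand in for what the index page returns, so treat them as an assumption rather than the real listing.

```r
# made-up hrefs standing in for the links found on the archive index page
hrefs <- c("2015-May/thread.html", "2015-May.txt.gz",
           "2015-June/date.html", "2015-June.txt.gz")

# keep only the gzipped monthly archives, same pattern as above
filenames <- hrefs[grepl("\\.txt\\.gz", hrefs)]

# the URLs that download_one() would request
urls <- paste0("https://stat.ethz.ch/pipermail/r-package-devel/", filenames)
print(urls)
```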

How I rectangled all of them

# archives holds all the txt.gz files
filenames <- fs::dir_ls("archives")
folders <- gsub("archives\\/", "", filenames)

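# convert each mbox archive into a folder (named after the archive) with one file per email;
# walk2() because we only want the side effect
purrr::walk2(filenames, folders,
             tm.plugin.mail::convert_mbox_eml)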

rectangle_email <- function(email){
  # strip quoted replies, multipart parts and signatures
  email <- tm.plugin.mail::removeCitation(email, removeQuoteHeader = TRUE)
  email <- tm.plugin.mail::removeMultipart(email)
  email <- tm.plugin.mail::removeSignature(email)

  # skip messages without a subject
  if(is.null(email$meta$heading)){
    return(NULL)
  }

  # one row per email
  tibble::tibble(author = email$meta$author,
                 datetime = as.POSIXct(email$meta$datetimestamp),
                 subject = email$meta$heading,
                 content = as.character(
                   glue::glue_collapse(email$content, "\n")))
}

rectangle_folder <- function(folder){
  emails <- tm::VCorpus(tm::DirSource(folder),
                        readerControl = list(reader = tm.plugin.mail::readMail))
  purrr::map_df(as.list(emails), rectangle_email)
}

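emails <- purrr::map_df(folders, rectangle_folder)
fs::dir_create("data")  # make sure the output folder exists
readr::write_csv(emails, file.path("data", "emails.csv"))
# clean up the temporary per-archive folders
fs::dir_delete(folders)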

Now I have a big data.frame and will be able to play with it. :tada:
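
In case anyone wonders what the conversion step does conceptually: an mbox file is one long text file in which each message begins with a line starting with "From " (no colon), followed by headers, a blank line, and the body. Below is a minimal base-R sketch of that idea on a toy example. It is not how tm.plugin.mail implements it, the messages and the helper names (get_header, parse_message) are made up for illustration, and it ignores real-world complications such as folded headers, encodings and multipart messages.

```r
# Toy mbox content: two made-up messages, each starting with a "From " separator line
mbox_lines <- c(
  "From alice at example.org  Mon May  4 10:00:00 2015",
  "From: alice at example.org (Alice)",
  "Date: Mon, 4 May 2015 10:00:00 +0200",
  "Subject: [R-pkg-devel] First question",
  "",
  "Hello list,",
  "how do I do this?",
  "",
  "From bob at example.org  Mon May  4 11:30:00 2015",
  "From: bob at example.org (Bob)",
  "Date: Mon, 4 May 2015 11:30:00 +0200",
  "Subject: Re: [R-pkg-devel] First question",
  "",
  "Like that."
)

# a new message starts at each "From " line ("From:" headers do not match)
starts <- grepl("^From ", mbox_lines)
messages <- split(mbox_lines, cumsum(starts))

# hypothetical helper: value of the first header called `name`
get_header <- function(msg, name) {
  header_end <- which(msg == "")[1]          # headers stop at the first blank line
  headers <- msg[seq_len(header_end - 1)]
  hit <- grep(paste0("^", name, ": "), headers, value = TRUE)[1]
  sub(paste0("^", name, ": "), "", hit)
}

# hypothetical helper: one message -> one-row data.frame
parse_message <- function(msg) {
  header_end <- which(msg == "")[1]
  body <- msg[-seq_len(header_end)]          # everything after the blank line
  data.frame(
    author = get_header(msg, "From"),
    date = get_header(msg, "Date"),
    subject = get_header(msg, "Subject"),
    content = paste(body, collapse = "\n"),
    stringsAsFactors = FALSE
  )
}

toy_emails <- do.call(rbind, lapply(messages, parse_message))
print(toy_emails[, c("author", "subject")])
```

For real archives I would still go with tm.plugin.mail as above, since it also takes care of citations and signatures.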