maelle
February 2, 2019, 10:54am
1
I’d like to analyze the R-package-devel archives. To do that I’ll need to parse “text emails” into some sort of data.frame with sender, date, subject and body.
I’m assuming I’m not the first person ever to do such a thing, but I was only able to find the REmail package, which interfaces Python email utilities via RPython. Does anyone know of other tools, potentially only using R? Thanks!
sckott
February 2, 2019, 4:36pm
2
maelle
February 2, 2019, 5:11pm
3
Thanks, I’ll try it and report back.
maelle
February 4, 2019, 11:03am
4
So it worked very well, thanks @sckott! I’ll try to remember to update this thread when I get a blog post out, but in the meantime here are some notes.
How I downloaded all archives
Using the polite package:

library("magrittr")

# Introduce ourselves to the host and check robots.txt
session <- polite::bow(
  "https://stat.ethz.ch/pipermail/r-package-devel/",
  user_agent = "Your own identity do not copy-paste ;o)"
)

# Keep only the links to the monthly .txt.gz archives
polite::scrape(session) %>%
  rvest::html_nodes("a") %>%
  xml2::xml_attr("href") %>%
  .[grepl("\\.txt\\.gz$", .)] -> filenames

fs::dir_create("archives")

download_one <- function(filename) {
  message(filename)
  Sys.sleep(5)  # pause between requests
  download.file(
    glue::glue("https://stat.ethz.ch/pipermail/r-package-devel/{filename}"),
    file.path("archives", filename),
    mode = "wb"  # the archives are binary (gzip) files
  )
}

purrr::walk(filenames, download_one)
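For a sense of what these files contain: pipermail text archives are, as far as I know, mbox-style text where each message starts with a `From ` line followed by `From:`, `Date:` and `Subject:` headers. Below is a rough base R sketch of splitting such text into a data.frame. The toy `mbox` vector and the helper names `get_header()` and `parse_message()` are made up for illustration, and real archives are messier (multipart messages, quoted replies, body lines that start with "From "), which is what a dedicated package handles.

```r
# Toy mbox-like text; real pipermail archives are assumed to look similar.
mbox <- c(
  "From alice at example.org  Mon Feb  4 11:03:00 2019",
  "From: alice at example.org (Alice)",
  "Date: Mon, 4 Feb 2019 11:03:00 +0100",
  "Subject: [R-pkg-devel] A question",
  "",
  "Hello list,",
  "first message body.",
  "",
  "From bob at example.org  Mon Feb  4 12:10:00 2019",
  "From: bob at example.org (Bob)",
  "Date: Mon, 4 Feb 2019 12:10:00 +0100",
  "Subject: Re: [R-pkg-devel] A question",
  "",
  "Second message body."
)

# A line starting with "From " (no colon) is treated as the start of a message.
starts <- grepl("^From ", mbox)
messages <- unname(split(mbox, cumsum(starts)))

# Hypothetical helper: value of the first header called `name`, or NA.
get_header <- function(lines, name) {
  pattern <- paste0("^", name, ": ")
  hit <- grep(pattern, lines, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  sub(pattern, "", hit[1])
}

# Hypothetical helper: one message (character vector) to a one-row data.frame.
parse_message <- function(lines) {
  blank <- which(lines == "")[1]  # first blank line ends the headers
  if (is.na(blank)) blank <- length(lines) + 1
  headers <- lines[seq_len(blank - 1)]
  body <- lines[-seq_len(blank)]
  data.frame(
    author = get_header(headers, "From"),
    date = get_header(headers, "Date"),
    subject = get_header(headers, "Subject"),
    content = paste(body, collapse = "\n"),
    stringsAsFactors = FALSE
  )
}

toy_emails <- do.call(rbind, lapply(messages, parse_message))
```

On a real archive, calling `readLines()` on the path of a .txt.gz file should give the same kind of character vector, since it reads gzip-compressed files transparently.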
How I rectangled all of them
# archives holds all the txt.gz files
filenames <- fs::dir_ls("archives")
# one output folder per archive, named after the file
folders <- gsub("archives\\/", "", filenames)

# split each mbox archive into one file per email
purrr::walk2(filenames, folders,
             tm.plugin.mail::convert_mbox_eml)

rectangle_email <- function(email) {
  # strip quoted text, multipart parts and signatures
  email <- tm.plugin.mail::removeCitation(email, removeQuoteHeader = TRUE)
  email <- tm.plugin.mail::removeMultipart(email)
  email <- tm.plugin.mail::removeSignature(email)

  # skip emails without a subject
  if (is.null(email$meta$heading)) {
    return(NULL)
  }

  tibble::tibble(
    author = email$meta$author,
    datetime = as.POSIXct(email$meta$datetimestamp),
    subject = email$meta$heading,
    content = as.character(glue::glue_collapse(email$content, "\n"))
  )
}

rectangle_folder <- function(folder) {
  emails <- tm::VCorpus(
    tm::DirSource(folder),
    readerControl = list(reader = tm.plugin.mail::readMail)
  )
  purrr::map_df(as.list(emails), rectangle_email)
}

emails <- purrr::map_df(folders, rectangle_folder)

# make sure the output directory exists before writing
fs::dir_create("data")
readr::write_csv(emails, file.path("data", "emails.csv"))
# clean up the intermediate folders
fs::dir_delete(folders)
Now I have a big data.frame and will be able to play with it.
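As an idea of what "playing with it" could look like, here is a small base R sketch on a toy data.frame with the same four columns (author, datetime, subject, content). All values in `emails_toy` are made up, and the thread grouping by stripped subject is only a crude assumption, not necessarily what a real analysis would use.

```r
# Toy stand-in for the `emails` data.frame; the values are made up.
emails_toy <- data.frame(
  author = c("Alice", "Bob", "Alice", "Carol"),
  datetime = as.POSIXct(
    c("2019-01-15 10:00:00", "2019-01-20 12:30:00",
      "2019-02-02 09:15:00", "2019-02-03 17:45:00"),
    tz = "UTC"
  ),
  subject = c("[R-pkg-devel] CRAN checks", "Re: [R-pkg-devel] CRAN checks",
              "[R-pkg-devel] Vignettes", "Re: [R-pkg-devel] Vignettes"),
  content = c("body text", "body text", "body text", "body text"),
  stringsAsFactors = FALSE
)

# Messages per author, most active first.
per_author <- sort(table(emails_toy$author), decreasing = TRUE)

# Messages per month.
per_month <- table(format(emails_toy$datetime, "%Y-%m"))

# Crude thread grouping: strip any leading "Re: " from the subject.
thread <- sub("^(Re: ?)+", "", emails_toy$subject, ignore.case = TRUE)
per_thread <- table(thread)
```

The same calls should work on the real data after replacing `emails_toy` with `emails`.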
maelle
April 11, 2019, 9:44am
5
Blog post for which I needed to process text email data
No matter how good your docs reading and search engine querying skills are, sometimes as an R package developer you’ll need to ask questions to your peers. Where to find them? R-hub has its own feedback and discussion venues, but what about R package...
Thanks again for the help!