Continuing analyses where you left off in R Markdown notebooks

My R Markdown project notebooks tend to be a bunch of Rmd files that include many exploratory graphs and models, and a lot of data wrangling to get the data into the appropriate form for each. I’ll often want to repeat the data-wrangling steps in a new Rmd before trying something else.

In the past I’ve copy-pasted some of the data-wrangling code into a new file, but it’s easy to lose track of how two copies of the same task gradually diverge. Functionalizing or extracting the data-wrangling code, on the other hand, seems like it can be more effort than it’s worth.

So I created a source_rmd() function (below, and in a gist) that extracts the source code from an *.Rmd and runs it, optionally setting a null graphics device for plots. Now I run this function at the top of notebook files to load the environment of a previous notebook, and start off with wrangled data and models.

Some questions:

  • Am I just being lazy in avoiding functionalizing my code? Is there a good rule of thumb for when to do so? I tend to think it’s when you start wrangling more than one data set the same way.
  • Is there a better approach to workflow within a project?
  • Is this implemented more robustly elsewhere? Any thoughts on improving it?
    • I thought about something that would select specific chunks to run, but decided that this was just the level of specificity at which functionalizing the code and putting it in an external R/ folder makes sense.
  • Is there a good place for this function to live that I might submit a PR?
#' Source the R code from a knitr file, optionally skipping plots
#'
#' @param file the knitr file to source
#' @param skip_plots whether to skip plots. If TRUE (the default), a null
#'   graphics device is set so plots are not rendered
#'
#' @return This function is called for its side effects
#' @export
source_rmd = function(file, skip_plots = TRUE) {
  # Extract the R code from the Rmd into a temporary script
  temp = tempfile(fileext = ".R")
  knitr::purl(file, output = temp)

  if(skip_plots) {
    # Swap in a null graphics device so plotting calls are silently discarded
    old_dev = getOption('device')
    options(device = function(...) {
      .Call("R_GD_nullDevice", PACKAGE = "grDevices")
    })
  }
  source(temp)
  if(skip_plots) {
    # Restore the original graphics device
    options(device = old_dev)
  }
}
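
For example (a minimal usage sketch; the file name and the objects it creates are made up), at the top of a new notebook:

# Hypothetical example: re-run a previous notebook's code so that its
# wrangled data frames and fitted models are available here.
# "01-wrangle-and-fit.Rmd" is a made-up file name.
source_rmd("01-wrangle-and-fit.Rmd")

# Anything that notebook created (a cleaned data frame, a fitted model, ...)
# can now be used directly in this one.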

Heh, I think we might think about this problem in similar ways. I never got the hang of working with knitr/Sweave documents, so for years I’ve used sowsear to do this the other way around (work in an R script file that can always be source()'ed, and generate the Rmd from that).

I think Yihui (the knitr author) is pretty receptive to ideas (check out the massive author list of the package). There’s a version of sowsear in knitr as spin(). But for CRAN acceptance that .Call command is a no-no.
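
To illustrate that script-first workflow (a sketch; the file name and chunk are made up), a plain R script with roxygen-style comments stays source()able and can be spun into an Rmd:

# analysis.R -- an ordinary, source()-able script (hypothetical file)

#' # Wrangle the data
#' Prose for the report goes in roxygen-style #' comments.

#+ wrangle, message=FALSE
dat = transform(mtcars, wt_kg = wt * 453.6)

# The same file can be turned into analysis.Rmd without knitting it:
# knitr::spin("analysis.R", knit = FALSE)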

I think your idea is nice. And I think people downplay how complicated and potentially non-functionalisable analyses can be (or how functionalising can decrease readability in a knitr context).

I actually found that .Call() command in this post of Yihui’s. Maybe pdf(file=NULL) is sufficient.
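
If it is, the device juggling could drop the .Call entirely. A rough sketch of that variant (untested, keeping the rest of the function the same):

source_rmd = function(file, skip_plots = TRUE) {
  temp = tempfile(fileext = ".R")
  knitr::purl(file, output = temp)

  if(skip_plots) {
    # Open a pdf device that writes nowhere, and close it when done,
    # even if source() errors
    pdf(file = NULL)
    on.exit(dev.off(), add = TRUE)
  }
  source(temp)
}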

From @ashander on twitter:

https://twitter.com/jaimedash/status/705050995244474368

Indeed, there’s a middle ground of source()able R files that aren’t functions. One issue is that it can be non-trivial to separate the wrangling from the model fitting, and sometimes you want parts of both. I actually wrote this function yesterday because I didn’t just want the data wrangled; I wanted to overlay the model predictions from a previous analysis on a different set of data.
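
As a hedged illustration of that last use case (all file, object, and column names here are made up), it looks something like:

# Hypothetical sketch: the previous notebook "02-fit-model.Rmd" is assumed
# to have created a fitted model called `fit`; `other_dat` is a different
# data set with columns x and y that we want to overlay predictions on.
source_rmd("02-fit-model.Rmd")

other_dat$pred = predict(fit, newdata = other_dat)

library(ggplot2)
ggplot(other_dat, aes(x, y)) +
  geom_point() +
  geom_line(aes(y = pred))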