The spirit of a drake-friendly workflow

drake

#1

I’m trying to understand the theory of how drake should be used.

I have a data analysis project with a workflow that looks like:

  • Directories like data/raw/, data/clean/, code/prep/, code/analysis/, output/ [contains the .Rmd reporting file]
  • Files in the code directories like 01_clean.R and 02_merge_and_transform.R

And then I make it “reproducible” by creating a master script that calls source("code/prep/01_clean.R"), source("code/analysis/02_merge_and_transform.R"), etc.

How do I move from this workflow to a drake plan? My understanding is that some changes in approach are:

  1. Keep data in the environment rather than in files. So should I stop writing to data/clean/ and reading from it in a subsequent script, and instead just refer to the same object names in subsequent steps/targets?
  2. Focus on functions. So if I currently have 100 lines of code in 01_clean.R that clean a raw data file and write it to a .csv, executed by source() in my current approach, I would need to turn that into function(s)? The project is a one-off with unique data, so I am not invested in creating functions.

I tried putting source("code/prep/01_clean.R") inside my drake_plan call, which worked - great! - but then when I changed something in that file and tried to run make(plan) again, it told me “All targets are already up to date”, not seeing the saved change to the .R file.


#2

tagging @wlandau

(lol min post length is 20 characters and the above wasn’t enough)


#3

It’s very much (2): focus on creating functions. drake inspects function objects and Rmds to look for changes, not scripts, so editing a script that a command source()s does not invalidate any targets. (I think if you wrapped the script path in file_in(), drake might watch that file for changes, but it would not look for changes in other files that the script refers to.)

(1) is also correct, but less fundamental than (2).
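To make the "focus on functions" point concrete, here is a minimal sketch of how the body of a script like 01_clean.R might be wrapped in functions and used in a plan. The function names (clean_data(), merge_and_transform()), the toy data, and the target names are all hypothetical stand-ins, not anything from the original project. In a real project the functions would more likely live in a script that is source()d before calling make().

```r
library(drake)

# Hypothetical replacement for the body of code/prep/01_clean.R:
# instead of reading and writing files, take data in and return data out.
clean_data <- function(raw) {
  cleaned <- raw[!is.na(raw$value), , drop = FALSE]
  cleaned$value <- as.numeric(cleaned$value)
  cleaned
}

# Hypothetical replacement for code/analysis/02_merge_and_transform.R.
merge_and_transform <- function(cleaned) {
  aggregate(value ~ group, data = cleaned, FUN = mean)
}

# Each target is an R object; later targets refer to earlier ones by name.
plan <- drake_plan(
  raw_data = data.frame(
    group = c("a", "a", "b", "b"),
    value = c(1, NA, 3, 5)
  ),
  cleaned = clean_data(raw_data),
  summary_table = merge_and_transform(cleaned)
)

make(plan)
readd(summary_table)
```

Because the commands call functions rather than source() a script, drake can see when clean_data() or merge_and_transform() changes and rebuild only the targets that depend on them.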

The switch to functions as the top-level elements of a data workflow is a big barrier for me, too. I note it’s not unique to drake. remake had the same approach and I think most R-centric systems reach this conclusion. But for this reason I don’t find it worthwhile to convert old projects to drake - too much overhead - or use drake for small projects. The sweet spot for me is new medium-to-large projects.


#4

I think @noamross said it well. For your custom code, the focus is very much on functions in your workspace, which you should be able to source() from scripts beforehand in any order. As for initial input data, it is up to you whether to keep it in your environment or pull from external data files in commands like drake_plan(my_dataset = read_csv(file_in("my_data_file.csv"))). Downstream targets can also be files, which is handy if you have end-stage R Markdown reports to compile or data too big to fit in memory. But for small to medium datasets, you can let drake's caching system save your work for you so you do not even need to think about writing files. In the long run, managing your work can be much smoother this way.

library(drake)
plan <- drake_plan(  # The rows of the plan can be in any order.
  means = colMeans(dataset),
  dataset = data.frame(x = rnorm(100), y = rnorm(100))
)
make(plan)
#> target dataset
#> target means
readd(means) # See also loadd().
#>           x           y 
#> -0.19801315 -0.02546846
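The file-based variant mentioned above can be sketched as follows. This is an illustration under assumptions: the file names and target names are made up, base read.csv() is used in place of read_csv() to avoid an extra dependency, and a tiny input file is written first so the example is self-contained. Note that the paths inside file_in() and file_out() are literal strings, and that running this writes files and a .drake/ cache to the working directory.

```r
library(drake)

# Create a small input file so the sketch can run on its own.
write.csv(
  data.frame(x = c(1, 2, 3), y = c(4, 5, 6)),
  "my_data_file.csv",
  row.names = FALSE
)

plan <- drake_plan(
  # file_in() tells drake to track the input file for changes.
  my_dataset = read.csv(file_in("my_data_file.csv")),
  col_means = colMeans(my_dataset),
  # file_out() tells drake that this command produces an output file.
  saved_means = write.csv(
    as.data.frame(t(col_means)),
    file_out("column_means.csv"),
    row.names = FALSE
  )
)

make(plan)
readd(col_means)
file.exists("column_means.csv")
```

If my_data_file.csv is later edited, the next make(plan) should treat my_dataset and everything downstream of it as out of date.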

#5

At a high level, drake treats your project as a network of interdependent data transformations. Those transformations are the commands in your drake_plan(), and the commands depend on the functions you write and the packages you load. Contrary to tradition, drake takes cues from your R session rather than your code files. The way you store your functions in scripts is up to you.
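The idea that drake takes its cues from the R session can be illustrated with a small sketch. The function summarize_values() and the toy data are hypothetical. The point is only that redefining a function in the session, with no script involved, is enough for drake to rebuild the targets that use it.

```r
library(drake)

# Hypothetical summary function, defined directly in the session.
summarize_values <- function(d) mean(d$x)

plan <- drake_plan(
  dataset = data.frame(x = c(1, 2, 3, 10)),
  result = summarize_values(dataset)
)

make(plan)              # builds dataset and result
first <- readd(result)

make(plan)              # nothing has changed, so nothing is rebuilt

# Redefine the function in the session; no file on disk is touched.
summarize_values <- function(d) median(d$x)

make(plan)              # result uses summarize_values(), so it is rebuilt
second <- readd(result)

c(first = first, second = second)
```

The dataset target does not depend on summarize_values(), so it is left alone on the last make().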


#6

I am sure you have seen this example code and these slides, but I think I should mention them here for passersby. Also, I attempt to provide some guidance in the best practices page.