A great question, Rob!
I think there are some great answers to part one here already, which roughly boil down to: make creating the scaffold for a reproducible project “a button”, and press that button each time you start working on something. As @benmarwick suggested, you never know when something will gather steam and suddenly need to be ramped up, or when a shelved idea will become interesting again, in which case you can dust off attempt #1 without starting from scratch.
This brings us to part two of your question. In my view, when we talk about reproducibility it should include this aspect. We tend to focus reproducibility talk on the numerical results because, yes, that can be hard. But I think it’s important to consider how the project is going to reproduce all the background knowledge, assumptions, and things learned along the way in the mind of someone else, or of yourself, X time intervals into the future.
For sure, documentation is part of it. But I would argue that the way the project is laid out, right down to the code, can also play a large role. For example, when you use `{targets}` you are required to make a plan that describes the flow of data and computations toward an ultimate result. If we set out to write that plan in a human-readable way, making frequent use of well-named custom functions to abstract away low-level detail, using explicit variable names, surfacing the project’s assumed parameters as variables, etc., then the plan itself can become an important source of documentation for the high-level structure of the project.
What’s cool is that this structure is also navigable. So if I put my cursor on a function that says `get_the_data()` or `make_the_model()` etc., I can use the “jump to definition” feature of any modern IDE to go directly to the source code for just that aspect of the project. I know precisely what the inputs of that code are (because it’s a function), and if I want to inspect the inputs I can just read them out of the `{targets}` cache.
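To make that concrete, here is a minimal sketch of what such a plan might look like in a `_targets.R` file. Everything specific in it is hypothetical: the file path, the column names (`age`, `income`), the parameter `minimum_respondent_age`, and the bodies of the helper functions are all invented for illustration. It is one possible layout, not the only way to structure a `{targets}` project.

```r
# _targets.R: a hypothetical pipeline; every name below is illustrative.
library(targets)

# In a real project these helpers would usually live in their own files
# (e.g. under R/); they are defined inline here to keep the sketch whole.
get_the_data <- function(data_path) {
  read.csv(data_path)
}

make_the_model <- function(survey_data, minimum_age) {
  # Keep only respondents at or above the assumed minimum age.
  adults <- survey_data[survey_data$age >= minimum_age, ]
  lm(income ~ age, data = adults)
}

summarise_the_model <- function(income_model) {
  coef(summary(income_model))
}

# The plan: read top to bottom, it describes the flow of the project.
list(
  # Assumed parameters are surfaced as named targets, not buried in code.
  tar_target(survey_data_path, "data/survey.csv", format = "file"),
  tar_target(minimum_respondent_age, 18),
  tar_target(survey_data, get_the_data(survey_data_path)),
  tar_target(income_model, make_the_model(survey_data, minimum_respondent_age)),
  tar_target(model_summary, summarise_the_model(income_model))
)
```

Once `targets::tar_make()` has built the pipeline, something like `targets::tar_read(survey_data)` should return the exact object that was passed to `make_the_model()`, which is what makes inspecting a function's inputs so cheap.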
The code that lives in those functions can also be written prioritising human-readability, e.g. adhering to a common standard (like https://style.tidyverse.org/), using explicit variable names, and avoiding magic numbers and “code-golf” style densely packed expressions. There are tools such as `{lintr}` and `{styler}` that can lower the effort of doing this.
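As a rough illustration of how little effort this can take, the sketch below runs both tools over a project. It assumes the project's functions live in a folder named `R/`, which is a common convention but not a given for your project.

```r
# A sketch only: assumes the project's functions live in a folder named R/.

# Rewrite the files in place to follow the tidyverse style guide
# (styler's default style).
styler::style_dir("R")

# Report the remaining issues that need a human decision.
lintr::lint_dir("R")
```

Because `{styler}` edits files in place, it is probably wise to commit your work first so the changes are easy to review.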
So when you have these layers of structure in your project’s code limiting the scope of what needs to be parsed, and you aid human parsing by using a common dialect and favouring expressions that are easy to parse, I think you really can have something approaching “self-documenting code”. Certainly in my team, when you see a comment in code, you take notice - because comments only appear when something is happening that couldn’t be expressed cleanly.
Also lockfiles. I highly recommend you make an `{renv}` `renv.lock` file for everything before you down tools on it. It can be done in seconds and may save you hours in the future. If you don’t want to bite off all of `{renv}`, you might like `capsule::capshot()` for this.
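For reference, a minimal sketch of what this looks like in practice, assuming you run it from the project root with `{renv}` installed. Exactly which packages get recorded depends on how `{renv}` discovers the project's dependencies, so treat it as a starting point.

```r
# A sketch only, run from the project root.

# Record the packages the project uses, and their versions, in renv.lock.
renv::snapshot()

# Later, on another machine or after a long break, reinstall those versions:
# renv::restore()

# The lighter-weight alternative mentioned above; arguments are left at
# their defaults here, so check the {capsule} documentation for how it
# locates your project's dependencies.
# capsule::capshot()
```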