Workflow best practices

I have two issues that I bump into frequently and could use thoughts on what this community sees as the best way to address them.

The first has to do with how and when to manage the transition from an exploratory analysis to something more fully featured and reproducible. I’m thinking of the “if you have to do something three times, you should write a function” maxim as a way to set up rules of thumb for when to start building more scaffolding around a project. I feel a tension between spending time building too much structure for an analysis that ends up abandoned and not building enough, which then makes refactoring difficult and onerous. Do people have any advice on what has worked for them?

The second, related issue is what breadcrumbs, READMEs, etc. people leave for themselves when they put a project down for a while. I have a project structured around Jenny Bryan’s Analysis API that I had put down for almost a year because of a data dependency. I’m now picking it back up and spending more time than I’d like getting reoriented. To be fair, I hadn’t left it in as good a state as I now wish I had, but I’d still like to avoid this pain in the future. Any thoughts on what people have found helps their future self in these situations?

Cheers,
Rob

7 Likes

For me, the answer to this question has changed as the available tooling (and my familiarity/comfort with it) has lowered the activation energy for setting up something fully reproducible. targets, renv, and git are my go-to frameworks for reproducible analysis, and I use them whenever a project passes the stage of “more than can be done in a file of 100 lines or so”.

The trick is lowering the effort required to start a project this way, and sticking with a common project structure so that you don’t have to re-invent or re-factor with each project. To do so, it’s good to have a project template (or a good enough idea of one that you can just copy your last project and clear parts out).
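
To make that concrete, here is a minimal sketch of what that low-effort starting point might look like, assuming usethis, renv, and targets are installed (the project name is illustrative, not a prescribed template):

```r
# Minimal "new reproducible project" sketch (project name is illustrative)
usethis::create_project("my-analysis")  # basic folder structure + RStudio project file
# (run the remaining lines from inside the new project)
renv::init()                            # project-local library tracked by renv.lock
targets::use_targets()                  # skeleton _targets.R for the pipeline
usethis::use_git()                      # put everything under version control
```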

This is a good talk by @MilesMcBain about setting yourself up so your reproducible workflow helps you do analysis quickly, rather than adding more overhead: Miles McBain - That Feeling of Workflowing [Remote] - YouTube

As for documenting things for yourself, I find this tougher. A good file structure and/or a targets-like workflow graph provides some “self-documentation”, but it definitely has its limits. My main advice for documentation is less about the content than about regular habits: try to document a few bullets or sentences every time you work on a project, so you don’t end up having done a month of work and then having a huge amount to write down.

6 Likes

@noamross - thanks heaps for the reply. Super helpful, and I appreciate the link to Miles’ talk. A follow-up for you: a while back we communicated about data packages after I’d watched your data package talk. How do you feel this approach fits in with a targets-based workflow? Do you still think it’s worthwhile to develop data packages separately? I’m contributing to a project that is collecting similar, but not identical, data over a 5-year period and have experimented with both a CSV approach and DataPackageR. Do you feel the data package is still the best tool to use here? (I know there’s no best!)

cheers

2 Likes

Like Noam, I’m finding targets and renv to be excellent for setting up a reproducible project.

My opinion is that the R package structure is an ideal scaffold for organising a research project. We have a package, rrtools, that generates a generic data-analysis project structure based on the R package layout, with an R Markdown document at the hub of the project. This works especially well if writing a journal article or short report is the goal. We wrote more about this approach in our article Packaging Data Analytical Work Reproducibly Using R (and Friends).

For me, even the tiniest project sketch often starts with running a few lines from rrtools to set up, and then I’ve got the full project structure already, in case the project gains momentum and turns into something viable.
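
For anyone who hasn’t used rrtools, those few lines might look roughly like this (a sketch; the compendium name is illustrative):

```r
# Sketch of an rrtools compendium setup (compendium name is illustrative)
rrtools::use_compendium("mycompendium")  # R-package-style research compendium
rrtools::use_readme_rmd()                # README.Rmd describing the project
rrtools::use_analysis()                  # analysis/ folder with paper, data, figures subfolders
```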

7 Likes

Thanks @benmarwick - much appreciated. Do you have an example template of how you’ve incorporated rrtools together with renv and targets?

1 Like

I have an example of a project that was started with rrtools and then we used renv and targets with it also: https://github.com/parkgayoung/racisminarchy.

We ran through the template-generating steps using rrtools; then, when the project was viable, we added renv and a targets plan, and then iterated on the targets file as we understood more about how it works :slight_smile: I think we still have quite a bit to learn, but so far I’m really pleased with how it helps to organise the workflow and save runtime.
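
For anyone retrofitting an existing rrtools compendium the same way, the additions might look roughly like this (a sketch, not the exact steps taken in that repository):

```r
# Sketch: adding renv and targets to an existing compendium
renv::init()            # capture package dependencies in renv.lock
targets::use_targets()  # create a skeleton _targets.R for the pipeline
# ...edit _targets.R to define the plan, then build it:
targets::tar_make()     # runs only the targets that are out of date
```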

3 Likes

A great question, Rob!

I think there are some great answers to part one here already, which roughly boil down to: make creating the scaffold for a reproducible project “a button”, and press that button each time you start working on something. As @benmarwick suggested, you never know when something will gather steam and suddenly need to be ramped up, OR an idea that was shelved becomes interesting again and you can dust off attempt #1 without starting from scratch.

This brings us to part two of your question. In my view, when we talk about reproducibility, it should include this aspect. We tend to focus reproducibility talk on the numerical results, because yes, that can be hard, but I think it’s important to consider how the project is going to reproduce all the background knowledge, assumptions, and things that were learned along the way in the mind of either someone else or yourself some interval of time into the future.

For sure, documentation is part of it. But I would argue the way the project is laid out, right down to the code, can also play a large role. For example, when you use {targets} you are required to write a plan that describes the flow of data and computations toward an ultimate result. If we set out to write that plan in a human-readable way, making frequent use of well-named custom functions to abstract away low-level detail, using explicit variable names, surfacing the project’s assumed parameters as variables, etc., then the plan itself can become an important source of documentation for the high-level structure of the project.
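
As an illustration, a human-readable plan along those lines might look like this (target and function names are made up for the example; get_the_data() and make_the_model() echo the functions mentioned below):

```r
# _targets.R -- illustrative sketch of a human-readable plan
library(targets)
tar_source()  # load the project's custom functions from R/

list(
  # surfacing assumed parameters as targets makes them visible and trackable
  tar_target(study_years, 2015:2020),
  tar_target(raw_cases, get_the_data(study_years)),
  tar_target(model_data, clean_the_data(raw_cases)),
  tar_target(case_model, make_the_model(model_data)),
  tar_target(report, render_the_report(case_model))
)
```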

What’s cool is this structure is also navigable. So if I put my cursor on a function that says get_the_data() or make_the_model() etc, I can use the “jump to definition” feature of any modern IDE to go directly to the source code for just that aspect of the project. I know precisely what the inputs of that code are (because it’s a function), and if I want to inspect the inputs I can just read them out of the {targets} cache.
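
Inspecting those cached inputs is then a one-liner (target names follow the illustrative plan above):

```r
library(targets)
tar_read(model_data)  # return the cached object directly
tar_load(case_model)  # or load it into the global environment under its own name
```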

The code that lives in those functions can also be written prioritising human readability, e.g. adhering to a common standard (like https://style.tidyverse.org/), using explicit variable names, and avoiding magic numbers and “code-golf”-style densely packed expressions. Tools such as {lintr} and {styler} can lower the effort of doing this.
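
For example, one low-effort way to apply both tools across a project’s functions (a sketch, assuming the code lives under R/):

```r
styler::style_dir("R")  # reformat the code in place to the tidyverse style guide
lintr::lint_dir("R")    # list style and potential correctness issues to review
```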

So when these layers of structure in your project’s code limit the scope of what needs to be parsed, and aid human parsing through a common dialect and expressions that are easy to follow, I think you really can have something approaching “self-documenting code”. Certainly in my team, when you see a comment in code, you take notice, because comments only appear when something that couldn’t be expressed cleanly is happening.

Also lockfiles. Highly recommend you make an {renv} renv.lock file for everything before you down tools on it. It can be done in seconds and may save you hours in the future. If you don’t want to bite off all of {renv}, you might like capsule::capshot() for this.
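
Either route is quick (a sketch, run from the project root):

```r
renv::snapshot()    # write/refresh renv.lock for the project library
# or, without adopting the full renv workflow:
capsule::capshot()  # write a lockfile from the packages the project actually uses
```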

5 Likes

I just want to say thanks to @noamross @benmarwick & @MilesMcBain for taking the time to write thoughtful and helpful responses, including the links and videos therein (who knew the mental model of fish factories could be useful for flow?). The responses really helped me take inventory of what is working for me now and what can move my process forward. Again, much appreciated.

3 Likes

Sorry, I missed your question earlier. I disagree with @benmarwick a little bit in that I think with renv and targets the ideal folder structure is a bit different from, but not far from, an R package (here’s an example of a template we use for projects: https://github.com/ecohealthalliance/container-template). As for a data package, that really depends. Tools like arrow and duckdb have changed a lot in the past few years, and if you are able to process data into CSV or parquet files that can be read from some remote URL, there’s little need for a package unless your data-processing functions are elaborate (@MilesMcBain may disagree a bit here, I think he wraps every data source in a package). But you may need a targets-like pipeline to regularly process the incoming data into that form.
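
As a sketch of what that “no package needed” path can look like (the URL, file names, and columns are hypothetical):

```r
# Read a processed CSV straight from a remote URL
cases <- readr::read_csv("https://example.org/data/cases.csv")

# Or query a parquet file with duckdb without loading it all into memory
con <- DBI::dbConnect(duckdb::duckdb())
recent <- DBI::dbGetQuery(
  con,
  "SELECT * FROM read_parquet('data/cases.parquet') WHERE year >= 2020"
)
DBI::dbDisconnect(con, shutdown = TRUE)
```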

3 Likes

@noamross Your comment surprised me, because although most of our data sources are packaged, the driver behind that is not some agreed policy that it’s the “thing to do”. I feel like it’s been a natural progression from other practices that relate to avoiding copy-pasting code between projects and making sure we have standard ways of addressing and calculating standard measures.

A new dataset might enter one project as a file or an ad-hoc query, and is later ported to a package as it becomes clear that it is something we are going to refer to regularly across multiple projects. It might take more than one subsequent use for that to happen! The prior projects tend not to be updated to use the “packaged” version of the data unless they are dusted off again for new work.

3 Likes

Also to be clear, we actually only have one traditional “data package” with a dataset embedded in the package folder structure itself. The suite is predominantly packages that wrap database queries or fetches from S3 buckets.

1 Like

Sorry to misrepresent your approach, Miles! What you describe makes a lot of sense to me. In the past I’ve over-engineered by wrapping data sources as packages when just importing data from a database, URL, or S3 works fine.

1 Like