Resources on Project Directory Organization

I recently started supervising some student projects and was looking for a simple primer on project folder organization and file naming that I could give to students with little scientific computing background. So I asked Twitter.

A few people asked me to compile the responses (and I got a lot of responses! Thanks y’all!). So here’s a short summary of what I found:

  • The simplest and most straightforward post I found that described the how and why of directory organization was this one by @richfitz. A nice follow-up by Marcela Diaz gives an example good for novices. Ultimately I ended up sending these to my students as I thought they required the least prior knowledge.
  • There’s a nice lesson created at the Reproducible Science Hackathon that has two short slide shows - one on folder organization and one on file naming. It may be inevitable, but the one on file naming goes a bit into regular expressions and globbing, which was a bit advanced for my students. This lesson is being incorporated into the Data Carpentry curriculum.
  • Some of that lesson is drawn from these notes from @jennybc’s STAT545 course. As usual, Jenny is excellent at dealing with how things evolve in real projects, and lends two important concepts:
    • “Quarantining the crazy” - putting unstructured or poorly organized data in its own read-only folder
    • Dealing with different types of analysis scripts (processing, analyzing, visualizing).
  • A nice short lesson on this topic is part of the R for Reproducible Science lesson at Software Carpentry. It focuses on using RStudio projects and recommends using the ProjectTemplate package to automate setup.
  • @cboettig has a post from 2012 about using the R package structure as an organizing framework for projects. It presumes some knowledge of R packages and git.
  • This idea is extended in a write-up from last year’s rOpenSci unconference that describes the idea of a research compendium, which includes not just a project directory but also other tools for reproducibility, described in standard ways. It covers the use of some more advanced tools like Docker.
  • The in-progress draft of the *Carpentry community’s paper “Good Enough Practices for Scientific Computing” has a section on project organization. It’s slightly more Python-centric and includes the concept of different directories for source code and compiled code.
  • Everyone has pointed me to Noble’s “A Quick Guide to Organizing Computational Biology Projects”. I like this one because it has an explicit discussion of a lab notebook as part of the project directory. It’s a little more focused on computational experiments than data-driven projects.
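
To make the common thread in these resources concrete, here is a minimal sketch in Python of setting up a project skeleton with a quarantined, read-only raw data folder. The folder names and the `create_project` and `quarantine` helpers are my own hypothetical choices for illustration; none of the guides above prescribes this exact layout.

```python
import stat
import tempfile
from pathlib import Path

# Hypothetical folder names, chosen for illustration only.
SUBDIRS = [
    "data/raw",        # untouched original data (the "quarantine" folder)
    "data/processed",  # cleaned data produced by scripts
    "scripts",         # processing, analysis, and visualization scripts
    "figures",         # generated plots
    "docs",            # notes, lab notebook, manuscript
]


def create_project(root):
    """Create an empty project skeleton under `root` and return its path."""
    root = Path(root)
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root


def quarantine(raw_dir):
    """Remove write permission from every file below `raw_dir`."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    for path in Path(raw_dir).rglob("*"):
        if path.is_file():
            path.chmod(path.stat().st_mode & ~write_bits)


if __name__ == "__main__":
    # Demonstrate in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        project = create_project(Path(tmp) / "my-project")
        for folder in sorted(p for p in project.rglob("*") if p.is_dir()):
            print(folder.relative_to(project))
```

In practice you would copy the original data files into `data/raw` first and then call `quarantine` on that folder, so that later scripts can read the files but cannot overwrite them by accident.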

Ultimately, most of these are pretty similar conceptually, which one might expect, as most of them came from within the *Carpentry/rOpenSci communities. (Lots of people like to link to this tweet from Vince Buffalo.) The main difference I found is that some describe things from a more ad-hoc, evolving-project perspective (e.g., the STAT545 curriculum), while others describe a static unit of reproducible research (the research compendium write-up).

The other area of difference is that R-focused guides (at least on the more advanced side) describe knitr/R Markdown-type literate programming documents as part of the project structure, while the Python and language-agnostic guides do not. That may just be because bias in my network led me to miss some examples, or it may have to do with the way people use IPython/Jupyter notebooks.