Associating an Rmd file with a commit

My projects are generally structured as R packages with an additional “notebook” folder in which I keep many Rmd’s of my day-to-day explorations. They actually often start out as just Rmd’s until I start moving code out of the Rmd’s into the R/ folder. However, since the codebase still changes a lot, the Rmd’s break as the project continues. It’s not really worth it to update all the Rmd’s as the codebase changes.

I’d like some way to associate a given Rmd with a commit, so I can easily return to the state at which it ran properly. Ideally one would do this so that one could automatically compile one, or all, of the Rmd’s under the state the project was in at their creation.

I thought of putting the SHA in the Rmd’s YAML, but this means an extra commit on top of each (workable but a bit inelegant, not great for clean commit history). If one amends the previous commit with this, it changes the SHA. Another option would be to just come up with a commit message ahead of time and put this in the YAML as well. Other thoughts, or suggestions on a different workflow?


This may not be exactly what you’re looking for but, if you save a rendered .md or .html, you could put some git2r code in there to preserve the SHA, just like we tend to do with devtools::session_info(). I showed that in this SO answer: http://stackoverflow.com/a/32304217/2825349
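
For anyone without git2r at hand, the same idea can be sketched with the plain git command line. This is a swap for the git2r approach, not what the linked answer does. The function name `stamp_output` and the comment format are made up for illustration. The sketch assumes it is run inside the repository right after rendering. It only looks at tracked files, so once the rendered file itself is committed, re-rendering it will also show up as an uncommitted change.

```shell
#!/bin/sh
# stamp_output FILE (hypothetical helper)
# Append the current commit SHA to a rendered file as an HTML comment,
# flagging the stamp when tracked files differ from that commit.
stamp_output() {
    sha=$(git rev-parse HEAD) || return 1
    # git diff --quiet exits non-zero when tracked files differ from HEAD
    dirty=$(git diff --quiet HEAD || echo ' (uncommitted changes)')
    printf '\n<!-- rendered at commit %s%s -->\n' "$sha" "$dirty" >> "$1"
}
```

Something like `stamp_output foo.md` after rendering foo.Rmd would leave the SHA at the bottom of foo.md, next to whatever session_info() output is already there.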

Useful! But sometimes a compiled Rmd doesn’t have enough info to re-run (you hide some code blocks, etc.)

@ashander suggested that each notebook entry should have a branch, which may be the way to go. However, I’d like it to be relatively easy to browse through previous notebook entries, so perhaps the compiled markdown or HTML files should be committed to the master branch (in which case @jennybc’s suggestion would work).

I meant: save foo.Rmd AND (foo.md or foo.html). So you have source and at least one successfully rendered thing w/ SHA.


I don’t like branches for this sort of thing, because of how hard it is to get stuff on two different branches in front of your eyeballs at the same time.


Hey Noam,

Great question, something I’ve struggled with too. In my experience, the best thing is just to commit the output file as well and not worry about it.

As the codebase changes and breaks things, you still have the output file with the last successful results, and it’s really easy to determine the SHA of the commit in which that output file was created by looking on GitHub or running a quick git log on the file. You can then check out that hash to return to the state that produced that output.
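
A minimal sketch of that lookup, assuming it is run from the repository root. The helper name `last_sha` is hypothetical; the git log options are standard.

```shell
#!/bin/sh
# last_sha FILE (hypothetical helper)
# Print the full SHA of the most recent commit that touched FILE.
last_sha() {
    git log -n 1 --format=%H -- "$1"
}
```

With that defined, `git checkout "$(last_sha notebook/day100.md)"` would put the working tree in the state that produced the output (the path is only an example), and `git checkout -` switches back afterwards.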

Actually even my more heavyweight approach in my notebook, which does capture SHAs and add them to the sidebar, does this after the fact, with Jekyll looking up the SHA hash of the post. Does that make sense or am I missing something?

In a related thread, I’m really curious how you all decide to organize outputs vs inputs. I’ve defaulted to a single directory that has both Rmd and md files, along with whatever filename_files/figure-github stuff rmarkdown defaults to, rather than messing with any knitr options. It avoids some working dir hell but it does make for cluttered directories…


A few thoughts:

  • instead of keeping the day’s explorations in many different files viewable at the current HEAD, why not a single file, e.g., notebook.Rmd that you modify without fear because it’s under git? your commit log IS your notebook…
  • Carl’s output suggestion is nice, but a lighter weight alternative is a pre-commit hook that adds the SHA (e.g., to any .Rmd in notebook) to the file itself. Here’s an example I googled up
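
One possible shape for such a hook, as a rough sketch rather than the example linked above. A commit cannot contain its own SHA (writing it into the file would change the SHA, as noted at the top of the thread), so this version records the parent commit instead. The function name `stamp_parent_sha`, the `parent_sha` comment format, and the `notebook/` location are assumptions for illustration.

```shell
#!/bin/sh
# Sketch of the body of .git/hooks/pre-commit.
# stamp_parent_sha (hypothetical helper): for every staged .Rmd under
# notebook/, record the commit the new commit will be built on.
stamp_parent_sha() {
    # No parent exists before the very first commit; nothing to stamp then.
    parent=$(git rev-parse --verify -q HEAD 2>/dev/null) || return 0
    git diff --cached --name-only --diff-filter=ACM -- 'notebook/*.Rmd' |
    while IFS= read -r f; do
        # Drop any earlier stamp, then append a fresh one as an HTML comment.
        grep -v '^<!-- parent_sha: ' "$f" > "$f.tmp"
        printf '<!-- parent_sha: %s -->\n' "$parent" >> "$f.tmp"
        mv "$f.tmp" "$f"
        # Re-stage the whole file; this also stages any unstaged edits to it.
        git add -- "$f"
    done
}
```

Saved as an executable .git/hooks/pre-commit, the file would end with a bare `stamp_parent_sha` call. Checking out the recorded parent gives the codebase as it was just before the notebook entry was committed.
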

+1 I think I misunderstood the notebook purpose. But I’m still not certain why @noamross is using this structure. See my reply below.

@ashander Interesting idea about just sticking with one file, but I always feel I end up investigating several different threads, which it is nice to be able both to view / edit simultaneously and to preserve unique history for.

surely setting up pre-commit hooks counts as heavier-weight than doing nothing ;-). I assume you are already version controlling both the .Rmd (input) and .md output? I know that’s against convention but (as @jennybc has also convinced me) it’s definitely pro-convenience. To make this concrete, here’s a quick example of a current project: https://github.com/cboettig/multiple_uncertainty/tree/master/inst/scripts . The only thing I don’t particularly like about this workflow is that the resulting file structure is pretty opaque.

Good question @noamross

Another question, after you determine the SHA for a given .Rmd file, is how do you easily run the code in the file. I wonder if you could have a Makefile (or remake?) with commands that will simply take a path for the .Rmd file, and then the make command will grab the SHA, check out that commit, execute, then return back to master, or wherever you started. Or is this not the use case you’re going after?
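
A rough sketch of what such a command might look like as a shell function that a Makefile target could call. It swaps the "check out, run, switch back" sequence for a throwaway `git worktree`, so the main checkout never moves and uncommitted work is left alone. The function name `render_at_commit` is made up; the default render step (devtools::install() followed by rmarkdown::render()) is an assumption based on the package structure described in this thread. Note that devtools::install() replaces whatever version of the package is currently installed. The sketch assumes a git version that has `git worktree` and that it is run from the repository root.

```shell
#!/bin/sh
# render_at_commit FILE [CMD...] (hypothetical helper)
# Check out the last commit that touched FILE into a temporary worktree,
# run CMD FILE there, and print the worktree path so outputs can be inspected.
render_at_commit() {
    file=$1
    shift
    sha=$(git log -n 1 --format=%H -- "$file")
    [ -n "$sha" ] || { echo "no commit found for $file" >&2; return 1; }
    tree=$(mktemp -d)
    git worktree add --detach "$tree" "$sha" >&2 || return 1
    # Default command (an assumption): install the package as it was at that
    # commit, then render the file passed as the trailing argument.
    [ "$#" -gt 0 ] || set -- Rscript -e \
        'devtools::install(); rmarkdown::render(commandArgs(TRUE)[1])'
    (cd "$tree" && "$@" "$file") >&2 || return 1
    echo "$tree"
}
```

For example, `render_at_commit notebook/day100.Rmd` would print a temporary directory holding the rendered output; `git worktree remove` on that path cleans it up when done.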

I like @ashander’s pre-commit hook idea a lot! I just modified the one in devtools that helps you keep README.Rmd and README.md in sync. The hook makes it pretty hard to commit just one of those files or to commit .md when .Rmd has a later modification date. Maybe that could be extended for general foo.Rmd / foo.md pairs and to do this SHA insertion … But my lack of bash skills makes this painful. Ratio of stackoverflow searches to lines of working code is very high :flushed:.


@sckott My strategy has been to look up the version I wanted to reproduce (e.g. by browsing in GitHub, https://github.com/cboettig/multiple_uncertainty/commits/master/inst/scripts/carl_fig3.md), check out the SHA, and then do devtools::install() and rmarkdown::render("stuff.Rmd") (or just the knit button). As Noam mentions, having the package structure and calling install is usually sufficient (provided one didn’t forget to commit them at the same time) to make sure it renders with the right version of the functions.

@ashander @jennybc ooh, apparently the pre-commit idea went right over my head. I do like how devtools does this for README.Rmd files already, so you can’t commit out-of-sync versions. Maybe we can just crib the code from there?


fair! but I meant lighter repo- and directory-clutter-wise, and not having to git log -- day100.html for the SHA. In the case where you’re using the output (e.g., in gh-pages), your way seems preferable for sure. If you’re not committing output you can of course do git log -- day100.Rmd and check out the commit.

oh cool, nice-looking fix. I’d imagine that for more complex hooks, using a richer scripting language than sh would help immensely.

ah, very true. I don’t often actually find myself checking out old versions to repeat stuff, so I guess I’m content to manually do git log when I need to. In practice I’m more likely to first identify the SHA I need by browsing the outputs (“now how did I get that figure??”) anyhow. That’s why I’ve gotten pretty fond of committing output mds. It’s also why I tend to work with output as md over html: it’s convenient to page through earlier versions of output files on GitHub this way. It’s really hard to page through old versions of rendered html to find that time you got that weird figure or result (and of course paging through the source-file history doesn’t help).

I’m working with jupyter notebooks now for my teaching too, and got a shock when I first realized that committing the notebook to git commits all the outputs (base64-encoded figures, etc.) as well as the input values (though not the runtime environment). This can make for some pretty heavyweight repos, but they ain’t cluttered!


I am versioning both .md and .Rmd. Mostly because it makes the .md easy to share on github. My notebook directory looks like a Jekyll _posts directory, should I in theory decide to share it that way.

The pre-commit hook idea is good! I may try that. I also might start using @jennybc’s simpler solution of getting the SHA via git2r, and just storing it in a comment or a YAML data block with the input filename and session_info() output.

I feel I am solidifying my right to the title: Our Lady of Intermediate Markdown.


as additional information about what language to use for pre-commit hooks, it should be noted that you can create Rscript based git hooks that are essentially R code, avoiding bash or sh code when needed. I have a pre-commit hook that increments the package version with every commit, but having something that adds the SHA to an MD file when generated sounds like a really, really good idea


I also found another potential solution that is a bit of a hybrid between some of the solutions discussed above.

Hadley’s install_github adds the SHA of a package to the DESCRIPTION file upon install, and this info is displayed in devtools::session_info. So a pre- or post-commit hook could be used to add the SHA to the package DESCRIPTION file, list the packages used at the end of the document using session_info, and then commit the markdown or html generated. Or at least it sounds feasible in my head; there may be some slight oddities in practice, but I think it gets pretty close.