Creating a package to reproduce an academic paper

ericpgreen · June 12, 2018, 4:07pm

My objective is to share a fully reproducible analysis that will allow anyone to generate our pre-print. I’ve tried using packrat in the past, but I just could not seem to get it to work right. In many ways, the easiest option I’ve found is to create a Docker image file that will fully re-create the environment for reproducing the paper.

But I’m also intrigued by the idea of creating a package for the paper. I’m wondering how to create a package for maximum reproducibility. Specifically, I’m looking for advice on external dependencies. In his book “R Packages”, Hadley has the following advice:

You almost always want to specify a minimum version rather than an exact version (MASS (== 7.3.0)). Since R can’t have multiple versions of the same package loaded at the same time, specifying an exact dependency dramatically increases the chance of conflicting versions.

Following this advice (MASS (>= 7.3.0)), it seems possible that a future user could use a different version and potentially encounter broken code or obtain different results.
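To make the difference concrete, here is a minimal base-R sketch of how a minimum-version and an exact-version constraint treat a newer install. The helper `meets_constraint()` and the version numbers are hypothetical, written only for illustration; this is not how R itself processes DESCRIPTION fields.

```r
# Hypothetical helper (not part of R or devtools): does an installed version
# satisfy a DESCRIPTION-style constraint such as ">= 7.3.0" or "== 7.3.0"?
meets_constraint <- function(installed, op, required) {
  cmp <- utils::compareVersion(installed, required)  # -1, 0, or 1
  switch(op,
    ">=" = cmp >= 0,
    "==" = cmp == 0,
    stop("unsupported operator: ", op)
  )
}

# A future user has a newer MASS than the one the paper was written with.
newer_ok_with_minimum <- meets_constraint("7.3.50", ">=", "7.3.0")  # accepted
newer_ok_with_exact   <- meets_constraint("7.3.50", "==", "7.3.0")  # rejected
```

The minimum-version form lets the newer install through, which is exactly where results could drift; the exact form blocks it, at the cost of the conflicts Hadley describes.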

Is there a way to create a package to always reproduce a paper exactly, or is the answer to use Docker?

noamross · June 12, 2018, 4:44pm

I’d say using Docker or a similar environment is the best approach, especially if you have external system dependencies. That said, I try to aim for layered reproducibility, such that you maximize the reproducibility of the project within each successive tool.

Making an R package is a very good way of wrapping up the pieces of the project for re-use, with dependencies specified in the DESCRIPTION. This has its limits, as you mention above. So I prefer to also have a Dockerfile that creates the environment, based on a Rocker image. This has the advantage of being easy to deploy on a lot of services, too. Rocker Docker images with fixed R versions use MRAN as the CRAN mirror with a fixed date, which should ensure that the same versions of packages always install. Of course something could happen to MRAN or Rocker or another upstream dependency that would prevent building this environment, so as a third layer, one can store the binary of the build in the repository as well.

Each successive layer is slightly more involved to use, and less easy to incorporate into other work, but more likely to be reproducible in the future.

cboettig · June 12, 2018, 5:06pm

I really like Noam’s notion of layered reproducibility. I see the additional layers as “fall back” options, which can be useful down the road as software changes. To grab a concrete example, I’ve tried to do this here: https://github.com/cboettig/noise-phenomena .

In principle, most readers familiar with R can probably just copy-paste code out of the Rmd appendix successfully with no concern about versions.

Layer 2 is the R package: you can install the compendium locally using install_github(), which will grab dependencies automatically. Note that dependencies used only for formatting-related niceties (extrafont, hrbrthemes) are listed as “Suggests” and not installed by default. The code should run without them; you’ll just get different fonts. install_github("cboettig/noise-phenomena", dep=TRUE) will try to install these as well, but note that some system dependencies (just fonts, in this case) are required for these, so I thought it helpful to make them “soft dependencies”.

Layer 3 is the Dockerfile. This should provide a stable, complete environment, along with system dependencies. This layer should be easy to deploy either in a version-stable environment (e.g. rocker/binder:3.4.4) or the latest environment, for comparison. Short-term this is probably too much of a nuisance for most users who won’t have Docker installed, but long-term this should be an easier way to manage dependencies than packrat.

Layer 4 is perhaps the Binder runtime environment. In some sense this is the most comprehensive, since it includes ‘hardware’ as well as the runtime environment, but it should also have the least friction in that it doesn’t require anything to “install”, just click the “Binder” button. This just runs the Docker image on the Binder platform and drops you into a familiar RStudio setting.
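A “soft dependency” of this kind is usually guarded at run time. The following is a minimal sketch of that pattern, not code from the noise-phenomena compendium; `pick_font()`, the font name, and the "sans" fallback are illustrative assumptions.

```r
# Sketch of a "soft dependency": use an optional package listed under
# Suggests if it is installed, and fall back to a default otherwise.
# `pick_font()`, "Roboto Condensed", and "sans" are illustrative assumptions.
pick_font <- function(optional_pkg = "extrafont",
                      preferred = "Roboto Condensed") {
  if (requireNamespace(optional_pkg, quietly = TRUE)) {
    preferred  # optional package is installed: ask for the nicer font
  } else {
    "sans"     # not installed: the code still runs, just with a default font
  }
}

font <- pick_font()
message("Plotting with font family: ", font)
```

Because requireNamespace() returns FALSE instead of raising an error, the analysis keeps running when the suggested package is missing.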

I’m still experimenting with these approaches myself, so feedback would be great as to what makes sense and what seems too complex or fragile in this approach. Thanks for your question!

mbjones · June 12, 2018, 5:06pm

I identify with the ‘layered reproducibility’ approach as well, and think codifying the R code that is used in a package is a great step. The DESCRIPTION file, however, seems more oriented toward describing the requirements for a package than the runtime environment, which is what we need for provenance. So, we have written the recordr package in R to capture the dependencies in use at the time some code was run and to package that up in a datapack container, along with input and output data dependencies, and a full provenance trace using PROV.
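The gap between declared requirements and the actual runtime environment can be illustrated with plain base R. This sketch is not the recordr API; `record_environment()` and the output file name are hypothetical, and it only lists R itself plus the non-base packages loaded in the session.

```r
# Sketch: record the versions of R and of every non-base package that is
# attached or loaded at run time. Plain base R, NOT the recordr API.
record_environment <- function() {
  info <- utils::sessionInfo()
  pkgs <- c(info$otherPkgs, info$loadedOnly)  # either element may be NULL
  data.frame(
    package = c("R", names(pkgs)),
    version = c(
      as.character(getRversion()),
      vapply(pkgs, function(p) as.character(p$Version), character(1))
    ),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

env <- record_environment()
# Write the record next to the results so the run can be traced later.
env_file <- file.path(tempdir(), "environment.csv")
write.csv(env, env_file, row.names = FALSE)
```

Unlike a DESCRIPTION file, this captures what was actually in use when the code ran, though it says nothing about system libraries or input data.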

But we too have turned to Docker to capture the computational environment in which an analysis was performed, with specific versions of software used. The vagaries of the OS and library layer on various platforms mean that knowing the R packages alone is insufficient. We’ve been working on a system called Whole Tale that supports running ‘environments’ as Docker containers, and being able to launch and publish those from a web-based dashboard to archival repositories in DataONE, Globus, and other environments. A recent paper describes our vision, approach, and prototype:

Brinckman, A., K. Chard, N. Gaffney, M. Hategan, M. B. Jones, K. Kowalik, S. Kulasekaran, B. Ludäscher, B. D. Mecum, J. Nabrzyski, V. Stodden, I. J. Taylor, M. J. Turk, and K. Turner. 2018. Computing environments for reproducibility: Capturing the “Whole Tale.” Future Generation Computer Systems.

Right now we support RStudio and Jupyter as our major environments, but the system supports arbitrary user-contributed Docker containers as well. We’ve been experimenting with the system and have a functional alpha release, and are in the midst of a complete UI revamp after a round of usability testing. But we are looking for people that have interest in reproducible science and might want to explore publishing their models this way, and we’re planning a reproducibility hackathon this Fall to bring interested parties together to explore various approaches. So let me know if anyone has interest in that.

Matt

ericpgreen · June 12, 2018, 5:12pm

Thanks for these thoughts. I was feeling like repo + package + Docker was redundant, but now I can give it a name: layered reproducibility! The concept makes sense. @mbjones, it will be good to check out Whole Tale as well.

cboettig · June 12, 2018, 5:16pm

@ericpgreen quick postscript to your specific question: you can also use packrat as a “Layer” in this approach (e.g. packrat stand alone, or packrat + Docker). The issue you quote from Hadley does not apply to packrat management. Hadley’s talking about general-use packages. packrat doesn’t care what you put in a DESCRIPTION (you don’t need a DESCRIPTION), it locks all versions and installs a custom library that’s isolated from the user library, so there is no potential for conflict!

The two main problems with packrat, as others have said, are (a) it doesn’t capture “system” dependencies: non-R libraries needed by R packages (e.g. libv8-dev, libxml2-dev, etc.), and (b) it is super cumbersome to use because it installs a duplicate copy of every package and those packages’ dependencies. That could easily be 100s to 1000s of packages in a typical data science analysis. On Linux at least, deploying packrat also means those packages must be installed from source, which can take hours. No fun.

npjc · June 13, 2018, 3:07am

Good discussion here so I felt I may be able to ask for feedback on another solution.

Current model: The paper represents a stopping point. One cannot (easily?) stop time, so one tries to capture a snapshot and put it in a [package, container, etc.]; time marches on and eventually it doesn’t work anymore.

Different model: The paper is live. It keeps up with the progression of time, tools, and whatnot until one of the following happens:

1. The opportunity cost of maintaining this becomes too large and the paper now becomes static, without a need for reproducibility per se.
2. A debate or systematic analysis allows the key analysis in the paper to become a subset of a new paper (which then goes live and incurs the burden of reproducibility).
3. Consensus is reached and the findings are now referred to only in domain-specific presumptions or as historical markers.

Apologies if I’m off topic but I appreciate the feedback.

cboettig · June 13, 2018, 3:56am

@npjc These are good questions. I’ll give my take but others may have different opinions and insights as well.

I think containers do essentially give us an easy way to stop time, if done correctly. For instance, the Rocker versioned images are fixed to a build date corresponding to the last day said version of R was current. So if my paper runs in, say, rocker/binder:3.4.3, then that container will always run R 3.4.3, and always contain any CRAN / Bioconductor packages fixed to whatever version was current on the day before the 3.4.4 release (i.e. 2018-03-15, as logged here). Similarly, system libraries installed from apt are fixed to the Debian release, i.e. they will always come from debian:stretch (aka debian:9), even after later Debian releases. Sure, it’s an open question how long people will be able to deploy today’s Docker containers on the machines of the future, or to deploy old versions of Debian (the recipe isn’t Docker-specific anyhow), but so far past versions of Debian have been pretty persistent. No snapshot is perfect, and Docker doesn’t capture details at the level of the kernel or the hardware, but unless you’re writing papers specifically about hardware or kernel performance we can hope those don’t impact reproducibility…

The dynamic model you pose is very interesting, but I don’t believe it has to be an entirely orthogonal approach. Like you say, for me, it’s often hard to justify the time to “update” an old analysis to the latest code base. However, my future work will often build on parts of a previous work, and I hope (often with little evidence to show for it) that some of the code will be useful down the road not just for me but for others, so there’s value in allowing it to evolve.

As I’ve mentioned in a related thread on the RStudio Community, I try to put most non-trivial code related to an analysis into a separate package. I try to keep a “research compendium” that is associated with a particular paper or result relatively free of custom functions: i.e. ideally only .Rmd notebooks, no R/ directory with namespace etc. I try to move these custom functions into a separate R package that I can depend on (linking by GitHub release for version stability) across multiple projects, and keep these up-to-date to the extent that I and others are using them. By treating anything meant for possible re-use as “software” that can both evolve and be snap-shotted in time, separate from a particular paper which will inevitably fossilize at something close to its published form, I think I get something a bit more dynamic and hopefully reproducible.

Not sure if that made any sense; but just the practice I have currently evolved towards by dint of trying various other permutations. Thoughts welcome!

brycem · June 13, 2018, 6:25pm

Everything mentioned here (R packaging, docker, packrat, WholeTale, etc) is great. The only other thing I’d like to add as a way to increase the odds of reproducing your research would be to consider using Continuous Integration (such as Travis CI, Circle CI, etc.).

When I’ve really needed to know if another person can run my project/analysis/whatever, I’ve turned on Travis CI to make sure my R Markdown documents build in what is effectively Travis CI’s clean-room environment. As a start, you could have your CI service render your R Markdown and, as a step up, use testthat and/or assertr to fail the build job if one or more results aren’t obtained (e.g., fail if the mean of some quantity isn’t within some range of a known value).
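A result check of that kind can be sketched in base R as follows. The simulated data, the reference value, the tolerance, and the helper `check_within()` are all made up for illustration; testthat or assertr could express the same idea in their own idioms.

```r
# Sketch of a result check a CI job could run after rendering the analysis.
# The data, reference value, and tolerance are made up for illustration.
set.seed(42)                          # fix the RNG so the check is repeatable
x <- rnorm(1000, mean = 10, sd = 2)

# Hypothetical helper: error out if `value` is too far from `expected`.
check_within <- function(value, expected, tolerance) {
  if (!is.finite(value) || abs(value - expected) > tolerance) {
    stop(sprintf("Result %.4f is not within %.4f of expected %.4f",
                 value, tolerance, expected), call. = FALSE)
  }
  invisible(TRUE)
}

# An uncaught error makes `Rscript` exit non-zero, which fails the CI job.
check_within(mean(x), expected = 10, tolerance = 0.5)
```

Run as the last step of the build script, this turns a silently changed result into a failed build.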

I’m not sure about integrating some of the other tools such as rocker with Travis CI but maybe others have some good ideas.

cboettig · June 13, 2018, 7:03pm

That’s a great point @brycem! One of the wrinkles in using CI is that there’s no obvious way to “freeze” the versions of software (though packrat might work) when using the default builds. This can be a good thing in that it’s a good way to catch the fact that things may be broken by some upstream software update.

The other trick with using CI is making sure that it actually continues to run – once an analysis is finished, I’m not likely to make any commits to the repository.

To address the latter, CircleCI provides a relatively convenient way to schedule your repo for weekly builds using a cron job style syntax in the config yaml. This is much easier than having to ping the CI service yourself.

To address the former issue, you can always have the CI system run the Docker container directly, rather than running the code on its “native” system. I tend to do this on CircleCI anyway since it doesn’t have a native “R” environment. Since containers can be version-locked or fixed to latest, this makes it easy to test against a fixed version and the latest simultaneously.

brycem · June 13, 2018, 7:25pm

Awesome! That’ll come in super handy in the future I bet. This neatly dovetails into what @npjc was mentioning (“Living paper”).

seabbs · June 14, 2018, 4:37pm

This can also be done with Travis CI, again using cron (from the More options → Settings section in the web interface). It comes in very handy for checking packages.

brycem · June 14, 2018, 6:42pm

Nice! I’m learning all the things in this thread.

benmarwick · June 18, 2018, 3:14am

Yes, I do this also and have found it saves a huge amount of the frustration that comes from dealing with little oddities of Travis/CircleCI.

annakrystalli · June 18, 2018, 8:23am

It makes so much sense to use R package conventions and development, dependency management, documentation, validation and testing tools! I feel rrtools goes a long way to addressing many of the additional requirements in a research context that have been discussed here.

It helps set up your project as a research compendium (see notes from runconf17 on this topic), which can include a rocker image to reproduce the analysis and manages functions (including documentation and CI) as a package and analyses loosely as vignettes (there’s a few options).

I do wonder, however, whether the R community would benefit from a feature like Python’s nbval, a py.test plugin to validate Jupyter notebooks.

januz · October 17, 2018, 11:38pm

Thanks for the detailed description! I’m in the process of preparing my first reproducible paper and your explanations clear up most of my questions regarding “freezing” package versions. Nice touch with the Binder integration, too…

I have one last question regarding package versions though: What about cases in which one wants/needs to use an older package version than the latest one available for the specified R version for at least one of the used packages?

Thanks!

cboettig · October 18, 2018, 5:55am

Yup, great question! Using a Docker image you can specify which version of R you would like. Those images will also ensure you get the same version of R packages as were current then as well. e.g. the Dockerfile used by Binder in the noise-phenomena paper is locked at R 3.5.0: https://github.com/cboettig/noise-phenomena/blob/81278891a2625d50cbb8c9e2bf18ec53f271017a/Dockerfile

noamross · October 18, 2018, 1:09pm

In addition to fixing the R version, as @cboettig describes, you can specify dates and versions of packages in a couple of different ways. In Rocker containers, the R packages are fixed via the snapshotted MRAN repository at the last date that version of R was available. Setting the MRAN environment variable when running the container will let you install things as of that date. devtools gives you ways of installing specific package versions individually. The best way to do this is to create a Dockerfile that installs the versions you want, for instance:

# Start with a Rocker container.
# All pre-installed packages will be the versions on CRAN the day before 3.5.1 came out.
FROM rocker/tidyverse:3.5.0
LABEL maintainer="you@you.you"

# Install the following packages at the versions available on this date
# (80% sure this is the right way, or do I need options() @cboettig?)
ENV MRAN=https://mran.microsoft.com/snapshot/2018-07-01
RUN install2.r pkg1 pkg2  

#install a specific version of a package from CRAN
RUN Rscript -e "devtools::install_version('pkg3', version = '1.0.2')" 

#install a specific version of a package from GitHub using a commit hash
RUN Rscript -e "devtools::install_github('owner/repo@86h756')"

cboettig · October 18, 2018, 2:58pm

yup!

Of course you can also use packrat in your repo (imho not necessary if things work when all packages are up-to-date with the MRAN snapshot of CRAN, but it can be a good option if you have lots of packages that maybe haven’t been updated in a while and thus don’t match the versions at any given snapshot date).

januz · October 18, 2018, 7:11pm

Thank you, @noamross and @cboettig for explaining the various ways to fix package versions. Much appreciated!
