A few thoughts:
I think a simple checklist of the most critical things to check on all analysis code before manuscript submission would be super helpful. There are a lot of niceties that could be included, but perhaps fewer necessities, especially for students and others who are relatively new to R. I think there’s also a distinction between guidelines for data/code/archival quality and guidelines for stats/viz/analysis quality; those are related but not entirely overlapping.
Relatedly: I think there’s a difference between code that is central to the findings of a manuscript (say, for theoretical or simulation studies) and code that is used in the presentation of results. For example, an undergrad or graduate student who has analyzed their data and written it all up in an Rmd might not need the code to be fully functionalized, refactored, and unit tested, but I think it is absolutely still critical that it is documented and organized so that it at the very least runs easily on another machine without much fuss.

In classes, I have Travis check that student assignments (submitted via PR on GitHub) knit successfully, which requires machine-readable documentation of non-base packages (including any installed from GitHub), data in the repository (with proper relative paths!) or pulled from the cloud, and working syntax. I also have everything strictly linted with lintr to enforce basic code style conventions.

For (often DNA sequencing-based) analyses with large data files (10s to 1000s of GB), the primary analysis scripts are pulled out into R scripts and run separately, and the intermediate results are saved as a csv when possible and Rdata when necessary; these then become the data loaded at the top of the Rmd for further munging, visualization, stats, etc. Only the code in the Rmd has to run successfully to keep Travis happy. I haven’t yet found a good way to automate testing of scripts that take days to run on a huge server. I suppose linting those too would be a fair start, and perhaps encouraging more liberal use of stopifnot() throughout?
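To make that last pattern concrete, here is a rough sketch of a long-running script with stopifnot() guards that hands off intermediates to the Rmd. File names and column names are placeholders, not from any real project:

```r
## heavy_analysis.R -- run separately on the server, not knitted by CI
## (illustrative only; paths and columns are made up)
library(readr)

raw <- read_csv("data/big_sequencing_summary.csv")

# Fail fast if the input isn't what we expect, rather than days later
stopifnot(
  nrow(raw) > 0,
  all(c("sample_id", "count") %in% names(raw)),
  !any(duplicated(raw$sample_id))
)

summarized <- aggregate(count ~ sample_id, data = raw, FUN = sum)

# Sanity-check the result before writing it out
stopifnot(nrow(summarized) == length(unique(raw$sample_id)))

# Save intermediates: csv when possible, Rdata when necessary
write_csv(summarized, "output/summarized_counts.csv")
save(summarized, file = "output/summarized_counts.Rdata")
```

The Rmd then just starts with something like `summarized <- readr::read_csv("output/summarized_counts.csv")`, so everything downstream stays light enough for CI to knit.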
Repos that are associated with manuscripts get a version tag and are archived to Zenodo, which mints a DOI, which we then cite in the manuscript.
I think a big part of doing code review on these types of projects efficiently and effectively is having a good default compendium-type structure, as @noamross and @cboettig mentioned. I have personally felt a little torn between some of the different options out there (ProjectTemplate, rrtools, the others listed at https://github.com/ropensci/rrrpkg, the Compendium Google Doc, etc.).
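For reference, the rrtools route is roughly a couple of scaffolding calls (going from memory of its README, so treat the exact function names and resulting layout as approximate):

```r
# Scaffold a research compendium with rrtools (calls as I recall them from
# the rrtools README -- double-check against the current docs)
# devtools::install_github("benmarwick/rrtools")

rrtools::use_compendium("pkgname")  # package skeleton: DESCRIPTION, R/, etc.
rrtools::use_readme_rmd()           # README.Rmd describing the compendium
rrtools::use_analysis()             # analysis/ with paper/, data/, figures/
rrtools::use_travis()               # CI config that tries to render the paper
```

The general idea (package skeleton at the top level, everything manuscript-specific under analysis/) maps fairly directly onto the compendium structure described at the rrrpkg link above.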