How do you review code that accompanies a research project or paper? Help rOpenSci plan a Community Call

Hi there,

And thank you for the interesting discussion!

I wanted to share some of the steps we follow on the Scientific Computing Team at NCEAS when we work with scientists to archive their products, which often consist of a set of data inputs, scripts, and outputs (data and/or figures).

  1. We try to run the code; yep, it sounds trivial, but it already lets us check whether we have all the necessary libraries and sourced code, such as scripts containing custom functions (see the first sketch after this list). Moreover, I think the most important thing this step checks is data access. To be able to run an analytical script, you need access to all the input files, which can be problematic in an external review, especially if you process large datasets that might themselves be the result of a data collation effort. It can also be problematic for internal review if you do not have a centralized way of managing your data (e.g. on a server with shared directories).
  2. Once we can run the script(s), we check that we get the same results as the output files that were provided to us (second sketch below). If this goes well, we move on to the next step; otherwise we start a discussion with the scientists. This mainly catches version issues (in either the scripts or the data) as well as runtime environment differences (pretty rare in our case, as we often set up our collaborators on our analytical server).
  3. Then we start to look at the code in more detail, but I would not say we do an in-depth review of the code, as some of it is very specialized, using complex models (which probably raises the question of how to clearly scope the code review process for reviewers). So far, for this archiving step, we have mainly focused on improving code commenting to make sure other scientists can understand what is going on in the code. We also ask our scientists to describe their workflow well when there are several parts/scripts to their analysis (still looking for the best tool to do so!).
  4. Finally, and this is a work in progress, we would like to help scientists modify their code so that instead of reading data from local file systems it pulls data directly from repositories (third sketch below). This was one motivation for starting to develop the metajam R package (https://nceas.github.io/metajam/) with @isteves and Mitchell Maier, aiming to provide simple functions to do exactly that. We are not quite there yet, and it might be outside the scope of this discussion, although with the growing requirement to archive data alongside publications, it might be an interesting recommendation to facilitate the code review process.
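
To make step 1 more concrete, here is a minimal sketch of the kind of pre-flight check we might run before trying someone's analysis; the package names, file paths, and script name are made up for illustration:

```r
# Hypothetical pre-flight check: do we have the packages the script needs,
# and can we actually see its input files? (names and paths are examples only)
required_pkgs  <- c("dplyr", "ggplot2")            # packages the script loads
required_files <- c("data/site_measurements.csv")  # inputs the script reads

missing_pkgs <- required_pkgs[!vapply(required_pkgs, requireNamespace,
                                      logical(1), quietly = TRUE)]
if (length(missing_pkgs) > 0) {
  stop("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}

missing_files <- required_files[!file.exists(required_files)]
if (length(missing_files) > 0) {
  stop("Cannot find input files: ", paste(missing_files, collapse = ", "))
}

# If both checks pass, try actually running the analysis
source("analysis/01_clean_data.R")
```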
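
For step 2, comparing what we regenerate against what was provided can be as simple as comparing file checksums, falling back to comparing the parsed tables when they differ. A sketch with made-up file paths, using base R only:

```r
# Hypothetical comparison of a regenerated output against the one provided to us
provided    <- "outputs_provided/model_results.csv"     # file sent by the scientist (example path)
regenerated <- "outputs_regenerated/model_results.csv"  # file produced by our re-run (example path)

sums <- tools::md5sum(c(provided, regenerated))
if (sums[[1]] == sums[[2]]) {
  message("Outputs match exactly.")
} else {
  # If the raw files differ, compare the parsed tables to see where they diverge
  print(all.equal(read.csv(provided), read.csv(regenerated)))
}
```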
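
And for step 4, the kind of change we have in mind (reading a dataset straight from a repository rather than from a local copy) looks roughly like this with metajam; the dataset URL below is a placeholder, not a real identifier:

```r
# Sketch of pulling a dataset directly from a repository with metajam,
# instead of reading a local copy. The URL is a placeholder, not a real dataset.
library(metajam)

data_url  <- "https://cn.dataone.org/cn/v2/resolve/urn:uuid:EXAMPLE"  # placeholder identifier
local_dir <- "data"

# Download the data file plus its metadata into a dedicated folder
data_folder <- download_d1_data(data_url = data_url, path = local_dir)

# Read the data and its metadata back into R (returned as a list)
my_data <- read_d1_files(data_folder)
my_data$data  # the data frame itself; metadata comes along in the same list
```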

Some other thoughts: to me it seems that code optimization is different from code review (as asked here). Refactoring code to make it modular and reusable, or profiling it to make it more efficient, is often a hard sell to scientists, especially when they are “done” with their analysis. I agree that training and recommendations on how to structure your projects (I need to check some of the suggestions in this thread!!) seem the way to go, and that these changes would be hard to achieve via code review. That being said, I think one output of the review process should be to define and set up unit tests on the scripts that have been reviewed; this would be a good way to keep a check on further developments/improvements (see the sketch below).
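
As an illustration of that last point, a reviewed helper function could get a small testthat file that locks in its current behaviour; the function and expected values here are entirely made up:

```r
# Hypothetical example: a tiny testthat file pinning down the behaviour of a
# custom function that came out of the review. Function and values are made up.
library(testthat)

clean_temperature <- function(x) {
  x[x < -90 | x > 60] <- NA   # flag physically implausible air temperatures (deg C)
  x
}

test_that("implausible temperatures are set to NA", {
  expect_equal(clean_temperature(c(15, 120, -100)), c(15, NA, NA))
})

test_that("valid temperatures are left untouched", {
  expect_equal(clean_temperature(c(-10, 0, 35.5)), c(-10, 0, 35.5))
})
```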

I hope this is useful!

Julien
