A research compendium and methylation raw data

These are interesting questions, and the best answer depends on what we're trying to optimise: ease of engagement with the analysis by other researchers, cognitive load on the original researcher preparing the compendium, and so on, as well as the community's expectations about what counts as data and what counts as method. I know this can get a bit messy in the mangle of practice, and it can be tricky to strike a balance where the organisation follows best practices without feeling unnatural and over-complicated. I'm not familiar with the norms of data sharing and organisation in molecular biology - perhaps Jenny and the others have some more specific insights from their experience with this. But based on Gustavo's description above, I'd suggest putting these items in data/data_raw/:

  • a general CSV file describing the samples along with their clinical information
  • the series of IDAT files (a proprietary binary format from Illumina) containing the output from the scanner.

And then other data files that you derive from these CSV and IDAT files can go in data/data_derived/. For example, you say that you work with R binary representations of the data, so if you process the IDAT files and save the output as .RData or .rds files, I'd store those .RData or .rds files in data/data_derived/.
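As a rough sketch of what that raw -> derived step might look like in R, assuming you use the Bioconductor minfi package and that your IDATs sit under data/data_raw/ (the paths and object names here are just placeholders, and this obviously only runs against your actual IDAT files):

```r
# Sketch: read raw IDAT files once, cache an R binary copy in data/data_derived/.
# Assumes the Bioconductor 'minfi' package; adjust paths to your compendium.
library(minfi)

# Read all IDAT pairs found under data/data_raw/ into an RGChannelSet
rg_set <- read.metharray.exp(base = "data/data_raw")

# Save the derived object, making it obvious this file was generated, not collected
saveRDS(rg_set, file = "data/data_derived/rg_set.rds")

# Later scripts can reload it without re-parsing the IDATs
rg_set <- readRDS("data/data_derived/rg_set.rds")
```

Using .rds with saveRDS()/readRDS() rather than save()/load() has the nice property that the reader chooses the object name when loading.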

This makes it nice and clear to other people browsing your compendium which files you started with, direct from the instrument/clinic, and which files you generated as a result of decisions you made during your analysis. It will then be easy for others to see your decisions and make a judgement about how reliable they were. If these are landmark data that are likely to be very widely used, then perhaps they should be prepared as a data package by themselves. But assuming this is a compendium for a research paper, I think it's better to make the workflow clear with raw -> derived in the file structure.

My suggestion would be for all R scripts for preprocessing, filtering, etc. to go in analysis/scripts/ (is this the same as your option 2?). Generally the literature treats separating data from method as a good habit, so keeping scripts in a separate directory from the data files helps with this. It will make it easier for others looking at your project to quickly see how it all fits together; for example, if you number your scripts 001_..., 002_..., then it's clear what order to run them in. If your scripts are scattered across analysis/ and data/, it may be harder for other researchers to know how to use them to reproduce and extend your work.
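To make that concrete, a compendium laid out along these lines might look something like the tree below - every file name here is an invented placeholder, not a convention:

```text
analysis/
├── paper/                   # manuscript (R Markdown, figures)
└── scripts/
    ├── 001_read_idat.R      # raw IDATs -> .rds
    ├── 002_filter_probes.R  # QC and filtering
    └── 003_model.R          # statistical analysis
data/
├── data_raw/
│   ├── samples.csv          # samples with clinical information
│   └── *.idat               # scanner output (untouched)
└── data_derived/
    └── rg_set.rds           # generated by 001_read_idat.R
```

The numbering makes the execution order self-documenting, and the raw/derived split makes provenance obvious at a glance.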

If you apply the same set of preprocessing and filtering steps repeatedly to many data files, it may be more efficient to implement those steps as documented functions in the R/ directory of the compendium-package, rather than as scripts.
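For instance, a filtering step shared by several scripts could live in R/ as a small documented function. This is only an illustration - the function name, argument, and threshold below are invented, not an established API:

```r
#' Drop probes with too many missing values.
#'
#' A hypothetical helper for the compendium package's R/ directory: keeps
#' only the rows (probes) of a methylation matrix whose proportion of NAs
#' is below `max_na`. Name and default are placeholders for illustration.
#'
#' @param beta_matrix numeric matrix, probes in rows, samples in columns
#' @param max_na maximum tolerated proportion of NAs per probe
#' @return the matrix with high-missingness probes removed
filter_probes <- function(beta_matrix, max_na = 0.05) {
  na_rate <- rowMeans(is.na(beta_matrix))          # per-probe NA proportion
  beta_matrix[na_rate < max_na, , drop = FALSE]    # keep rows below threshold
}
```

Because it lives in R/ with roxygen documentation, the step gets documented, tested, and versioned along with the rest of the package, instead of being copy-pasted between scripts.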

A Makefile or R Markdown file that puts the preprocessing, filtering, and analysis steps in order should probably go in the analysis/ directory.
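A minimal Makefile along those lines might chain the steps like this - the script and output names are placeholders to be renamed to match your actual files:

```make
# Placeholder targets: rename to match your actual scripts and outputs.
all: analysis/paper/paper.pdf

data/data_derived/rg_set.rds: analysis/scripts/001_read_idat.R
	Rscript analysis/scripts/001_read_idat.R

data/data_derived/filtered.rds: analysis/scripts/002_filter_probes.R data/data_derived/rg_set.rds
	Rscript analysis/scripts/002_filter_probes.R

analysis/paper/paper.pdf: analysis/paper/paper.Rmd data/data_derived/filtered.rds
	Rscript -e 'rmarkdown::render("analysis/paper/paper.Rmd")'
```

A nice side effect is that make only reruns a step when one of its inputs has changed, so the dependency graph doubles as documentation of the workflow.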

Incidentally, the rrrpkg essay has a new version in the form of a preprint here: https://peerj.com/preprints/3192/, soon to appear in The American Statistician.

We also have an R package, https://github.com/benmarwick/rrtools, based on the rrrpkg ideas, to help you quickly get started making a basic compendium suitable for reproducible research with R.

Let us know how you go!