A research compendium and methylation raw data



Hello everybody,

I am currently working as a bioinformatician at a molecular biology lab. I am very interested in reproducible research in general, and I find the research compendium idea to be quite an important building block for the development of better data analysis methodologies. Thus, I wanted to test the idea by myself on a small data analysis collaboration I am currently working on.

Most of my work is based on the analysis of DNA methylation microarrays. The raw data I usually receive consists of a general CSV file describing the samples along with their clinical information, and a series of IDAT files (a proprietary binary format from Illumina) containing the output from the scanner. There are many papers in the literature describing analysis pipelines for this type of project, and most of them agree on several initial steps involving data preprocessing, quality checks and probe filtering based on different criteria.

On the rrrpkg page describing a research compendium, the data folder is reserved for raw data in standard formats, that is, the formats an R package expects there: .tsv, .R, etc. At first, and based on Pakillo’s template, I decided to create a data_raw folder where I could store my raw data, and then write some R scripts for cleaning it and storing the result in the data folder.

The problem is that analysis of this type of data usually starts from beta values, which are indicative of DNA methylation and are obtained after the aforementioned preprocessing and filtering. These steps are designed in such a way that they are easier to tackle if we work with R binary representations of the data than if we try to load the data from standard CSV files. So I am in the process of deciding whether to:

  1. Put the preprocessing and filtering code in the data_raw folder scripts, and consider it as something necessary to create the real raw data, which is going to be the starting point for the analysis project. In this case, all the preprocessing and filtering logic would be stored in the data_raw folder as R scripts. I do not know whether I should also put these steps inside a Makefile in the analysis folder. And there is yet another problem: entries in the data folder should be documented, but if they are dynamically generated, that would lead to a bit of manual wrangling each time we decide to change the filtering criteria or the preprocessing method.
  2. Simply use the R scripts in the data_raw folder to clean and tidy our inputs, and put them inside the data folder. Then, all preprocessing and filtering would be done using scripts in the analysis folder. This is appealing, since sometimes we may be testing whether a certain preprocessing method is more suitable than another.
  3. Use the scripts in the data_raw folder to create a standard representation of the minimal data necessary for recreating the binary objects we need for preprocessing and filtering. In my case, I would generate CSV (or standard alternative) files for the Red and Green signals of the arrays. Afterwards, I can work from there, provided I am able to reconstruct the binary objects.
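For context, the preprocessing steps I mentioned can be sketched with the Bioconductor minfi package roughly like this (paths are placeholders, and preprocessNoob is just one of several preprocessing methods one might test):

```r
# Minimal sketch of the IDAT -> beta values pipeline using minfi.
# Folder paths are hypothetical.
library(minfi)

# Read the sample sheet (the CSV with clinical information) and the IDAT files
targets <- read.metharray.sheet("data_raw")        # looks for a *.csv sample sheet
rgset   <- read.metharray.exp(targets = targets)   # returns an RGChannelSet

# Preprocess (Noob background correction, as one example method) and obtain
# beta values, the usual starting point for downstream analysis
mset  <- preprocessNoob(rgset)
betas <- getBeta(mset)
```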

It is no more than a technical issue, but it has made me think about what “raw data” really is in my projects, and I would much appreciate any hint, help or personal use case that could help me draw the line between the original data and its analysis.

Thank you very much in advance,

Gustavo F. Bayon


Any thoughts on this, @benmarwick @jennybc @cboettig @jhollist?


These are interesting questions, and the best answer depends on what we’re trying to optimise - ease of engagement in the analysis by other researchers, cognitive load on the original researcher preparing the compendium, and so on - and on what the community expectations are about what counts as data and what counts as method. I know this can get a bit messy in the mangle of practice, and it can be tricky to strike a balance of organisation that follows best practices without feeling too unnatural and complicated. I am not familiar with the norms of data sharing and organisation in molecular biology; perhaps Jenny and the others have more specific insights from their experience. But based on Gustavo’s description above, I’d suggest putting these items in data/data_raw/:

  • general CSV file describing the samples along their clinical information
  • the series of IDAT files (proprietary binary format from Illumina) containing the output from the scanner.

And then other data files that you derive from these CSV and IDAT files can go in data/data_derived. For example, you say that you work with R binary representations of the data, so if you process the IDAT files and save the output as .RData or .rds files, I’d store those .RData or .rds files in data/data_derived.
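As an illustration of that raw -> derived convention (object and file names here are just placeholders, and `mset` stands for whatever binary representation your preprocessing produces):

```r
# Illustrative only: save a processed object under data/data_derived/ so the
# raw IDATs and sample-sheet CSV in data/data_raw/ stay untouched.
dir.create("data/data_derived", recursive = TRUE, showWarnings = FALSE)
saveRDS(mset, file = "data/data_derived/mset.rds")

# Later analysis scripts then start from the derived object:
mset <- readRDS("data/data_derived/mset.rds")
```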

This makes it nice and clear to other people browsing your compendium which files you started with, direct from the instrument/clinic, and which files you generated as a result of decisions you made about your analysis. Then it will be easy for others to see your decisions and make judgements about how reliable they were. If these are landmark data that are likely to be very widely used, then perhaps they should be prepared to go in a data package by themselves. But assuming this is a compendium for a research paper, I think it’s better to make the workflow clear with raw -> derived in the file structure.

My suggestion would be for all R scripts for preprocessing, filtering, etc. to go in analysis/scripts/ (is this the same as your option 2?). Generally I see in the literature that it is a good habit to separate data from method, and keeping scripts in a separate directory from the data files helps with this. It will also make it easier for others looking at your project to quickly see how it all fits together; for example, if you number your scripts 001_..., 002_..., then it’s clear what order to run them in. If your scripts are scattered across analysis/ and data/, then it may be harder for other researchers to know how to make use of them to reproduce and extend your work.

If you do a set of preprocessing and filtering steps repeatedly on many data files, it may be more efficient for these steps to be documented functions in the R/ directory of the compendium-package, rather than scripts.
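For example (all names here are hypothetical), a filtering step that gets repeated across datasets might be promoted from a script to a documented, exported function in R/:

```r
# Hypothetical R/filter_by_detection.R in the compendium-package:
# a repeated preprocessing step written as a documented function.

#' Filter probes by detection p-value
#'
#' Keeps only the probes whose detection p-value is below `threshold`
#' in every sample.
#'
#' @param detp Numeric matrix of detection p-values (probes x samples).
#' @param threshold Maximum acceptable detection p-value.
#' @return Logical vector marking the probes to keep.
#' @export
filter_by_detection <- function(detp, threshold = 0.01) {
  rowSums(detp < threshold) == ncol(detp)
}
```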

A makefile or R Markdown file that puts the preprocessing, filtering and analysis steps in order should probably go in the analysis/ directory.
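A minimal analysis/Makefile expressing that ordering might look something like this (script and file names are placeholders):

```makefile
# Hypothetical analysis/Makefile: each stage depends on the previous output.
all: output/results.csv

data/data_derived/mset.rds: scripts/001_preprocess.R
	Rscript scripts/001_preprocess.R

output/results.csv: scripts/002_analyse.R data/data_derived/mset.rds
	Rscript scripts/002_analyse.R
```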

Incidentally, the rrrpkg essay has a new version in the form of a preprint here: https://peerj.com/preprints/3192/, soon to appear in The American Statistician.

We also have an R pkg https://github.com/benmarwick/rrtools, based on the rrrpkg ideas, to help with quickly getting started on making a basic compendium suitable for doing reproducible research with R.
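Getting started looks roughly like this (check the rrtools README for the current function names, as the package is still evolving):

```r
# Sketch of bootstrapping a compendium with rrtools; "pkgname" is a placeholder.
devtools::install_github("benmarwick/rrtools")

rrtools::use_compendium("pkgname")   # create the basic compendium skeleton
rrtools::use_analysis()              # add the analysis/ directory structure
rrtools::use_dockerfile()            # add a Dockerfile for reproducibility
```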

Let us know how you go!


First of all, let me thank you for the thorough answer. Sometimes one is not able to see beyond a small technical problem, and does not realize that it could raise more profound concerns and thoughts in other people tackling the same situations. I did not think this could be so interesting.

I have usually worked in all my projects with a bunch of R, Python and bash scripts orchestrated by Snakemake (by the way, it is a very good tool). I have tried to keep a consistent folder hierarchy in all of them, and I was quite happy with that. It is not uncommon at my lab to have to look at very old projects and to find oneself in a non-reproducible situation (I especially suffered from a version change in R/Bioconductor affecting the majority of my codebase). I have played with Docker and Vagrant in order to preserve reproducibility, but I have often found myself wondering about best practices in the field. That is when I came across the concept of a research compendium based on an R package structure. I would not say that I am 100% convinced about it, but I think it is definitely a step in the right direction.

Sorry about the chatter. Now, for something completely different. In my projects, I have always thought of the filtering and preprocessing as steps in the analysis, because it is not uncommon to test several preprocessing algorithms on the same raw data. Thus, I would feel more comfortable having my IDATs in some kind of data_raw folder and then coding the whole pipeline to work from those. I would put my scripts inside the analysis/scripts folder, and create an analysis/Snakefile describing the pipeline. In this approach, I would leave the data folder empty, I guess.

The problem is, I really like the idea of having tidy, clean data in the data folder so a user could just use the data() function to access it. But I still cannot figure out how one could consider preprocessing as part of the pipeline and, at the same time, generate objects in data/ with their corresponding documentation. I think that maybe I am pushing the compendium’s limits. In the compendium paper, I see data/ as a starting point: something documented and fixed from which we start the analysis. But my projects do not usually take this form.

Unless… I store the IDATs in a data_raw folder, use some scripts to load the dataset into its binary representation (an RGChannelSet in the minfi package), and then store this object in the data/ folder using the .RData format. Sorry, I am currently brainstorming, but could this be a valid approach? I think it is kind of a compromise among the options I initially described.
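One way to sketch that compromise (file, object and dataset names are all hypothetical; in older setups devtools::use_data() plays the role of usethis::use_data()):

```r
# Hypothetical data_raw/make_rgset.R: build the binary object from the IDATs
# and register it as package data, so users can later call data(rgset).
library(minfi)

rgset <- read.metharray.exp(base = "data_raw")  # RGChannelSet from the IDATs
usethis::use_data(rgset, overwrite = TRUE)      # writes data/rgset.rda

# And in R/data.R, roxygen documentation for the stored object:
#' Raw red/green channel intensities for the study samples
#'
#' An RGChannelSet built from the IDAT files in data_raw/.
#' @format An RGChannelSet object from the minfi package.
"rgset"
```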

By the way, is it better to store raw data in data/data_raw/ (as you have proposed) or in data/raw_data (as in the README from the rrtools package)?

As a final note, I would like to say that I love the idea of integrating the compendium and the Dockerfile so that continuous integration uses the Docker image to execute the workflow. That is exactly the point I would love to reach. If my oldest projects had been designed that way, I would at least have a mechanism to know whether they still work today.
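For what it’s worth, a minimal sketch of that idea, assuming a rocker base image and an rrtools-style layout (the image tag and the rendered file path are placeholders):

```dockerfile
# Hypothetical Dockerfile for running the compendium's workflow in CI.
FROM rocker/verse:3.4.2

# Copy the compendium into the image and install it with its dependencies
COPY . /compendium
WORKDIR /compendium
RUN R -e "devtools::install('.', dependencies = TRUE)"

# Render the analysis (path is illustrative)
CMD ["R", "-e", "rmarkdown::render('analysis/paper/paper.Rmd')"]
```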

The small project that raised this question will probably be put on hold, but I am going to search for another candidate on which to set up a proof of concept. I will try to keep you informed about my attempt.

Thanks again.


Yes, it seems like you’re on the right track. It can be a challenge to organise preprocessing and the main analysis in a way that is both conventional for your field and as transparent and logical as possible.

Please don’t feel too constrained by the rrtools model; for example, data_raw and raw_data are interchangeable (my mistake for being inconsistent!), and if it makes more sense for you to have data/data_raw, data/data_intermediate and data/data_final, then go for it.

Our most important messages with rrtools are more generic: separation of data and method, transparency of workflow, re-usability of data and method by other researchers, and so on. It sounds like your approach is pretty well aligned with those principles, although you face challenges we have not seen in the work that led to rrtools. It would be great to take a look at a repo where you’re solving these problems, if that’s possible.