First of all, let me thank you for the thorough answer. Sometimes one cannot see past a small technical problem and does not realize that it could raise broader concerns for other people tackling the same situation. I did not think this would be interesting at all.
In all my projects I have usually worked with a bunch of R, Python and bash scripts orchestrated by Snakemake (which, by the way, is a very good tool). I have tried to keep a consistent folder hierarchy in all of them, and I was quite happy with that. It is not uncommon at my lab to have to look at very old projects and find oneself in a non-reproducible situation (I especially suffered an R/Bioconductor version change that affected most of my codebase). I have played with Docker and Vagrant to preserve reproducibility, but I have often found myself wondering about best practices in the field. That is when I came across the concept of a Research Compendium based on an R package structure. I would not say that I am 100% convinced by it, but I think it is definitely a step in the right direction.
Sorry about the chatter. Now, for something completely different. In my projects, I have always thought of filtering and preprocessing as steps in the analysis, because it is not uncommon to test several preprocessing algorithms on the same raw data. Thus, I would feel more comfortable having my IDATs in some kind of `data_raw` folder and then coding the whole pipeline to work from those. I would put my scripts inside the `analysis/scripts` folder and create an `analysis/Snakefile` describing the pipeline. In this approach, I guess I would leave the `data` folder empty.
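To give a concrete picture of what I mean by a pipeline step, here is a minimal sketch of one of the scripts the Snakefile could call. The folder names and the normalization choice are just what I am imagining, not anything prescribed by the compendium layout:

```r
# analysis/scripts/preprocess.R -- hypothetical pipeline step driven by the Snakefile
library(minfi)

# Read all IDAT files found under the raw-data folder into an RGChannelSet
rgset <- read.metharray.exp(base = "data_raw")

# One of the several preprocessing options we might want to compare
mset <- preprocessNoob(rgset)

# Write the derived object where the next pipeline step can pick it up
dir.create("analysis/data/derived_data", recursive = TRUE, showWarnings = FALSE)
saveRDS(mset, "analysis/data/derived_data/mset_noob.rds")
```

In this version everything derived stays under `analysis/`, which is why `data/` would remain empty.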
The problem is, I really like the idea of having tidy, clean data in the `data` folder so a user could just call the `data()` function to access it. But I still cannot figure out how one could treat preprocessing as part of the pipeline and, at the same time, generate objects in `data/` with their corresponding documentation. Maybe I am pushing the compendium beyond its limits. In the compendium paper, I read `data/` as a starting point: something documented and fixed from which the analysis begins. But my projects do not usually take that form.
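On the documentation side, as far as I can tell it would just be the standard R package mechanism, so whichever script puts an object into `data/`, the documentation is a static roxygen block in `R/data.R`. A sketch, with a made-up object name:

```r
# R/data.R -- roxygen documentation for an object shipped in data/
# (the object name "rgset_raw" is hypothetical)

#' Raw methylation intensities
#'
#' An RGChannelSet built from the IDAT files in data_raw/ by a
#' preprocessing script in analysis/scripts/.
#'
#' @format An \code{RGChannelSet} object from the minfi package.
#' @source IDAT files stored in data_raw/
"rgset_raw"
```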
Unless… I store the IDATs in a `data_raw` folder, use some scripts to load the dataset into its binary representation (an `RGChannelSet` in the minfi package), and then store that object in the `data/` folder in .RData format. Sorry, I am brainstorming here, but could this be a valid approach? I think it is something of a compromise between the options I initially described.
By the way, is it better to store raw data in `data/data_raw/` (as you have proposed) than in `data/raw_data` (as in the README of the rrtools package)?
As a final note, I would like to say that I love the idea of integrating the compendium with the Dockerfile so that continuous integration uses the Docker image to execute the workflow. That is exactly the point I would love to reach. If my oldest projects had been designed that way, I would at least have a mechanism to know whether they still work today.
The small project that raised this question is probably going to be halted for now, but I am going to look for another candidate on which to set up a proof of concept. I will try to keep you informed about my attempt.