Hello everybody,
I am currently working as a bioinformatician at a molecular biology lab. I am very interested in reproducible research in general, and I find the research compendium idea to be quite an important building block for the development of better data analysis methodologies. Thus, I wanted to test the idea myself on a small data analysis collaboration I am currently working on.
Most of my work is based on the analysis of DNA methylation microarrays. The raw data I usually receive consists of a general CSV file describing the samples along with their clinical information, and a series of IDAT files (a proprietary binary format from Illumina) containing the output from the scanner. There are many papers in the literature describing analysis pipelines for this type of project, and most of them agree on several initial steps involving data preprocessing, quality checking and probe filtering based on different criteria.
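To make the discussion concrete, this is roughly what those initial steps look like in my hands, sketched with minfi (the directory path, the normalisation method and the detection threshold are just placeholder assumptions):

```r
library(minfi)

# Read the IDAT files into an RGChannelSet (base directory is assumed).
rgset <- read.metharray.exp(base = "data_raw/idat", recursive = TRUE)

# Preprocess to a MethylSet (Noob is just an example; any method would do).
mset <- preprocessNoob(rgset)

# Filter probes with poor detection p-values (0.01 is an illustrative cutoff).
detp <- detectionP(rgset)
detp <- detp[rownames(mset), ]
keep <- rowSums(detp < 0.01) == ncol(detp)
mset <- mset[keep, ]

# Beta values: the real starting point of the downstream analysis.
beta <- getBeta(mset)
```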
On the rrrpkg page describing a research compendium, the data folder is reserved for raw data in standard formats, that is, the formats an R package expects there: .tsv, .R, etc. At first, and based on Pakillo’s template, I decided to create a data_raw folder where I could store my raw data, and then write some R scripts for cleaning it and storing the result in the data folder.
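My first attempt at such a cleaning script, just as a minimal sketch (file and column names are made up for the example):

```r
# data_raw/01_tidy_samples.R  (hypothetical file name)

# Read the sample sheet shipped alongside the IDAT files.
samples <- read.csv("data_raw/samples.csv", stringsAsFactors = FALSE)

# Minimal tidying: consistent column names and obvious type fixes.
names(samples) <- tolower(names(samples))
samples$sample_id <- as.character(samples$sample_id)

# Store the tidy object where an R package expects it.
save(samples, file = "data/samples.rda")
# devtools::use_data(samples, overwrite = TRUE) achieves the same thing.
```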
The problem is that analysis of this type of data usually starts from beta values, which are indicative of DNA methylation and are obtained after the aforementioned preprocessing and filtering. These steps are much easier to tackle working with R binary representations of the data than trying to load it from standard CSV files. So I am trying to decide whether to:
- Put the preprocessing and filtering code in the data_raw scripts, treating it as a necessary step to create the real raw data that will be the starting point of the analysis project. In this case, all the preprocessing and filtering logic would live in the data_raw folder as R scripts. I am also unsure whether these steps should go into a Makefile in the analysis folder. And there is yet another problem: entries in the data folder should be documented, but since they would be generated dynamically, that documentation would need a bit of manual editing every time we change the filtering criteria or the preprocessing method.
- Simply use the R scripts in the data_raw folder to clean and tidy our inputs and put them in the data folder, and then do all preprocessing and filtering with scripts in the analysis folder. This is appealing, since sometimes we are testing whether a certain preprocessing method is more suitable than another.
- Use the scripts in the data_raw folder to create a standard representation of the minimal data needed to recreate the binary objects required for preprocessing and filtering. In my case, that means generating CSV (or comparable standard) files for the Red and Green signals of the arrays. From there, I can work onwards as long as I am able to reconstruct the binary objects (a sketch of this idea follows below).
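For that third option, this is the kind of round trip I have in mind, again as a rough sketch with minfi (getRed()/getGreen() to dump the signals, the RGChannelSet() constructor to rebuild them; file names are made up):

```r
library(minfi)

# Dump the raw Red/Green intensities to a standard text representation.
rgset <- read.metharray.exp(base = "data_raw/idat")
write.csv(getRed(rgset),   "data_raw/red_signal.csv")
write.csv(getGreen(rgset), "data_raw/green_signal.csv")

# Later, rebuild the binary object from the CSV files alone.
red   <- as.matrix(read.csv("data_raw/red_signal.csv",   row.names = 1, check.names = FALSE))
green <- as.matrix(read.csv("data_raw/green_signal.csv", row.names = 1, check.names = FALSE))
rgset_rebuilt <- RGChannelSet(Green = green, Red = red)
# Note: the array annotation (platform/manifest) would also have to be
# recorded and set again for downstream minfi functions to work.
```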
It is nothing more than a technical issue, but it has made me think about what “raw data” really means in my projects, and I would much appreciate any hints, help or personal use cases that could help me draw the line between the original data and its analysis.
Thank you very much in advance,
Gustavo F. Bayon