I started working with two collaborators on an R package (https://github.com/jpshanno/ingestr) to ingest raw instrument and sensor data directly into R from its native format (or a standard txt/csv export format) and perform the wrangling necessary to return a tidy data frame. One of the issues identified at the IMCR hackathon where ingestr was conceived was the need for tools that make it easier for information managers to track data provenance from raw data to a fully documented archived/published dataset. Additionally, we hope to lower the barrier to entry for using R in a researcher’s or data manager’s workflow.
At the least, the provenance of raw sensor data should include the input source file (easy to add as a column) and any header data associated with the file (e.g. equipment serial number, download date, etc.). The second part of the provenance is where we would appreciate some feedback. At the hackathon we designed the ingest_* functions to return the dataframe and, as a side effect, assign a dataframe of header information into the parent environment. We know that is not best practice for R functions and are looking for alternatives, which is why I am posting here. We are hoping for some feedback on potential solutions before we invest time creating function templates and internal standards to use in the package.
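For anyone who wants to see the pattern in question, here is a minimal sketch of the side-effect design described above. The function name `ingest_example`, the two-line header layout, the object name `header_info`, and the sample file contents are all made up for illustration; this is not the actual ingestr code.

```r
# Hypothetical reader illustrating the current design; not the ingestr implementation.
ingest_example <- function(file, header_lines = 2) {
  raw <- readLines(file)

  # Keep the header lines, tagged with the file they came from.
  header <- data.frame(
    input_source = file,
    header_text = raw[seq_len(header_lines)],
    stringsAsFactors = FALSE
  )

  # Parse everything after the header as ordinary CSV.
  data <- read.csv(text = raw[-seq_len(header_lines)], stringsAsFactors = FALSE)
  data$input_source <- file

  # Side effect: create `header_info` in the caller's environment.
  assign("header_info", header, envir = parent.frame())

  data
}

# Made-up sensor export: two header lines followed by a CSV table.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("serial: A123", "downloaded: 2018-06-01", "time,value", "1,10.5", "2,11.0"),
  tmp
)

dat <- ingest_example(tmp)  # returns the data
header_info                 # appears in the calling environment as a side effect
```

The caller never asks for `header_info`; it simply appears, and a second call overwrites it. That is the behaviour we would like to replace.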
The ideal solution would 1) return persistent header data that can be easily linked with the data, 2) be pipe-friendly so that it works with a sister package (https://github.com/IMCR-Hackathon/qaqc_tools) or other packages, and 3) provide a friendly user experience for new R users. Some potential solutions for storing metadata with an R object have been discussed in this thread.
Solutions we have considered
- Add attributes to our dataframes and create some helper functions to easily access the data for people not familiar with attributes. The big problem here is that the attributes can be lost when the dataframe is subset. Because our goal is data provenance this isn’t a workable solution with base attributes.
- Use a data frame with non-base attributes or columns
  - The sticky package provides a way of creating persistent or ‘sticky’ attributes for a dataframe. We haven’t experimented with this package yet to see if it is pipe-friendly, or if it would cause errors if the dataframe were used in an analysis step (for example running it through lmer).
  - It may be possible to do something like the sticky geometry column in sf, but we haven’t delved into that yet.
- We could return a list with a data object and header object(s), but that makes it harder to keep the data and header info together when a user wants to manipulate or use the data.
- We could use some other data structure, such as S4 or R6. None of the three of us has much experience using either of these class systems, but if they meet the above requirements we would be happy to use them.
- Is there another option that we have completely overlooked?
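To make the first and third options a bit more concrete, here is a minimal sketch of what each might look like. The helper names `set_header()` and `get_header()`, the attribute name `"header"`, and the sample values are assumptions made up for this post, not existing ingestr functions.

```r
# Hypothetical helpers; the names are illustrative and not part of ingestr.
set_header <- function(data, header) {
  attr(data, "header") <- header
  data
}

get_header <- function(data) {
  attr(data, "header", exact = TRUE)
}

# Made-up readings and header values.
readings <- data.frame(time = 1:3, value = c(10.5, 11.0, 11.2))
header <- list(serial_number = "A123", download_date = "2018-06-01")

# Option 1: the header travels as an attribute on the data frame.
with_attr <- set_header(readings, header)
get_header(with_attr)$serial_number

# Check whether the header survived a given operation instead of assuming it did.
survived <- !is.null(get_header(with_attr["value"]))

# Option 3: data and header are returned side by side in a list.
as_list <- list(data = readings, header = header)
as_list$header$serial_number
```

With the attribute approach the user keeps working with a plain data frame, but has to check after each step that the header is still there. With the list approach nothing is silently lost, but every downstream call has to reach into `as_list$data`.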
Does anyone have any thoughts on a best-practice solution to this problem? Right now we are holding off on implementing any new file types until we have a solution in place.