Creating Persistent Metadata for an R Object for Data Provenance

Thank you all for your replies! There are some great points to think on, and after taking them all together I’m starting to get an idea of the direction(s) we could go. I know that returning a list of dataframes is probably the easiest option, and the one most users would be familiar with. But the primary use case will likely be importing a sensor or instrument’s data in order to do some QAQC and then roll into exploration and analysis. Personally, a list with only one element containing sensor data would not be the output I would want returned in that case, because I would just end up pulling that dataframe out for the next steps. Keeping the data and the header data in separate dataframes solves that problem, but creates a new one: retaining the link between them. In reality I’m starting to think this is an issue we can only take so far, and it is up to the user to follow some best practices to keep the data and the header data linked and accessible as they work towards archiving (we envisioned EML and a public repository as the far end of this pipeline).

Drawing on all three of your suggestions I can see two approaches: store the header data in a temporary file that can be linked back to the data with helper functions, or make the header data a sticky attribute of the input_source column (which holds the file name of the original raw data ingested).
With the first option we could have an ‘ingest_header(input_source)’ function (or some such) that loads the temporary file. The temporary file would be created by the original ingest_* call and placed in a temp directory named using information about the session, which would ensure that ingest_header(input_source) only works if ingest_*(input_source) was called during the same session.
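To make that concrete, here is a minimal sketch of what I’m imagining. ingest_csv() stands in for any ingest_* function, the helper names and file-naming scheme are pure placeholders, and the session scoping comes for free from tempdir() being per-session in R:

```r
# Sketch of the temp-file approach; ingest_csv() stands in for any ingest_*
# function and the helper names are hypothetical.

.header_path <- function(input_source) {
  # One header file per input source, scoped to this R session via tempdir()
  file.path(tempdir(), paste0("header_", basename(input_source), ".rds"))
}

ingest_csv <- function(input_source, header_lines = 5) {
  raw <- readLines(input_source)
  header <- raw[seq_len(header_lines)]
  data <- read.csv(text = raw[-seq_len(header_lines)], stringsAsFactors = FALSE)
  data$input_source <- input_source

  # Stash the header so it can be retrieved later in the same session
  saveRDS(header, .header_path(input_source))
  data
}

ingest_header <- function(input_source) {
  path <- .header_path(input_source)
  if (!file.exists(path)) {
    stop("No header found for '", input_source,
         "'; was ingest_*() called on it during this session?")
  }
  readRDS(path)
}
```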
After playing around with the ‘sticky’ package, it seems like a viable option if we assign the header data as an attribute of the input_source column (making the whole dataframe sticky doesn’t help, as the column names are currently lost when subsetting). We could make some simple helper functions to extract the header data into a new data frame, or to write it to a file, which would help users who aren’t familiar with attributes. This column shouldn’t be manipulated in place by the user, and as long as that doesn’t happen it looks like a pretty stable option.
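Roughly what I have in mind (just a sketch; ‘header_info’, the helper names, and the assumption that the header is a character vector of raw lines are all placeholders):

```r
# Sketch of the attribute approach, assuming the header is a character vector
# of raw header lines; 'header_info' and the helper names are placeholders.
library(sticky)

attach_header <- function(data, header) {
  attr(data$input_source, "header_info") <- header
  # sticky() keeps the attribute through subsetting of the column
  data$input_source <- sticky(data$input_source)
  data
}

# Helpers for users who aren't comfortable digging through attributes
extract_header <- function(data) {
  header <- attr(data$input_source, "header_info")
  if (is.null(header)) stop("No header_info attribute found on input_source.")
  header
}

write_header <- function(data, path) {
  writeLines(extract_header(data), con = path)
}
```

In theory the header then rides along with the data frame itself rather than sitting in the environment as a second object, which is the appeal over the temp-file route.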

I will probably try to incorporate a vignette or at least a wiki page in the repo for now while I explore some of the options.

A few individual responses:
@cboettig, the relational database model is definitely how we were thinking of the problem (most users will have many rows of data from a relatively small number of sensors). I’m hung up on the best way to keep that structure without requiring a database, or without prescribing an entire workflow.

@technocrat, I like the idea of saving the header data and keeping it mapped to the R object, and I think that is something we may be able to build on while avoiding popping an unexpected object into the environment. My original post wasn’t entirely clear about our anticipated use case, which simplifies some of the complexities you point out: each call to an ingest_* function will pull in raw data from a single sensor or instrument data file, so we’re assuming no pre-processing (in fact that’s what we’re trying to get rid of, so that everything is documented in a user’s code) and only one source for each row of data.

@isteves, the bind_rows problem would definitely be an issue; I haven’t thought of a good way around it (aside from suggesting users avoid it and providing an alternative). I’m not familiar with dataspice; hopefully it has some inspiration for us too. I like the lists approach in the case you describe, where the metadata describes the data. In our case the metadata generally describes the sensor, not the data, so I see it as a bit of a different problem.
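As a first stab at the “providing an alternative” part, maybe a thin wrapper that collects the header_info attributes before binding and re-attaches them afterwards; again just a sketch, assuming the sticky-attribute approach above and that each piece comes from a single source file:

```r
# Sketch of a bind_rows() wrapper that carries header_info attributes through
# the bind; assumes each piece comes from a single input_source. All names
# here are placeholders, not a settled API.
library(dplyr)

bind_ingested <- function(...) {
  dfs <- list(...)

  # Collect header_info from each piece, keyed by its source file name
  headers <- lapply(dfs, function(df) attr(df$input_source, "header_info"))
  names(headers) <- vapply(dfs, function(df) unique(df$input_source)[1], character(1))

  out <- bind_rows(dfs)
  # Re-attach the collected headers as a single named attribute on the column
  attr(out$input_source, "header_info") <- headers
  out
}
```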