Creating Persistent Metadata for an R Object for Data Provenance

I started working with two collaborators on an R package (https://github.com/jpshanno/ingestr) to ingest raw instrument and sensor data directly into R from its native format (or a standard txt/csv export format) and perform the wrangling necessary to return a tidy data frame. One of the needs identified at the IMCR hackathon where ingestr was conceived was tooling for information managers that makes it easier to track data provenance from raw data to a fully documented, archived/published dataset. Additionally, we hope to lower the barrier to entry for using R in a researcher’s or data manager’s workflow.

At a minimum, the provenance of raw sensor data should include the input source file (easy to add as a column) and any of the header data associated with the file (e.g. equipment serial number, download date, etc.). The second part of the provenance is where we would appreciate some feedback. At the hackathon we designed the ingest_* functions to return the dataframe and then assign a dataframe of header information into the parent environment. We know that is not best practice for R functions and are looking for alternatives, which is why I am posting here. We are hoping for some feedback on potential solutions before we invest time creating function templates and internal standards to use in the package.

The ideal solution would 1) return persistent header data that can be easily linked with the data, 2) be pipe-friendly to work with a sister package (https://github.com/IMCR-Hackathon/qaqc_tools) or other packages, and 3) provide a friendly user experience for new R users. Some potential solutions to storing metadata with an R object have been discussed in this thread.

Solutions we have considered

  1. Add attributes to our dataframes and create some helper functions to easily access the data for people not familiar with attributes. The big problem here is that the attributes can be lost when the dataframe is subset (see the short demonstration after this list). Because our goal is data provenance, this isn’t a workable solution with base attributes.
  2. Use a data frame with non-base attributes or columns
    1. The sticky package provides a way of creating persistent or ‘sticky’ attributes for a dataframe. We haven’t experimented with this package yet to see if it is pipe-friendly, or if it would cause errors if the dataframe were used in an analysis step (for example running it through lmer).
    2. It may be possible to do something like the sticky geometry column in sf, but we haven’t delved into that yet.
  3. We could return a list with a data object and header object(s), but that makes it harder to keep the data and header info together when a user wants to manipulate or use the data.
  4. We could use some other data structure, such as S4 or R6. None of the three of us have much experience using either of these classes, but if they meet the above requirements we would be happy to use them.
  5. Is there another option that we have completely overlooked?
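
For anyone unfamiliar with the subsetting problem in option 1, here is a minimal demonstration with made-up column names and header values; base R’s `[` method for data frames rebuilds the object and silently drops non-standard attributes:

```r
# Illustrative only: a fake sensor table with header info stored as an attribute
df <- data.frame(datetime = Sys.time() + 0:2, temp_c = c(21.3, 21.5, 21.4))
attr(df, "header") <- list(serial_number = "A1234", download_date = "2018-07-20")

attr(df, "header")         # the header list is there...
attr(df[1:2, ], "header")  # ...but row subsetting with `[` returns NULL: attribute dropped
```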

Does anyone have any thoughts on a best practices solution to this problem? Right now we are holding off on implementing any new file types until we have a solution in place.


This is a great question and one which I have thought a lot about, though I certainly don’t have all the answers.

I would strongly advise against solution 1. See Hadley’s remarks on metadata and this thread, https://twitter.com/i_steves/status/1017569725340151809.

I’d also recommend against 4; I don’t think this is a good use case for S4 or R6 (e.g. you don’t need inheritance or reference-class structures, and besides there are, imho, few good reasons to use S4). See EML & RNeXML for my, um, battle scars.

I do appreciate the value of having a single object where the bits travel together, particularly with pipes, and I think the sf model (and possibly similar ideas in tidygraph and ideas from Michael Sumner) are promising, but I think it’s also possible to overstate these values.

Going back to Hadley’s comments, it’s worth remembering that the relational database model is really pretty damn amazing: powerful, simple, and well tested. That is, data can be split over two separate tables which can refer to each other using foreign keys. If you want one table, just join them – this creates duplication (i.e. metadata gets repeated over every observation), but it tends not to be a problem unless the data are huge, in which case you want a relational database anyway.

I have gone down some other roads (maybe I’m still going down some – RDF all the data! Use JSON-LD!) but I think that with a well-considered relational data model, with separate tables for metadata and observations joined by foreign keys, you cannot go too far wrong. Unlike list-column tricks, this approach works across just about all tools and languages, and when the data gets big nothing will give you better performance. (Note that I’m not saying ‘use a database’; data.frames are fine, you just probably want two of them, or maybe more depending on the metadata.)
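
For concreteness, here is a minimal sketch of the two-table approach, with made-up column names; the source file name acts as the foreign key, and a join recreates the single flat table only when you need it:

```r
library(dplyr)

# One row per observation, keyed by the raw file it came from
observations <- tibble::tibble(
  input_source = "logger_A1234_2018-07-20.csv",
  datetime     = Sys.time() + 0:2,
  temp_c       = c(21.3, 21.5, 21.4)
)

# One row per source file, holding the header metadata
header <- tibble::tibble(
  input_source  = "logger_A1234_2018-07-20.csv",
  serial_number = "A1234",
  download_date = as.Date("2018-07-20")
)

# Join only when a flat table is convenient; the metadata is simply
# repeated across the matching observations
left_join(observations, header, by = "input_source")
```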

There are weaknesses to the relational database model, particularly if your data structures are going to evolve a lot over time (different instruments, etc.), but that is a challenge for most other approaches as well.


I’ll be pretentious and assert that this is a problem in high-dimensional space:

  1. Data payload, which may be geocoded x, y, z, m (where m is one or more attributes)

  2. Data provenance, which may be a combination of sources

  3. Data pre-processing, not all of which may necessarily be reflected in the Rmd or other literate programming documentation.

  4. Versioning of the above

One Ring to Rule Them All is a lot to ask.

I see four principal approaches:

  1. A toolchain with heavy reliance on %>% with mandatory literate programming to keep track of the dynamic pieces

  2. Enhanced use of attributes, with operator overloading to force them to travel with the data through any copy or subset.

  3. Object-oriented with all operations on the content controlled by methods of or called by the class – S3, S4 or RC (see Wickham’s Advanced R, Ch 7)

  4. Encapsulating all the functionality in separate modules in a package

None of these are casual undertakings.

I do have some suggestions (mainly bad examples to be improved):

Import data with headers as the first row. Create a hash mapping to your working column names, save the hash to a serialized file or URL, and put the reference in a column (wasteful but simple). Optionally, edit the hash file by hand to add any additional information that is too variable to parse automatically.

Alternatively, save it and your original data frame with a dummy key field and do an inner join. Or save the hash in an RDA and load it. Or have a dummy record (first or last) with a dummy field holding the hash, to be filtered out by default.
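
A rough sketch of the "serialize the header and keep a reference in a column" idea, with made-up file names and fields; repeating the reference in every row is wasteful but keeps the link explicit:

```r
header <- list(serial_number = "A1234", download_date = "2018-07-20")

# Serialize the header once and keep only the path in the data
header_file <- file.path(tempdir(), "logger_A1234_header.rds")
saveRDS(header, header_file)

data <- data.frame(
  datetime   = Sys.time() + 0:2,
  temp_c     = c(21.3, 21.5, 21.4),
  header_ref = header_file
)

# Recover the header later from the stored reference
readRDS(unique(data$header_ref))
```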

sp is an S4 package with slots for CRS, coordinates, and data attributes. If you will be dealing with spatial data, it’s worth finding out whether SpatialPolygonsDataFrame objects can accommodate multiple data frames and how easy that is to do.

Finally, within R, I very much doubt you’ll find much in the way of best practices on data governance unless you’re fortunate enough to be working with OAI-PMH-tagged data.

File under FWIW. Glad to help on discrete tasks while I’m between jobs.


We had similar discussions when working on metajam, a package for downloading data from repositories in the DataONE network.

  1. We tried this, too, and quickly gave up because we kept losing the metadata.
  2. We tried 2.1 (using sticky), but depending on what function we applied to the data frame, we still lost attributes (seemed to work with base R functions, but not with tidyverse functions like bind_rows). We did not look into 2.2 so :woman_shrugging:
  3. We ended up going for a version of this approach. In our case, we broke down our problem into two functions:
    a. download_d1_data - which downloads the data and associated metadata into a folder. Most data files ended up with up to three metadata CSVs describing attributes (column descriptions, units, etc.), factors (e.g. CA = California, HI = Hawaii), and general metadata (author, geographic coverage, abstract, etc.).
    b. read_d1_files - which then looks at the folder created by (a) and returns a list of tibbles (a rough sketch of this pattern follows the list). That way, the user is in full control of assignment and doesn’t have variables popping into their environment out of nowhere.
  4. Didn’t go into these options too much.
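
As an illustration of the pattern in (b), with invented contents (see metajam for the real behaviour), the read function returns a single named list and the user decides what to assign:

```r
# Rough shape of the object a read_d1_files()-style function might return
d1_files <- list(
  data = tibble::tibble(site = c("CA", "HI"), temp_c = c(21.3, 25.1)),
  attribute_metadata = tibble::tibble(
    attributeName = c("site", "temp_c"),
    definition    = c("State code", "Air temperature"),
    unit          = c(NA, "celsius")
  ),
  summary_metadata = tibble::tibble(
    name  = c("author", "abstract"),
    value = c("J. Doe", "Example sensor deployment")
  )
)

# Nothing appears in the user's environment until they ask for it
my_data <- d1_files$data
```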

I would echo Carl and also encourage the relational approach. In the end, our goal was to implement things in a way that would minimize the amount we would have to teach. Scientists are generally familiar with joins and lists (as long as they’re not too deeply nested), so I would say that those are the safest approaches.

A lot of our thought/development process was also influenced by the dataspice package.

Thank you all for your replies! There are some great points to think on, and after taking them all together I’m starting to get an idea of the direction(s) we could go. I know that returning a list of dataframes is probably the easiest option, and one that most users would be familiar with. But the primary use case will likely be importing a sensor or instrument’s data in order to do some QAQC and then roll into exploration and analysis. Personally, a list in which only one element contains the sensor data would not be the output I would want returned in that case, because I would just end up pulling that dataframe out for the next steps. Keeping the data and the header data in separate dataframes solves that problem, but creates the new one of retaining the link. In reality I’m starting to think this is an issue we can only take so far, and it is up to the user to follow some best practices to keep the data and the header data linked and accessible as they work towards archiving (we envisioned EML and a public repository as the far end of this pipeline).

Drawing on all three of your suggestions, I can see two approaches: store the header data in a temporary file that can be linked back to the data with helper functions, or make the header data a sticky attribute of the input_source column (which holds the file name of the original raw data ingested).
With the first option we could have an ingest_header(input_source) or some such function that loads the temporary file (sketched below). The temporary file would be created by the original ingest_* call and placed in a temp directory that is named using information about the session. That would make sure that ingest_header(input_source) would only work if ingest_*(input_source) was called during the same session.
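
A hypothetical sketch of that first option; none of these functions exist in ingestr yet, and the naming scheme is purely illustrative. Because tempdir() is unique to the current session, the header can only be recovered in the session that created it:

```r
# Where a given source file's header would live for this session
header_path <- function(input_source) {
  file.path(tempdir(), paste0("ingestr_header_", basename(input_source), ".rds"))
}

# Called inside ingest_*(): stash the header keyed by the source file
store_header <- function(header, input_source) {
  saveRDS(header, header_path(input_source))
  invisible(header)
}

# Called by the user later in the same session
ingest_header <- function(input_source) {
  readRDS(header_path(input_source))
}
```
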
After playing around with the sticky package, it seems like a viable option if we assign the header data as an attribute of the input_source column (making the whole dataframe sticky doesn’t help, as the column names are currently lost when subsetting). We could make some simple helper functions to extract the header data into a new data frame, or to write it to a file, which would help users who aren’t familiar with attributes. This column shouldn’t be manipulated in place by the user, and as long as that doesn’t happen it looks like a pretty stable option.
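
A rough sketch of that second option using sticky, with a made-up helper name (extract_header() is not part of ingestr); the header travels on the input_source column and survives row subsetting:

```r
library(sticky)

data <- data.frame(
  input_source = "logger_A1234_2018-07-20.csv",
  datetime     = Sys.time() + 0:2,
  temp_c       = c(21.3, 21.5, 21.4),
  stringsAsFactors = FALSE
)

# Attach the header to the input_source column and mark the column sticky
attr(data$input_source, "header") <- data.frame(serial_number = "A1234",
                                                download_date = "2018-07-20")
data$input_source <- sticky(data$input_source)

# A small helper so users never have to touch attr() themselves
extract_header <- function(x) attr(x$input_source, "header")

extract_header(data[1:2, ])  # header still retrievable after subsetting
```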

I will probably try to incorporate a vignette or at least a wiki page in the repo for now while I explore some of the options.

A few individual responses:
@cboettig, the relational database model is definitely how we were thinking of the problem (most users will have many rows of data from a relatively small number of sensors). I’m hung up trying to decide the best way to keep that structure without a database, or without outlining an entire workflow.

@technocrat, I like the idea of saving the header data and keeping it mapped to the R object, and I think that is something we may be able to build off of, avoiding popping an unexpected object into the environment. My original post wasn’t entirely clear about our anticipated use case, which simplifies some of the complexities you point out. Each call to an ingest_* function will pull in raw data from a single sensor or instrument data file, so we’re assuming no pre-processing (in fact that’s what we’re trying to get rid of, so that everything is documented in a user’s code), and only one source for each row of data.

@isteves, the bind_rows problem would definitely be an issue; I haven’t thought of a good way around it (aside from suggesting not to use it and providing an alternative). I’m not familiar with dataspice; hopefully it has some inspiration for us too. I like the lists approach in the case you describe, where the metadata describe the data. In our case the metadata generally describes the sensor, not the data, so I see it as a bit of a different problem.