Best practice for documenting raw data in a package?


via Jakub Nowosad on Twitter

Thoughts, community?


I think it’s pretty safe to say that “best practice” for documenting data is to use something machine-readable, in a plain text format. What do others think about this statement?

Now, what that machine-readable thing is and where it goes is a great space for discussion…

I would recommend that raw data be documented using the EML package. This package produces machine-readable XML metadata, all from within R, according to the Ecological Metadata Language XML schema, which is in wide use around the world. There are many other XML metadata formats that would be reasonable to use, but having the EML R package lets us stay within the R ecosystem, which I think is a benefit. And EML is a really powerful and flexible metadata schema.

As for the how, I think it makes sense to produce one EML XML file for each dataset included in your package, and I guess a good place for it would be right next to the data file(s) inside the package.
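
To make that concrete, here is a rough sketch of what writing one EML file next to one data file might look like. It assumes a recent version of the EML package (2.0 or later, where metadata is built from plain R lists). The file name plots.csv, its two columns, the creator "Jane Doe", the package ID, and the ".eml.xml" naming convention are all made up for illustration, and a temporary directory stands in for something like inst/extdata/ in a real package.

```r
# Sketch only: assumes the EML package (>= 2.0) is installed.
library(EML)

# Hypothetical raw data file; a temp directory stands in for inst/extdata/.
data_dir <- file.path(tempdir(), "extdata")
dir.create(data_dir, showWarnings = FALSE, recursive = TRUE)
csv_path <- file.path(data_dir, "plots.csv")
write.csv(
  data.frame(plot_id = c("A1", "A2"), biomass = c(12.5, 9.8)),
  csv_path,
  row.names = FALSE
)

# One row of metadata per column in the data file.
attributes <- data.frame(
  attributeName       = c("plot_id", "biomass"),
  attributeDefinition = c("Plot identifier", "Dry biomass per plot"),
  formatString        = c(NA_character_, NA_character_),
  definition          = c("Plot identifier", NA_character_),
  unit                = c(NA_character_, "gram"),
  numberType          = c(NA_character_, "real"),
  stringsAsFactors    = FALSE
)
attribute_list <- set_attributes(
  attributes,
  col_classes = c("character", "numeric")
)

# Describe the data file itself.
data_table <- list(
  entityName        = basename(csv_path),
  entityDescription = "Example plot-level biomass measurements",
  attributeList     = attribute_list
)

# Hypothetical creator and contact.
person <- list(individualName = list(givenName = "Jane", surName = "Doe"))

eml_doc <- list(
  packageId = "plots-metadata-example",  # made-up identifier
  system    = "local",
  dataset   = list(
    title     = "Example plot biomass data",
    creator   = person,
    contact   = person,
    dataTable = data_table
  )
)

# Write the metadata right next to the data file: plots.csv -> plots.eml.xml
eml_path <- sub("\\.csv$", ".eml.xml", csv_path)
write_eml(eml_doc, eml_path)

# Check the result against the EML schema.
eml_validate(eml_path)
```

Keeping the XML file beside the CSV means anyone browsing the package source finds the two together, and the metadata can be read back into R with read_eml().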

I’m sorta steeped in this world and I’m hoping others will have some quite different ideas about how to do this.


At OS Codefest, @sckott, I, and others planned the rOpenSci datapack package to be a container for data in R that would include documentation in multiple formats. We also planned to make those data easily loadable in R using lazy-loading, but that work has yet to be done – see for some use cases that people were thinking of. I’m totally with Bryce on the EML path being a good one, but I also think it would be good to be able to associate any metadata document with the data files, which is what datapack allows. See I’d love to get the lazy loading feature working.