rOpenSci and standard scientific data formats

Hello all, I am interested in finding rOpenSci contributors who use the Hierarchical Data Format (HDF5) and in learning how they use it. A package called rhdf5 provides a general interface to HDF5 in the Bioconductor community. Are there other similar packages used in the rOpenSci community?

Hi Ted,

Good question. It’s certainly come up before, and hopefully others who use the format will chime in. On the CRAN side there used to be the hdf5 package, which appears to have been archived as unmaintained and more recently replaced by the h5 package; I’m not sure how that compares to the rhdf5 implementation from Bioconductor. There’s also RNetCDF, which I seem to recall us discussing before. IIRC netCDF is supposed to be compatible with / designed around HDF5, at least in its current implementation, but there were some issues with RNetCDF only supporting an earlier version or something. I could have that all wrong, or maybe it has since been fixed; Scott or others might know more.

Personally I’m curious about the metadata structure in HDF5, and about rOpenSci exploring interoperability between that representation and similar things like https://github.com/ropenscilabs/datapkg, https://github.com/ropensci/datapack, and others.

Carl,

Nice to hear from you and thanks for the pointers. I will try to track down the author of h5 and take a look at the netCDF stuff at some point…

The metadata structures in HDF are very flexible. Attributes and groups with attributes can be added to any object in the file. I have been doing some work focused on adding metadata in any XML dialect into native HDF groups and attributes and then being able to extract it from the HDF file back into XML. At present this requires XSLT2, which is a bit of a challenge in some situations…
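To make the idea concrete, here is a rough sketch of the mapping only, not the XSLT2 workflow mentioned above. It assumes each XML element becomes a group, XML attributes become group attributes, and element text is kept in a hypothetical `_text` attribute. Plain Python dicts stand in for HDF5 groups so that it runs with the standard library alone, and the sample XML is made up for illustration.

```python
import xml.etree.ElementTree as ET

TEXT_KEY = "_text"  # hypothetical attribute name used to hold element text


def xml_to_group(elem):
    """Turn an XML element into a dict standing in for an HDF5 group."""
    attrs = dict(elem.attrib)
    text = (elem.text or "").strip()
    if text:
        attrs[TEXT_KEY] = text
    return {
        "name": elem.tag,
        "attrs": attrs,
        # A list keeps sibling order and allows repeated tag names.
        "groups": [xml_to_group(child) for child in elem],
    }


def group_to_xml(group):
    """Rebuild an XML element from the dict produced by xml_to_group."""
    attrs = dict(group["attrs"])
    text = attrs.pop(TEXT_KEY, None)
    elem = ET.Element(group["name"], attrs)
    elem.text = text
    for child in group["groups"]:
        elem.append(group_to_xml(child))
    return elem


# Made-up sample record, not taken from any real metadata dialect.
SAMPLE = (
    '<dataset id="d1"><title>Example title</title>'
    '<creator role="author">A. Person</creator></dataset>'
)

tree = xml_to_group(ET.fromstring(SAMPLE))
round_trip = ET.tostring(group_to_xml(tree), encoding="unicode")
print(round_trip == SAMPLE)
```

In a real HDF5 file the dicts would become groups and their `attrs` would become HDF5 attributes (in h5py, for example, via `create_group` and `.attrs`). Group names must be unique within a parent, so repeated sibling elements would need something like an index suffix. Mixed content, comments, and namespaces are not handled here.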

I took a quick look at datapkg and datapack. I am working with Matt and Peter on an NSF project related to metadata evaluation and improvement…

What do you have in mind?

Ted

Hi Ted,

Sounds interesting! Yup, I’m interested in interoperability of data and associated metadata in general. Flexible is great, but it can be a challenge if it ends up being functionally equivalent to unstructured. The HDF / XML translation stuff sounds interesting, though like you say we lack particularly robust XSLT support in R, at least (there’s Duncan’s Sxslt on Omegahat, but nothing on CRAN).

I don’t know a whole lot about HDF5, but on the NetCDF front, I believe ncdf4 (https://cran.rstudio.com/web/packages/ncdf4/) is the go-to package for that format; at least it has more reverse imports/suggests than RNetCDF (https://cran.rstudio.com/web/packages/RNetCDF/).

@sckoot - There is some confusion on many fronts about the relationship between NetCDF4 and HDF5, so I wanted to add a bit of information to this discussion. There is actually no separate netCDF4 file format. It is all HDF5. So, when people are using netCDF4, they are getting to the actual data through the HDF5 library with the netCDF conventions on top. It would be interesting to understand the details in the R case. In the Python world there was some recent work aimed at streamlining access to netCDF data: https://github.com/shoyer/h5netcdf
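One way to see this for yourself, as a minimal sketch and not something the netCDF or HDF5 libraries provide: look at the first bytes of a file. HDF5 files (and therefore netCDF4 files) start with an 8-byte signature, while classic netCDF files start with `CDF` followed by a version byte. The function names below are made up for illustration, and the sketch only checks offset 0, so an HDF5 file with a user block in front of the signature would be reported as unknown.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # first 8 bytes of an HDF5 file

# Magic numbers of the pre-HDF5 netCDF formats.
CLASSIC_MAGIC = {
    b"CDF\x01": "netCDF classic",
    b"CDF\x02": "netCDF 64-bit offset",
}


def classify_header(head):
    """Classify a file from its leading bytes (hypothetical helper)."""
    if head[:8] == HDF5_SIGNATURE:
        # netCDF4 files land here: on disk they are HDF5 files.
        return "HDF5"
    return CLASSIC_MAGIC.get(head[:4], "unknown")


def sniff_container(path):
    """Read the first 8 bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return classify_header(fh.read(8))
```

Calling `sniff_container` on a file written by a netCDF4-aware tool should report `"HDF5"`, whereas an older classic-format file should report one of the netCDF classic variants.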

Thanks for the clarification!

[quote="tedhabermann, post:6, topic:414"]
In the python world there was some recent work aimed at streamlining access to netCDF data: GitHub - h5netcdf/h5netcdf: Pythonic interface to netCDF4 via h5py
[/quote]

Interesting. Do you think a similar effort in the R world makes sense?

Scott - An interesting question. There seems to be more diversity in R-world than in Python-world, for whatever reason. In Python everyone is trying to coalesce around h5py (e.g. https://hdfgroup.org/wp/2015/09/python-hdf5-a-vision/). If the current model is working well, it might not make sense to tweak it. At the same time, there may be some limitations that come along with the netCDF data model… and there may be performance issues as data sets become larger (always a can of worms). There are also challenges related to specific conventions that people use on top of netCDF, e.g. the CF conventions do not support groups and are fairly limited in metadata land. As always, the devil is in the details…