DataPackageR or datastorr?

I received this question via email:

I just watched your talk from NYC R on data packages…I’m trying to build a data package as part of two different ongoing projects, and was wondering which of the two packages you recommend - datastorr or datapackager? The latter seems to be on ropensci, but the former isn’t? Just wondering what you recommend for best practices at present, and or if there are any big differences I should be aware of as I know this space seems to be moving pretty rapidly.

Also, do you know if either of these will work with a private repo? I

My response:

DataPackageR is a great solution for data at the package-scale, which is the first design pattern from my talk. It helps create packages where the data is inside the package, and is updated by re-installing the package. This tends to be fine for smallish data. If the package ultimately goes to CRAN, it would have a 5MB size limit. On GitHub there are larger size limits (100MB, I think), but I think in general people don’t expect packages to be vehicles for huge amounts of data.

datastorr is designed to work with larger data sets (tens of MB to 2 GB). The package you create with it remains small and only needs to be installed once. Users fetch/sync versions of the data as needed, and those versions are stored separately from the package on the user’s hard drive.

I believe both work with private GitHub repos. Your users will need to set up a GitHub token to install/use the packages in either case. (Send them here for instructions:

DataPackageR also helps with the whole workflow and is more fully documented. datastorr leaves you to figure out a few more things on your own.

DataPackageR has gone through peer review, so it has the more robust set of quality checks. datastorr has not, though it was created by rOpenSci colleagues and I use it a fair bit.

If your data exceeds expected user RAM size (1-2GB, probably), you may want to consider an approach that uses a database back-end. We don’t have as full of a turn-key solution here, but it involves pieces like arkdb (ropensci/arkdb on GitHub), which archives and unarchives databases as flat text files, and we have in-development examples like taxadb (ropensci/taxadb), a taxonomic database, and citesdb (ropensci-archive/citesdb, now archived), a high-performance database of shipment-level CITES trade data.
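As a rough illustration of the size-based reasoning above, here is a minimal Python sketch. The function name `recommend_approach` and its `targets_cran` parameter are hypothetical, made up for this example. The cutoffs are just the approximate figures from this reply (5MB for CRAN, roughly 100MB for GitHub, about 2GB for user RAM), not hard limits from any of the packages. The ranges also overlap in practice: for a few tens of MB, either DataPackageR or datastorr could be reasonable.

```python
def recommend_approach(size_mb: float, targets_cran: bool = False) -> str:
    """Map an approximate data size in MB to one of the options discussed above.

    Hypothetical helper for illustration only; thresholds are rough guesses
    taken from the reply, not limits enforced by any package.
    """
    if size_mb < 0:
        raise ValueError("size_mb must be non-negative")

    cran_limit_mb = 5       # CRAN package size limit mentioned in the reply
    github_limit_mb = 100   # GitHub figure, quoted in the reply with "I think"
    ram_limit_mb = 2000     # upper end of the "1-2GB" user RAM estimate

    # Data shipped inside the package must fit where the package will live.
    package_limit_mb = cran_limit_mb if targets_cran else github_limit_mb

    if size_mb <= package_limit_mb:
        return "DataPackageR"
    if size_mb <= ram_limit_mb:
        return "datastorr"
    return "database back-end (e.g. arkdb)"


if __name__ == "__main__":
    # Print the suggestion for a few example sizes.
    for size in (3, 50, 500, 5000):
        print(size, "MB:", recommend_approach(size))
```

Treat the result as a starting point for the conversation rather than a rule; things like update frequency and who your users are matter as much as raw size.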


Another rOpenSci data packaging solution that we maintain is:

datapack provides a BagIt-based serialization format that is consistent with the Research Data Alliance recommendations on data packaging. It also interoperates with the DataONE network of data repositories (via the dataone package), so it can be used with those to store even large data packages. These repositories have the benefit of being archival-quality trusted repositories that have contingency plans in place to ensure long-term preservation of the data (unlike GitHub, which has size constraints and lacks explicit archival provisions, cf. Google Code). Having just returned from RDA’s 13th plenary last week, I can attest that there is a lot of discussion within RDA and the data repository community overall about repository interoperability specifications that are relevant here.

