What if the raw data in a package is too large?

:wave: I’m co-maintaining bioRad, a package to analyze weather radar data for biological signals (like birds). Since a large part of the functionality is analyzing weather radar data, we included a raw data file “volume.h5” in the pkg at inst/extdata, which is used in many of the examples. We’ve reduced the file size as much as we can, but we’re still exceeding the 5 MB package size limit for CRAN submission.

I’m thinking of removing the raw files from the pkg and publishing them separately as a data package on Zenodo.org, but that would mean that every example using them starts with downloading a file. :man_shrugging:

Any suggestions or other better approaches?

An approach I like is to put the data up somewhere on the web (e.g., Zenodo), then have the user download it as a first step with a function foo() and cache it using hoardr or the underlying pkg rappdirs, so that each subsequent call to foo() uses the cached data instead of downloading again.
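
A minimal sketch of that download-then-cache pattern, in base R so it runs without extra packages. The names `cached_file`, `mypkg`, and the `fetch` argument are hypothetical, and `tools::R_user_dir()` (R >= 4.0) stands in for the cache location; `rappdirs::user_cache_dir()` or hoardr could fill the same role. It assumes the URL ends in the file name.

```r
# Hypothetical helper: download a data file once, then reuse the cached copy.
# `fetch` defaults to utils::download.file but can be swapped out (e.g. in tests).
cached_file <- function(url,
                        cache_dir = tools::R_user_dir("mypkg", which = "cache"),
                        fetch = utils::download.file) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache_dir, basename(url))  # assumes URL ends in the file name

  if (!file.exists(dest)) {
    # Download under a temporary name first, so an interrupted transfer
    # does not leave a partial file that looks like a valid cache entry.
    tmp <- paste0(dest, ".part")
    fetch(url, tmp, mode = "wb")  # "wb" keeps binary files intact on Windows
    file.rename(tmp, dest)
  }

  dest  # path to the cached file, whether just downloaded or already present
}
```

An example would then start with something like `path <- cached_file("<URL of the published file>")`, and only the first call on a given machine would hit the network.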

Also a big fan of publishing the data to a location like Zenodo and downloading to a local dir (e.g. rappdirs location) like Scott says. A couple additional thoughts:

  1. For testing, I tend to include a ‘mock’ (e.g. head() or first 100 rows etc) version of the data in /inst/extdata.

  2. You probably want to add logic to avoid downloading a copy multiple times: check whether the file already exists in the rappdirs location, and skip the download if it does. You could also consider an approach like pins (https://pins.rstudio.com).
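
For point 1, a rough sketch of cutting a small mock file from a larger table. The data frame `full` and the file name `mock.csv` are made up for illustration, and this only applies directly to tabular data; a binary format such as HDF5 would need its own way of subsetting.

```r
# Hypothetical stand-in for the complete dataset.
full <- data.frame(id = seq_len(1000), value = sqrt(seq_len(1000)))

# Keep only the first 100 rows as the mock version.
mock <- head(full, 100)

# Written to a temp dir here; in a package this would go to inst/extdata/mock.csv.
out <- file.path(tempdir(), "mock.csv")
write.csv(mock, out, row.names = FALSE)

# In tests, the installed copy could then be located with something like:
#   system.file("extdata", "mock.csv", package = "mypkg")
```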

(side note – Zenodo downloads tend to be slower than something in an Amazon S3 store… CERN has limited bandwidth to the US apparently. But not bad for a one-time download from a versioned, archival source!)

Thanks a lot! I’ll look into this.
