What if raw data in package is too large?

peterdesmet · February 17, 2020, 9:55pm

I’m co-maintaining bioRad, a package to analyze weather radar data for biological signals (like birds). Since a large part of the functionality is analyzing weather radar data, we included a raw data file “volume.h5” in the pkg at inst/extdata, which is used in many of the examples. We’ve reduced the file size as much as we can, but we’re still exceeding the 5 MB pkg size limit for CRAN submission.

I’m thinking of removing the raw files from the pkg and publishing those separately as a data package on Zenodo.org, but that would mean that every example using those starts with downloading a file.

Any suggestions or other better approaches?

sckott · February 17, 2020, 10:49pm

An approach I like is to put data up somewhere on the web (e.g., Zenodo), then have the user as a first step download that data with a fxn foo(), then cache the data using hoardr or the underlying pkg rappdirs, and each subsequent call to foo() uses the cached data instead of downloading.
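As a rough illustration of the download-then-cache idea, here is a minimal sketch. The function name `cached_data_path()`, its arguments, and the URL are all hypothetical (the URL is a placeholder, not a real Zenodo record); only `rappdirs::user_cache_dir()` and `utils::download.file()` are existing functions. It uses rappdirs directly rather than hoardr.

```r
# Hypothetical helper: download a data file once, then reuse the cached copy.
# The default URL is a placeholder and must be replaced with a real location.
cached_data_path <- function(
    url = "https://zenodo.org/record/<record-id>/files/volume.h5",
    cache_dir = rappdirs::user_cache_dir("bioRad"),
    download = utils::download.file) {
  dest <- file.path(cache_dir, basename(url))
  if (!file.exists(dest)) {
    dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
    # Download to a temporary name in the same directory first, so an
    # interrupted download never leaves a partial file at `dest`.
    tmp <- tempfile(tmpdir = cache_dir)
    on.exit(unlink(tmp), add = TRUE)
    download(url, tmp, mode = "wb")
    file.rename(tmp, dest)
  }
  dest
}
```

The first call downloads the file; later calls find it in the cache directory and return the path straight away. The `download` argument is only there so the download step can be swapped out, for example in tests that should not touch the network.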

cboettig · February 18, 2020, 12:51am

Also a big fan of publishing the data to a location like Zenodo and downloading to a local dir (e.g. rappdirs location) like Scott says. A couple additional thoughts:

For testing, I tend to include a ‘mock’ (e.g. head() or first 100 rows etc) version of the data in /inst/extdata.
You probably want to add logic to avoid downloading a copy multiple times. You can check if the file already exists in the rappdirs location, and skip downloading if it does. You could also consider an approach like pins (https://pins.rstudio.com).
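The ‘mock’ data idea could look something like the sketch below. It assumes tabular data that fits in a data frame (an HDF5 file like volume.h5 would need a different way of subsetting), and `write_mock()` as well as the file and package names in the comments are made up for illustration.

```r
# Hypothetical one-off helper (e.g. run from a script in data-raw/):
# write the first `n` rows of a large table as a small file for tests.
write_mock <- function(full, path, n = 100) {
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  utils::write.csv(utils::head(full, n), path, row.names = FALSE)
  invisible(path)
}

# Assumed usage when building the package:
#   write_mock(full_data, "inst/extdata/mock.csv")
# and in tests, once the package is installed:
#   mock <- read.csv(system.file("extdata", "mock.csv", package = "yourpkg"))
```

Tests and examples then run against the small file shipped with the pkg, and only the full analyses need the downloaded copy.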

(side note – Zenodo downloads tend to be slower than something in an Amazon S3 store… CERN has limited bandwidth to the US apparently. But not bad for a one-time download from a versioned, archival source!)

peterdesmet · February 18, 2020, 2:08pm

Thanks a lot! I’ll look into this.

Beems1 · February 19, 2020, 9:56pm

thank you a lot for your extremely helpful post as i need info for my vegetarian project. i really found it useful. was thinking if i may ask you some questions if you can help me. thanks.
