Huge raw data for the vignettes: where should it go?

Hi, I’m working on a package for manipulating EEG data (eeguana), and the thing is that raw EEG data are huge.

I made one vignette which depends on a relatively small dataset, and it’s 68 MB. I’m not sure if I should (or can) include files that big in the package. (I’d like to add more vignettes that depend on other files as well.) But without the datasets, the vignettes can’t be built, right? Can I include pre-built vignettes that don’t need the datasets? (If so, how?) Or what is the best practice in these cases?

Thanks!
Bruno

Congrats on the package.

Is the data set available publicly somewhere? If so, I’d argue that linking folks to the data set at the beginning of the vignette is enough. Another option is to trim down your data set to the bare minimum necessary for your vignette.

Honestly though, 68 MB isn’t that big. You could probably include it in your package following the guidelines here. Experiment with different compression settings (as outlined in that link) and you might be able to get the size down even further.
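For instance, a quick way to compare compression methods is to save the same object with each and check the file sizes. This is only a sketch: the toy signal below is a stand-in for real EEG data, and actual savings will depend on your data.

```r
# Compare compression methods on the same object (toy stand-in for EEG data)
x <- rep(sin(seq(0, 2 * pi, length.out = 1000)), 100)  # repetitive signal

f_gzip <- tempfile(fileext = ".rda")
f_xz   <- tempfile(fileext = ".rda")
save(x, file = f_gzip, compress = "gzip")
save(x, file = f_xz,   compress = "xz")

file.size(f_gzip)
file.size(f_xz)

# For data already shipped in data/, tools::resaveRdaFiles("data/", compress = "xz")
# recompresses the .rda files in place.
```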


Note that CRAN has a 5 MB limit for packages, so in general you cannot include such data in a package you plan to submit to CRAN. But even if you don’t, I’d argue that keeping the package lightweight is a good idea. I would suggest making the first step in the vignette something like:

my_dir <- tempdir()
download.file("https://url.of/file",
              destfile = file.path(my_dir, "filename"),
              mode = "wb")  # "wb" avoids corrupting binary files on Windows

You can cache this chunk to reduce how often you do this as you are developing the package.
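Concretely, the download step above could live in its own cached chunk in the vignette’s .Rmd, so knitr only re-runs it when the chunk changes. The chunk label is hypothetical, and the URL and file name are placeholders as in the code above:

```r
# In the vignette .Rmd, wrap the download in a cached chunk:
# ```{r download-data, cache = TRUE}
my_dir <- tempdir()
download.file("https://url.of/file",
              destfile = file.path(my_dir, "filename"),
              mode = "wb")  # binary mode, needed on Windows
# ```
```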


Thanks for the answers. Yeah, I also read about a 1 MB limit in the R packages guidelines.
A follow-up question: if I add the download.file call, don’t I force people to download the files anyway when they install the package (or does it not work like that)? And the same goes for CRAN: is it OK to have a vignette that depends on a file that’s not included in the package?

@bnicenboim what about making a docs website on github pages or similar with pkgdown or bookdown, then just link to that from your package and CRAN page, etc.?

@sckott,
I did a pkgdown website: https://bnicenboim.github.io/eeguana/. I have my vignette there. But I don’t really follow what you mean.

What I mean is: you could just not include the vignette in the package that gets sent to CRAN, or at least not the part that requires the big file, BUT do include it in the pkgdown site.

@sckott, ahh. I thought the website was just the package vignettes and the help in another format. (I guess I can add another folder inside docs, right? And that doesn’t get installed on the user’s computer, right?)
The only problem is that this means no vignettes (since EEG files are always big).

Yes, but the docs folder isn’t (or shouldn’t be) in the package itself. I think everyone puts docs/ in .Rbuildignore: https://github.com/ropensci/monkeylearn/blob/15728eb596249ea1df7a8b9e840885d9da85a2b2/.Rbuildignore#L11 So you could have a vignette that’s in vignettes/ and included in docs/, but somehow not included in the package. Not sure how to do that, but I’d think it could be done. Beyond my pay grade though :smile:
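For the record, .Rbuildignore entries are Perl-style regular expressions matched against file paths, one per line. A sketch of what the relevant lines might look like (the vignette file name here is hypothetical):

```
^docs$
^vignettes/my-big-vignette\.Rmd$
```

Note that ignoring a file in vignettes/ this way keeps it out of the built package tarball while pkgdown can still render it for the site.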

I couldn’t! I’m facing this issue: R package structure block chunk cache functionnality · Issue #1226 · yihui/knitr · GitHub
any experience with this?

Have you considered osf.io to host the data? 68 MB seems large even for git/GitHub.

I also wonder if the piggyback package can help here.

So far, I’m storing the data on my university’s server. But the problem, maybe, is that I don’t have any vignette that can be viewed offline now; all EEG files are quite big.
In any case, I’ll check out that package.