Huge raw data for the vignettes: where should it go?

Hi, I’m working on a package for manipulating EEG data (eeguana), and the thing is that raw EEG data are huge.

I made one vignette which depends on a relatively small dataset, and it’s 68 MB. I’m not sure if I should (or can) include files that big in the package. (I’d like to add more vignettes that depend on other files as well.) But without the datasets, the vignettes can’t be built, right? Can I include pre-built vignettes that don’t need the datasets? (If so, how?) Or what is the best practice in these cases?

Thanks!
Bruno

Congrats on the package.

Is the data set available publicly somewhere? If so, I’d argue that linking folks to the data set at the beginning of the vignette is enough. Another option is to trim down your data set to the bare minimum necessary for your vignette.

Honestly though, 68 MB isn’t that big. You could probably include it in your package following the guidelines here. Experiment with different compression settings (as outlined in that link) and you might be able to get the size down even further.
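For instance, a quick way to compare compression methods is to save the same object with each and check the file sizes. This is only a sketch: the toy signal below is a stand-in for real EEG data, and actual savings will depend on your data.

```r
# Compare compression methods on the same object (toy stand-in for EEG data)
x <- rep(sin(seq(0, 2 * pi, length.out = 1000)), 100)  # repetitive signal

f_gzip <- tempfile(fileext = ".rda")
f_xz   <- tempfile(fileext = ".rda")
save(x, file = f_gzip, compress = "gzip")
save(x, file = f_xz,   compress = "xz")

file.size(f_gzip)
file.size(f_xz)

# For data already shipped in data/, tools::resaveRdaFiles("data/", compress = "xz")
# recompresses the .rda files in place.
```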


Note that CRAN has a 5 MB limit for packages, so in general you cannot include such data in a package you plan to submit to CRAN. But even if you don’t, I’d argue that keeping the package lightweight is a good idea. I would suggest making the first step in the vignette something like:

my_dir <- tempdir()
download.file("https://url.of/file",
              destfile = file.path(my_dir, "filename"),
              mode = "wb")  # "wb" avoids corrupting binary files on Windows

You can cache this chunk to reduce how often you do this as you are developing the package.
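Concretely, the download step above could live in its own cached chunk in the vignette’s .Rmd, so knitr only re-runs it when the chunk changes. The chunk label is hypothetical, and the URL and file name are placeholders as in the code above:

```r
# In the vignette .Rmd, wrap the download in a cached chunk:
# ```{r download-data, cache = TRUE}
my_dir <- tempdir()
download.file("https://url.of/file",
              destfile = file.path(my_dir, "filename"),
              mode = "wb")  # binary mode, needed on Windows
# ```
```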


Thanks for the answers. Yeah, I also read about a 1 MB limit in the R packages guidelines.
A follow-up question: if I add the download.file call, don’t I force people to download the files anyway when they install the package (or does it not work like that)? And the same goes for CRAN: is it OK to have a vignette that depends on a file that’s not included in the package?

@bnicenboim what about making a docs website on github pages or similar with pkgdown or bookdown, then just link to that from your package and CRAN page, etc.?

@sckott,
I did a pkgdown website: https://bnicenboim.github.io/eeguana/. I have my vignette there. But I don’t really follow what you mean.

What I mean is: you could just not include the vignette in the package that gets sent to CRAN, or at least not the part that requires the big file, BUT do include it in the pkgdown site.

@sckott, ahh. I thought the website was just the package vignettes and the help in another format. (I guess I can add another folder inside docs, right? And that doesn’t get installed on the user’s computer, right?)
The only problem is that this means no vignettes (since EEG files are always big).

Yes, but the docs folder isn’t (or shouldn’t be) in the package itself. I think everyone puts docs/ in .Rbuildignore: https://github.com/ropensci/monkeylearn/blob/15728eb596249ea1df7a8b9e840885d9da85a2b2/.Rbuildignore#L11 So you could have a vignette that’s in vignettes/ and included in docs/, but somehow not included in the package. Not sure how to do that, but I’d think it could be done. Beyond my pay grade though :smile:
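For the record, .Rbuildignore entries are Perl-style regular expressions matched against file paths, one per line. A sketch of what the relevant lines might look like (the vignette file name here is hypothetical):

```
^docs$
^vignettes/my-big-vignette\.Rmd$
```

Note that ignoring a file in vignettes/ this way keeps it out of the built package tarball while pkgdown can still render it for the site.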

I couldn’t! I’m facing this issue: R package structure block chunk cache functionnality · Issue #1226 · yihui/knitr · GitHub
any experience with this?

Have you considered osf.io to host the data? 68 MB seems large even for git/GitHub.

I also wonder if the piggyback package can help here.

So far, I’m storing the data on my university’s server. But the problem, maybe, is that I don’t have any vignette that can be viewed offline now; all EEG files are quite big.
In any case, I’ll check out that package.