I have been working on some scripts to download and load important health data sets. They seem like they might belong on ropensci.org, but I am wondering about “data-only” packages in R. CRAN’s policies frown upon large packages, and GitHub’s guidelines seem to discourage storing large data files as well. But obviously people still put large packages and datasets in both places.
The files I am thinking about involve publicly available data on FTP sites, but they need a lot of work to be usable in R. Most are SAS/Stata/SPSS data sets that have to be downloaded and processed. The datasets can be big and/or numerous: all of the NHANES data, for example, come to about 250 MB spread across several hundred .rds files after downloading, processing, and saving them, and they are even bigger as raw SAS transport (.xpt) files. Other data sets like NAMCS are harder still to deal with because of multiple directories for data and documentation, and some inconsistencies in naming.
There are other health datasets that are larger, and some that require data use agreements before they can be downloaded (e.g., SEER, the US cancer registry data).
So, I have two sets of questions. First, how do people suggest handling this? Do we create a data-only package and try to get it on CRAN? Do we use GitHub and ignore their request not to store databases and their recommendation to use Dropbox instead?
The second is whether it is possible to create a package that ships download scripts and can then import the downloaded data into itself. Or is it possible to write a script that builds a data-only package on a user’s computer? The only problem with the latter is that its name might not be unique, since it doesn’t go through CRAN.
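To make the first idea concrete, here is a minimal sketch of the download-and-cache pattern I have in mind (the function name, package name, and URL are all hypothetical): the package ships only code, fetches a processed file on first use, caches it in a per-user directory, and reads from the cache on subsequent calls.

```r
# Hypothetical accessor for one processed data file.
# Requires R >= 4.0 for tools::R_user_dir().
get_health_data <- function(file,
                            cache_dir = tools::R_user_dir("healthdata", "cache")) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache_dir, paste0(file, ".rds"))
  if (!file.exists(dest)) {
    # Placeholder URL: in practice this would point at the processed
    # files, or at the raw .xpt files plus a processing step.
    url <- paste0("https://example.org/processed/", file, ".rds")
    download.file(url, dest, mode = "wb")
  }
  readRDS(dest)
}
```

This keeps the package itself small enough for CRAN while leaving the large files on the FTP site (or a mirror), at the cost of requiring a network connection on first use.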
Any thoughts on the best process would be appreciated.