Data-only packages


#1

I have been working on some scripts to download and load important health data sets. It seems like they might belong on ropensci.org, but I am wondering about “data-only” packages in R. CRAN’s guidelines seem to frown upon large packages, and GitHub’s policies seem to suggest the same. But obviously people still put large packages and datasets in both places.

The files I am thinking about involve publicly available data on FTP sites, but they need a lot of work to be usable in R. Most are SAS/Stata/SPSS data sets that have to be downloaded and processed. The datasets can be big and/or numerous. All of the NHANES data, for example, come to about 250 MB spread across several hundred .rds files after downloading, processing, and saving them; they are bigger as raw SAS transport (.xpt) files. Other data sets like NAMCS are even harder to deal with due to multiple directories for data and documentation, and some inconsistencies in naming.

There are other health datasets that are larger, and some that require data use agreements before they can be downloaded (e.g., SEER, the US cancer registry data).

So, I have two sets of questions. First, how do people suggest handling this? Do we create a data-only package and try to get it on CRAN? Do we use GitHub and ignore its request not to store databases and its recommendation to use Dropbox?

The second is whether it is possible to create a package that has download scripts and somehow also has the ability to import that data into the package. Or is it possible to write a script that creates a data-only package on a user’s computer? The only problem with the latter is that its name might not be unique, since it doesn’t go through CRAN.

Any thoughts on the best process would be appreciated.


#2

I just saw that ropensci.org does not address government or health data. I did not see that when I looked at the home page.

However, if anyone has thoughts about the above, I would love to hear it. My questions are not healthcare specific, despite my examples being so.


#3

Hi Mark,

In general it is preferable to create packages that access, process, and tidy the canonical forms of the data, rather than redistributing the data directly in the R package. Why not just have a package that includes R functions for accessing and processing the data from the FTP sites? Presumably you can script the entire workflow for accessing & preprocessing? (I assume you’re familiar with the haven package for processing SAS files?)
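As a rough sketch of that suggestion: an access function ships with the package instead of the data. The URL and file name below are placeholders, and `read.xport()` from the base-R foreign package is one way to read SAS transport (.xpt) files.

```r
# Sketch: fetch one data file on demand and return a data frame,
# rather than bundling the data inside the package.
library(foreign)  # read.xport() reads SAS transport (.xpt) files

get_nhanes_file <- function(file = "DEMO_G.xpt",
                            base_url = "https://example.gov/nhanes/") {
  tmp <- file.path(tempdir(), file)
  if (!file.exists(tmp)) {
    # mode = "wb" because .xpt is a binary format
    download.file(paste0(base_url, file), tmp, mode = "wb")
  }
  read.xport(tmp)  # returns a data.frame
}
```

A real package would add tidying steps after the read, but the shape is the same: functions, not data, are what get distributed.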

Understandably you cannot automate the ones that require data agreements; but presumably those agreements exclude public redistribution anyway (otherwise the agreement would be pointless), so packaging that data directly is a moot point. The best you can do is provide scripts that automate processing the data after a user has completed the agreement.

Generally we want to preserve & distribute the scripts involved in cleaning the data even if the data were small enough to include in the R package directly. I’m not sure why you want to write scripts that create a "data-only package"; I think what you want is to create an R package with no data, only functions that access and process the data.

While it doesn’t include the download step (since it packages data from many files that are not even available on FTP servers), I think the baad package recently described at length on the ropensci blog is a good example of this general approach: https://ropensci.org/blog/2015/06/03/baad/

Maybe Scott has done an ftp-files based package already that would also be a good template?

Cheers,

Carl


#4

Hi Carl:

Thanks very much. I was thinking of just doing a scripting package. In fact, my scripts are already on GitHub. So, I am in complete agreement. I also just read the baad link you provided. That is completely consistent with my aims (and presented in a much more organized fashion). So, maybe just having some processing scripts is good enough.

Just to clarify my issue, my main goal was to make the process of managing and analyzing all of the data easier. I was thinking that writing a package would be easier if the data were in a known structure and a known location. Hence, the question about making a data-only package. (Or finding some other way to address the data in an unambiguous way.) NHANES is particularly troublesome with about 100 files per year, and 7 years of data. The variables can change year-to-year as well, and the file names are not perfectly consistent.

Just to answer your questions: haven is great, but does not work on SAS transport files (NHANES). For NAMCS, the raw data are fixed-width files with SAS input statements. In both cases the raw data need some processing: to read a file in properly, one has to parse the SAS input file and then use that information to process the data file. For SEER, it is a combination of fixed-width files and SAS input statements. (Thank goodness for the readr package for fixed-width files.)
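To make that workflow concrete, here is a minimal sketch of the readr side, once the column positions have been pulled out of a SAS INPUT statement. The positions, column names, and sample lines below are invented for illustration, not a real NAMCS layout.

```r
# Sketch: a fixed-width layout, as it might be parsed from a SAS
# statement like "INPUT AGE 1-2 SEX 3 VISITS 4-6;" (hypothetical).
library(readr)

positions <- fwf_positions(
  start     = c(1, 3, 4),
  end       = c(2, 3, 6),
  col_names = c("age", "sex", "visits")
)

# Two made-up records in that layout
tmp <- tempfile()
writeLines(c("34M012", "07F003"), tmp)

df <- read_fwf(tmp, col_positions = positions)
# df is a tibble with columns age, sex, and visits
```

The hard part in practice is the parsing of the SAS input file into `start`/`end` vectors; once that is scripted, `read_fwf()` does the rest.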


#5

Yeah, here’s an example in a function in the rnoaa package https://github.com/ropensci/rnoaa/blob/master/R/storms.R of getting data from an FTP server. We check for a file already on the user’s disk in a certain location and, if it is not there, download the data, then do any downstream processing. I do the caching in the case where the data are quite large; for small data, caching is not really worth it since fetching the data is quick.
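The check-then-download caching pattern described above can be sketched in a few lines. The function and argument names here are illustrative, not rnoaa’s actual API.

```r
# Sketch: return cached data if present, otherwise download it first.
get_cached <- function(file, url, cache_dir = tempdir()) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(cache_dir, file)
  if (!file.exists(path)) {
    # cache miss: fetch once; later calls hit the local copy
    download.file(url, path, mode = "wb")
  }
  readRDS(path)  # any downstream processing would follow here
}
```

A package would typically use a persistent per-package cache directory rather than `tempdir()`, so the cache survives between R sessions.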

@markdanese I agree that a package that includes all code to process the data is the way to go. I’d urge you to make a package rather than a set of scripts if you plan on other people using it, if possible.


#6

Also see the recent package stationaRy, which, similar to rnoaa, does a lot of FTP requests and pre-processes the data before serving it to the user.


#7

Thanks for the suggestions. I put the first version of the package on GitHub to make it more accessible. There are still quite a few things to work out, but it does load and work. I will try for CRAN in a few weeks, after I have some time to get it to a place where it might get through CRAN. Between the packages written for ROpenSci and Hadley’s book on R Packages, I was able to solve a number of issues.


#8

@markdanese That’s great you got it up on GitHub. Is this your first CRAN submission? If so, let me know if you need any help with CRAN.


#9

Thanks Scott. Will do – I have never written a package, so this will be a first. Probably a few weeks before I fix up the examples and get a good vignette written. But I will definitely ask for help.


#10

:thumbsup: happy to help


#11

Thanks @cboettig for your thoughts! Is this still the way you recommend? Or is there any more recent post elsewhere?