advice on best practices for supplying data to a package function


#1

I’m developing a pkg with a colleague, and we’d like some general advice on best practices for supplying data to our custom functions (part of our discussion on this is here).

Our function takes several of the columns of a data frame to compute on. The input data come from other software, and we cannot be sure that the raw data files are exactly the same for each user of that other software (column names, file formats, etc.). I know there are some packages that do directly ingest data files, like ingestr and plater, but I’m not sure ours will receive such standardised data files.

My view is that we should supply a data frame to the function. We should let the user read in the data from their CSV file, or whatever source, and clean, subset, etc., in whatever way they want. My idea of the best practice is like this:

# user does this
raw_messy_data <- read.csv("the-raw-data-file.csv")

# the user does various things to subset and tidy the raw data
# using base, tidyverse, data.table, whatever they like
# no need to change their column names
tidy_data <- doing_various_things(raw_messy_data)

# finally the user does this, doesn't have to alter their column names
output <- our_custom_function(tidy_data,
                              col1 = "my_col1",
                              col2 = "my_col2")
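To make the idea concrete, here is a minimal sketch of how such a function could resolve the user-supplied column names internally (the function body is hypothetical; only the signature comes from the example above):

```r
# hypothetical sketch: accept a data frame plus the user's own column
# names as strings, and look the columns up inside the function
our_custom_function <- function(data, col1, col2) {
  stopifnot(is.data.frame(data))
  # fail early, with an informative message, if a named column is absent
  missing_cols <- setdiff(c(col1, col2), names(data))
  if (length(missing_cols) > 0) {
    stop("Column(s) not found in data: ",
         paste(missing_cols, collapse = ", "))
  }
  x <- data[[col1]]
  y <- data[[col2]]
  # ... compute on x and y; here we just return them ...
  data.frame(x = x, y = y)
}
```

This way the user never has to rename anything; the mapping from their names to the function's internal names happens at the call site.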

But my collaborator prefers to supply the data to the function by giving the file name directly. They would put instructions in the help file that tell the user how to edit their data to make it usable with the function. They want the function to do more for the user, but this also requires the user to edit the data outside of R first:

# user must edit the raw file to change the colnames to "col1" and "col2" 
# because these are hard-coded into the function
output <- our_custom_function("the-raw-data-file.csv")

So, we’re looking for pointers to authoritative sources on what the best practices are on this design question.

I rarely see pkg functions that directly ingest data files from a user’s hard drive, and I assume this is to give maximum flexibility to the user.

I also rarely see help pages in R pkgs that tell the user to change the column names of their data before they use it with a function.

I’ve had a look through Writing R Extensions, the rOpenSci Packages Dev Guide and Hadley’s R packages, but haven’t seen anything that directly addresses this.

My observation is that the norm for R pkgs is for the user to tell the function what to do (i.e. the user tells the function which columns to work on via arguments to the function). An example of this is in ggplot2:

# user tells the function what their data is, and what the names of the columns are for the function to work on
ggplot(data = my_data, 
       aes(x = my_column1,
           y = my_column2))

This philosophy can be contrasted with my collaborator’s, which holds that the function should tell the user what to do, via instructions to change the column names, etc., before using the data with the function. Their paradigm is something like:

# user does this, because the help file tells them they need cols called 'x' and 'y' in their data
names(my_data)[1:2] <- c('x', 'y')

# then uses the function, col names are hard coded into the function, the user has to ensure their data have those colnames 
ggplot(my_data)

I guess this is more of a convention I’ve observed in the culture of R programmers and users - to give the user maximum flexibility with a function - than a firm rule written into the manuals. And it might seem obvious to the community here what the best practice is. Anyway, I’d be most grateful for some reasons and sources about why we should prefer one approach (i.e. mine :wink: ) over the other, to help my collaborator and me decide how to proceed. Thanks!


#2

I think it is always best practice to hard code as little as possible. That makes the function more flexible later and avoids user frustration. When I say hard code, I mean things like requiring specific column names.

I would rather have the users give me a named vector, hash, or list that connects their column names to the names I use in the function. In other words, essentially feed the names in as parameters (but not necessarily col1 = “other name 1”, col2 = “other name 2”). I feel like anything else is just asking for errors and frustration, and is kind of old school.

You can have meaningful defaults so that if the users happen to use those names it works without doing the assignments, and I would also make it clear what the actual content of the columns is supposed to be (rather than using col1, col2). Does it actually matter what order the columns are in?

The reason I’d rather do a vector or list is that then your users can pull the column names in their tidy data frame.

Besides preferring abstraction, the less hand processing the better. Another option might be to have some additional preprocessing functions. So you could have a function that creates the hash/list/vector.
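A sketch of the named-vector idea suggested above (the helper name and mapping convention are my own illustration, not part of any package):

```r
# hypothetical sketch: map the user's column names to the names the
# package uses internally, via a named character vector of the form
#   c(internal_name = "user_name")
rename_for_analysis <- function(data, mapping) {
  stopifnot(is.data.frame(data), is.character(mapping),
            !is.null(names(mapping)))
  if (!all(mapping %in% names(data))) {
    stop("Column(s) not found in data: ",
         paste(setdiff(mapping, names(data)), collapse = ", "))
  }
  out <- data[, unname(mapping), drop = FALSE]
  names(out) <- names(mapping)
  out
}

# usage: the user describes their data once, then everything downstream
# can rely on the internal names
# mapped <- rename_for_analysis(my_data, c(x = "my_col1", y = "my_col2"))
```

A preprocessing helper like this keeps the mapping in one place, so the main analysis functions can assume standard names without hard-coding them into the user's files.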


#3
  • Data frames vs CSV: I think you can claim that data frames are more R-focused and thus a natural choice. Tidyverse tools tend to take an in-memory data frame as a starting assumption.
  • Column names: for some complicated implementations of specialized parametric models, hard-coded parameters can provide necessary simplifying assumptions. But for UThwigl specifically, it sounds like the column names vary on a case-by-case basis. In cases like these, I think file paths might serve as precedent.

#5

Hi Ben,

I think these are both very valid perspectives. While I completely agree with your characterization of the typical approach for working with data frames / csv files, there are also plenty of common examples in which a function reads a file that must first be formatted in a specific way, with the user told to put the data in that format using precise names. This is typically what we mean by a schema, and it can refer either to a csv file with appropriate column names, or to a different kind of plain-text file, such as a DESCRIPTION file, or any of the various yaml and other structured data files we see in rmarkdown, blogdown, pkgdown, etc. This is also seen in other data formats, such as spatial data files.

Such files are often given particular names, and sometimes particular extensions, to indicate to the user that the input cannot be just “any” csv / yml / json / xml / dcf (pick your text serialization), but rather a “special” serialization conforming to a particular schema (say, DESCRIPTION vs an arbitrary dcf), read by a custom function without additional arguments (devtools::install_deps() vs read.dcf), i.e. functions that directly ingest data files from a user’s hard drive. Admittedly, most of these do not use .csv input formats, though many probably could.

In defining a data schema, it’s also useful to have some notion of validating the file, so a user (or developer) can be sure the input conforms to the right names, object classes, etc.
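As an illustration of that validation idea, here is a minimal sketch (the function name and the schema representation, a named character vector of expected classes, are my own assumptions):

```r
# hypothetical sketch: validate that a data frame conforms to a simple
# schema given as c(column_name = "expected_class")
validate_schema <- function(data, schema) {
  stopifnot(is.data.frame(data), is.character(schema),
            !is.null(names(schema)))
  # check all required columns are present
  missing_cols <- setdiff(names(schema), names(data))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ",
         paste(missing_cols, collapse = ", "))
  }
  # check each column has the expected class
  for (col in names(schema)) {
    if (!inherits(data[[col]], schema[[col]])) {
      stop("Column '", col, "' should be of class '", schema[[col]], "'")
    }
  }
  invisible(data)
}
```

Running such a check at the top of an ingesting function gives the user an immediate, specific error instead of a puzzling failure downstream.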


#6

I agree with @elinw: flexibility should be the goal. In many cases, functions are “cleaner” and more adaptable if they accept vectors. For your example:

our_custom_function <- function(col1, col2) {
  tidy_data <- data.frame(col1, col2)
  ...
}

If most of your package’s functions take and return similar collections of data, consider making a class and adding constructor functions (e.g., myclass(col1, col2)). For example, the forecast object from the similarly named package.
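A minimal S3 sketch of that constructor idea (the class name, fields, and validity checks are all hypothetical):

```r
# hypothetical sketch: a constructor that validates its inputs and
# returns a classed object the package's other functions can dispatch on
myclass <- function(col1, col2) {
  stopifnot(is.numeric(col1), is.numeric(col2),
            length(col1) == length(col2))
  structure(data.frame(col1 = col1, col2 = col2),
            class = c("myclass", "data.frame"))
}

# a print method so users get a meaningful summary at the console
print.myclass <- function(x, ...) {
  cat("<myclass> with", nrow(x), "observations\n")
  NextMethod()
}
```

The validation lives in one place (the constructor), and downstream functions can trust any object that carries the class.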

As for directly working with files: if somebody wants to manipulate something in R, I feel it should be an object in R. So a function can read the file in as an object, but having it manipulate the file beyond that can be a problem.


#7

Thanks everyone for the quick and detailed feedback, that’s very helpful. It seems like we do not have an authoritative pronouncement (i.e. from CRAN or similar) or publication on these kinds of design choices.

I’m seeing a majority view that functions should accept R objects (data frame, list, vector, etc.) rather than read a file from the user’s disk. But reading a file directly is perhaps not as unusual as I originally thought. And we can define schemas, classes, and constructor functions to handle these cases.