I’m developing a pkg with a colleague, and we’d like some general advice on best practices for supplying data to our custom functions (part of our discussion on this is here).
Our function takes several of the columns of a data frame to compute on. The input data come from other software, and we cannot be sure that the raw data files are exactly the same for each user of that other software (colnames, file formats, etc.). I know there are some packages that directly ingest data files, like ingestr and plater, but those work with standardised file formats, and I’m not sure our package can assume such standardised input.
My view is that we should supply a data frame to the function. We should let the user read in the data from their CSV file, or whatever source, and clean, subset, etc., in whatever way they want. My idea of the best practice is like this:
# user does this
raw_messy_data <- read.csv("the-raw-data-file.csv")
# the user does various things to subset and tidy the raw data
# using base, tidyverse, data.table, whatever they like
# no need to change their column names
tidy_data <- doing_various_things(raw_messy_data)
# finally the user does this, doesn't have to alter their column names
output <- our_custom_function(tidy_data,
                              col1 = "my_col1",
                              col2 = "my_col2")
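For concreteness, here is a minimal sketch of what this interface might look like inside the package. The body is a toy computation (summing the two columns), and the validation step is my assumption about what we would want, not existing code:

```r
# Hypothetical sketch: accept a data frame plus column names as strings,
# and validate before computing. The sum is only a placeholder computation.
our_custom_function <- function(data, col1, col2) {
  stopifnot(is.data.frame(data))
  missing_cols <- setdiff(c(col1, col2), names(data))
  if (length(missing_cols) > 0) {
    stop("Columns not found in data: ", paste(missing_cols, collapse = ", "))
  }
  data[[col1]] + data[[col2]]
}
```

The appeal of this design is that the user’s column names never need to change; the function adapts to the data rather than the other way around.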
But my collaborator prefers to supply the data to the function by giving the file name directly. They would put instructions in the help file telling the user how to edit their data to make it usable with the function. They want the function to do more for the user, but this also requires the user to edit the data outside of R to make it usable with the function:
# user must edit the raw file to change the colnames to "col1" and "col2"
# because these are hard-coded into the function
output <- our_custom_function("the-raw-data-file.csv")
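For comparison, a minimal sketch of what the file-path interface might look like (the name `our_custom_function_file` and the toy body are mine, for illustration only). Note that the hard-coded column names mean the function fails unless the user has already renamed their columns outside of R:

```r
# Hypothetical sketch: read the file ourselves and require fixed colnames.
# The sum is only a placeholder computation.
our_custom_function_file <- function(path) {
  dat <- read.csv(path)
  if (!all(c("col1", "col2") %in% names(dat))) {
    stop("Input file must contain columns named 'col1' and 'col2'")
  }
  dat$col1 + dat$col2
}
```

This design also bypasses any cleaning or subsetting the user might want to do between reading and computing, which is part of my concern with it.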
So, we’re looking for pointers to authoritative sources on what the best practices are on this design question.
I rarely see pkg functions that directly ingest data files from a user’s hard drive, and I assume this is to give maximum flexibility to the user.
I also rarely see help pages in R pkgs that tell the user to change the column names of their data before they use it with a function.
I’ve had a look through Writing R Extensions, the rOpenSci Packages Dev Guide and Hadley’s R packages, but haven’t seen anything that directly addresses this.
My observation is that the norm for R pkgs is for the user to tell the function what to do (i.e. the user tells the function which columns to work on via arguments to the function). An example of this is in ggplot2:
# user tells the function what their data is, and what the names of the columns are for the function to work on
ggplot(data = my_data,
       aes(x = my_column1,
           y = my_column2))
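ggplot2 achieves this unquoted-name interface with non-standard evaluation. A package can offer something similar using base R alone; this is a hypothetical sketch using `substitute()`/`deparse()` (again with a toy body), not code from our package:

```r
# Hypothetical sketch: capture unquoted column names, ggplot-style,
# then look them up in the data frame. The sum is a placeholder computation.
our_custom_function_nse <- function(data, col1, col2) {
  c1 <- deparse(substitute(col1))
  c2 <- deparse(substitute(col2))
  stopifnot(all(c(c1, c2) %in% names(data)))
  data[[c1]] + data[[c2]]
}

# the user writes bare column names, as in aes():
# our_custom_function_nse(my_data, my_column1, my_column2)
```

Either way (strings or unquoted names), the key point is the same: the user tells the function which columns to use via arguments, rather than renaming their data to suit the function.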
This philosophy can be contrasted with my collaborator’s, in which the function tells the user what to do, via instructions to change the column names, etc., before using the data with the function. Their paradigm is something like:
# user does this, because the help file tells them they need cols called 'x' and 'y' in their data
names(my_data)[1:2] <- c('x', 'y')
# then uses the function; the col names are hard-coded into the function,
# so the user must ensure their data has those colnames
ggplot(my_data)
I guess this is more of a convention I’ve observed in the culture of R programmers and users — to give the user maximum flexibility with a function — than a firm rule written into the manuals. And it might seem obvious to the community here what the best practice is. Anyway, I’d be most grateful for some reasons and sources about why we should prefer one approach (i.e. mine) over the other, to help my collaborator and me decide how to proceed. Thanks!