Overlap policy for package onboarding

We recently received a submission from @gshotwell of the convertr package for unit conversion. The package potentially overlaps with some others, including udunits2, weathermetrics, and datamart. While we’ve asked about overlap in our onboarding questions, we haven’t developed any specific guidance about package overlap, so I thought it would be useful to have a discussion about it here. Some questions:

  • Should overlap in functionality be a criterion for rejection?
  • Should overlap only matter if the other packages meet our package guidelines/standards? Or should meeting CRAN standards be enough?
  • Should we only be concerned about overlap with other rOpenSci packages?

It makes me think of a project I have with a colleague: comparing R packages that help convert/analyse accelerometer data before writing one ourselves. We want to first see what functionality exists, how easy the existing packages are to use, etc., before deciding whether to write a new package and what to put in it.

I guess that if the authors of a package know it overlaps with other packages, they have tested those packages? Might rOpenSci be a home for a tutorial that compares several packages? If the authors can provide a comparison of the packages, then it’d be easier to see what added value the submitted package has. And the tutorial/benchmarking itself, e.g. an R Markdown document, could be really valuable for someone looking for how best to do A, A being the thing the packages overlap on. The benchmarking could be a very small use case. The authors would have to spend time doing the benchmarking, but everyone would get a nice overview of how to do A in R. :slight_smile:

That’s a great point. I developed most of convertr without even knowing that udunits2 existed. Another thought is to have a package wishlist somewhere that people could comment on to identify gaps for other people to attempt to fill.

Some background on the submission. Initially I developed convertr, then learned about udunits2 and said basically “well, no need for this package.” Then recently I thought that unit conversion was a good use case for an RStudio add-in because that would make it easier to see what units were available (the initial version of convertr came with a shiny app, but not an add-in). After that I decided that it might be useful enough to release because there were a couple of differences in usability between the two packages. So there’s still a lot of overlap in what you can do, but some difference in how you can do it. I guess the only other thing is that convertr looks to be about twice as fast (this surprises me, but probably doesn’t matter much for this use case):

library(microbenchmark)
# One million random integer values to convert from kilograms to grams
x <- sample(1:100, 1000000, replace = TRUE)
microbenchmark(convertr::convert(x, "kg", "g"), udunits2::ud.convert(x, "kg", "g"))

> Unit: milliseconds
>                      expr      min       lq     mean   median       uq      max neval
>     convert(x, "kg", "g") 15.72549 18.37563 23.23852 20.49377 22.83835 80.07975   100
>  ud.convert(x, "kg", "g") 38.64360 45.63138 60.38057 49.13104 89.14825 99.99149   100

I think the overlap conversation should probably also include “is this package better than the alternatives?” This could be in terms of usability, speed, or just complying with the rOpenSci guidelines for package development.

Generally, I really like the idea of having a list of common tasks with a recommended package, kind of like a thewirecutter.com for R packages. This would help both in identifying package needs, and also for a new researcher figuring out which packages are useful, stable, and supported.

Anyway, thanks for taking a look at all this!

I have mixed feelings about this. In terms of overlapping software just being out there, I have no issues with that. The more the merrier, and the users really determine what is good and what is subpar. But that said, we are not PLOS One (which will accept papers for technical correctness but not novelty) and should ideally strive to add things to our suite that are not being duplicated elsewhere, or done better elsewhere (i.e., a poorer version of ggplot2 is something we wouldn’t accept).

Perhaps this could be a new question on that list. “Are you aware of other software that provides the same functionality? If yes, what is different about your implementation?”

Good idea @maelle - I guess it’d make sense if the tutorial included at least one of the rOpenSci pkgs, or at least covered a topic area we cover.

We do already have that question, I think: Q6 here https://github.com/ropensci/onboarding/blob/master/issue_template.md. He did answer it in the submission, “Convertr package: Extensive unit conversion with a shiny gadget” (Issue #40, ropensci/software-review on GitHub).

A note about licenses:

  • udunits2 has a GPL license (per its CRAN page), whereas the submission has CC0 (though MIT might be more appropriate if you want to be as open as possible with a license designed for software). Anyway, the GPL means that companies won’t touch it, so there is that advantage for a package that doesn’t have a GPL license.

@gshotwell Does your package cover more, fewer, or the same number of conversions? Are there more conversions that can be added?

Thanks for the feedback, I switched the license to MIT.

I think there’s practically the same number of conversions. convertr allows all possible conversions between 1511 different units. I’m not exactly sure how many units are in the udunits2 library because it’s made up of four different XML databases which I haven’t taken the time to parse. I have all the units from the POSC Units of Measure Dictionary v2.2 and Wikipedia. So in practice I think they have the same coverage (I think it’s a very rare case for someone to need a unit that’s not covered in convertr), but my guess is that udunits2 is more complete; I just can’t really articulate how. If I knew the gap between the two libraries, it would be easy to add the additional units to convertr.

Let me try to articulate a policy incorporating feedback so far:

An R package that replicates the functionality of an existing R package may be considered for inclusion
in the rOpenSci suite if it significantly improves on alternatives in any repository (rOpenSci, CRAN, BioC) by being:

  • More open in licensing or development practices.
  • Broader in functionality (e.g., providing access to more data sets, providing a greater suite of functions), but not only by duplicating additional packages
  • Better in usability and performance
  • Actively maintained while alternatives are poorly or no longer actively maintained

These factors should be considered as a whole to determine if the package is a significant improvement. A new package would not meet this standard only by following our package guidelines while others do not, unless this leads to a significant difference in the areas above.

This determination should be made by an editor rather than being sent to reviewers.

Thoughts?

Great thread here, and nice to see this getting spelled out better for the future. Very nice summary.

One more thing I wonder if we should address / ask is whether the author has determined if the features being proposed would be better contributed as a pull request / patch / extension of those existing packages, rather than a new approach entirely.

I do appreciate @gshotwell’s point about the external lib dependencies for udunits2 being a bit of a nuisance, but not sure if dependency issues fall under the ‘better usability’ bullet or merit a separate bullet? Or perhaps that by itself is insufficient reason and shouldn’t be a new bullet?

I like Karthik’s point and wonder if it can be reflected more in the policy – e.g. while we feel a package that doesn’t distinguish itself from overlapping alternatives may not be a good fit for the rOpenSci suite, that isn’t to discourage overlap in general, where the more-the-merrier and users can figure out what they prefer.

I think it would fall under “more open development practices”, just because it’s a bit harder to understand a wrapper than vanilla R code. I should say I was confused about how udunits2 works (I thought it was a remote API, but it actually all seems to be packaged together for local use). That said, trying to go through it to figure out “what conversion factor is being used to convert hogsheads to drams” is still pretty tricky.
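To illustrate why conversion factors are easier to trace in plain R than through a wrapper, here is a minimal sketch of a table-driven converter. This is not convertr’s actual implementation: the function name convert_mass and the small factor table are hypothetical, chosen only to show that the factor in use can be read straight off a named vector.

```r
# Hypothetical lookup table (not taken from convertr or udunits2):
# each entry is the number of grams in one of the named unit.
mass_factors <- c(mg = 1e-3, g = 1, kg = 1e3, lb = 453.59237)

# Convert x from one mass unit to another by going through the base unit (grams).
convert_mass <- function(x, from, to) {
  stopifnot(from %in% names(mass_factors), to %in% names(mass_factors))
  x * mass_factors[[from]] / mass_factors[[to]]
}

convert_mass(2, "kg", "g")  # 2000
```

With this layout, answering “which factor is used for unit X” is a single lookup such as mass_factors[["lb"]], with no compiled library or external database to inspect.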

First, an update to the policy. If this is OK I’ll update our policies document, which I’m in the process of moving into the onboarding repo in this consolidation PR.

rOpenSci encourages competition among packages, forking, and re-implementation, as they improve options for users overall. However, as we want packages in the rOpenSci suite to be our top recommendations for the tasks they perform, we aim to avoid duplicating the functionality of existing R packages in any repo without significant improvements. An R package that replicates the functionality of an existing R package may be considered for inclusion in the rOpenSci suite if it significantly improves on alternatives in any repository (rOpenSci, CRAN, BioC) by being:

  • More open in licensing or development practices.
  • Broader in functionality (e.g., providing access to more data sets, providing a greater suite of functions), but not only by duplicating additional packages
  • Better in usability and performance
  • Actively maintained while alternatives are poorly or no longer actively maintained

These factors should be considered as a whole to determine if the package is a significant improvement. A new package would not meet this standard only by following our package guidelines while others do not, unless this leads to a significant difference in the areas above.

We encourage developers whose packages are not accepted due to overlap to still consider submittal to other repositories or journals.


Now, in making a determination for convertr, I think the primary improvement it offers from a user point of view is that it does not require the installation of a C package, which is a barrier for some users, especially those using Windows. From an open development perspective, udunits2 is a thin wrapper around the C package, but the C package is also well documented and actively and openly maintained. While we are opinionated about licenses in our packages, I think any open license is fine for this determination as long as it doesn’t appear to be a barrier to development. The functionality of the packages is similar, as are the units in the databases.

In the end, I come down slightly on the side of acceptance because of the installation issue.

If the other editors agree with this, I’ll try to expedite your review, @gshotwell.

@noamross the policy LGTM

Agree based on your points. Without a C dependency, this submission is less complicated, has less code, and is presumably easier to contribute to (in theory).

I have to say that like netCDF, installing the udunits package is more than a nuisance :sob:. It requires installing the udunits software, machine-specific arguments to install.packages, and in some cases adding ‘module load’ instructions to .bashrc (which as a kicker isn’t read by emacs using tramp). For example

@dlebauer Yikes, I don’t think it should be that bad. What system are you on? On Debian/Ubuntu, you should just need to ‘apt-get install libudunits2-dev’, then install.packages("udunits2"), and be good to go. (I’m still fuzzy on when the Mac & Windows binaries bundle these things and when & how they are supposed to be installed separately.) It is a pity the package author didn’t document clearly which library to install – I do appreciate how Jeroen’s packages are always very explicit about this in the SystemRequirements section.
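For anyone wanting to check what a package declares in its SystemRequirements field, a minimal sketch using base R follows. The helper name system_requirements is hypothetical; utils::packageDescription is the standard function for reading DESCRIPTION fields, and this only works for packages that are already installed.

```r
# Return the SystemRequirements field from an installed package's
# DESCRIPTION file, or NA if the package does not declare one.
# (Hypothetical helper name; the package must already be installed.)
system_requirements <- function(pkg) {
  utils::packageDescription(pkg, fields = "SystemRequirements")
}

# Example with a package that ships with every R installation.
system_requirements("stats")
```

The field is free text written by the package author, so it tells you what to look for rather than the exact system package name on a given platform.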

Some of you know about Gabor’s really awesome work with r-hub. For instance, his sysreqs API will give you the precise names of the external dependencies for different platforms on packages like rgdal: https://sysreqs.r-hub.io/pkg/rgdal or rjags: https://sysreqs.r-hub.io/pkg/rjags (It doesn’t look like Gabor has gotten to udunits2 yet.)

I hadn’t seen r-hub. It looks interesting, but it’s not immediately clear how to use the information in those links in a script. It certainly would be great if install.packages would look up the dependency I needed and either find where it was installed or install it, instead of giving an error.

apt-get on Ubuntu works great. RedHat/CentOS and OSX are more troublesome in my experience and some things just don’t work or get overly complicated to install on Windows. Usually on RedHat computing clusters I have to request things like netcdf and udunits be installed or install them locally under /home. All of which is a barrier to using a package and a time sink.

That said udunits2 and these other packages are fantastic, which is why I am willing to jump through hoops to get them to work. But I think if install.packages('convertr') ‘just works’, it will be an improvement.

@dlebauer I understand that the sysreqs API should still be considered pretty early dev. I think Gabor is mostly building it to support the needs of installation in r-hub anyway, but it does list RedHat/CentOS and OSX brew dependency versions most of the time as well.

I agree that such external dependencies are a real kicker on remote machines / clusters where you don’t have root. That’s why I’m excited to see NSF’s newer XSEDE projects like Jetstream and Chameleon supporting virtual environments, so I can just pull my Docker container with libs already installed and ready to go. (Meanwhile I can’t even run much of my code on our campus cluster, because they denied my request to offer a more recent R version in their module, so I feel your pain.)

So yes, we do both agree that dropping the external dependency is a meaningful improvement. (though maybe there’s also an element of trust/reliability in a tool that is built around a widely used standard library instead of a custom implementation).