Feedback Wanted on rOpenSci Package Registry

The rOpenSci registry

Most rOpenSci packages eventually go to CRAN, though some are on Bioconductor. So rOpenSci packages are listed on CRAN/BioC, but many are not yet on either platform, and some may never be, for various reasons. And even for the packages that are on CRAN/BioC, nothing there actually marks them as being part of rOpenSci.

We keep track of rOpenSci packages in a GitHub repo at https://github.com/ropensci/roregistry within a single large JSON file. We can’t simply assume all repositories in the ropensci or ropenscilabs GitHub organizations are R packages, because some are not, and some are abandoned. This registry JSON file helps us know what packages are in the rOpenSci suite, some basic metadata about them, who maintains each one, and more.

Here’s an example entry:

{
  "name": "rfishbase",
  "type": "package",
  "maintainer": "Carl Boettiger",
  "email": "cboettig@gmail.com",
  "status": "good",
  "installable": true,
  "ropensci_category": "data-access",
  "category": "biology",
  "on_cran": true,
  "on_bioc": false,
  "cran_archived": false,
  "url": "https://github.com/ropensci/rfishbase",
  "root": "",
  "fork": false,
  "description": "Access any fish data from Fishbase.org, including occurrence records, habitat data, and more",
  "badges": []
}

Time for a change?

I (Scott) maintain this registry file manually. As you can imagine, it can easily fall out of sync with the true state of rOpenSci packages, and that gets more likely as our suite of packages grows. I think we’d be better off if we automatically pulled in data from CRAN or elsewhere, or had package maintainers submit PRs to update the registry.

Rather than maintaining one huge JSON file, I like the idea of maintaining a separate file for each R package (though we would most likely still generate the single file, just automatically). A separate file for each package would make it easier for people other than me to contribute.
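To make the per-package idea a bit more concrete, here is a minimal Python sketch of splitting one combined registry into one file per package and then regenerating the combined file from those pieces. The top-level `{"packages": [...]}` layout, the function names `split_registry` and `build_registry`, and the `examplepkg` entry are all assumptions for illustration; the real registry file may be shaped differently.

```python
import json
import tempfile
from pathlib import Path


def split_registry(registry, out_dir):
    """Write one JSON file per package entry and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for entry in registry["packages"]:  # assumed top-level key
        path = out_dir / f"{entry['name']}.json"
        path.write_text(
            json.dumps(entry, indent=2, sort_keys=True) + "\n", encoding="utf-8"
        )
        paths.append(path)
    return paths


def build_registry(in_dir):
    """Regenerate the combined registry from the per-package files."""
    entries = [
        json.loads(path.read_text(encoding="utf-8"))
        for path in Path(in_dir).glob("*.json")
    ]
    # Sort by name so the generated file is stable between runs.
    return {"packages": sorted(entries, key=lambda e: e["name"].lower())}


# Round trip on a tiny made-up registry ("examplepkg" is hypothetical).
registry = {
    "packages": [
        {"name": "rfishbase", "type": "package", "on_cran": True},
        {"name": "examplepkg", "type": "package", "on_cran": False},
    ]
}
with tempfile.TemporaryDirectory() as tmp:
    paths = split_registry(registry, tmp)
    rebuilt = build_registry(tmp)
```

With something like this, a contributor would only touch their own package's file in a PR, and the single large file would become a build artifact rather than something edited by hand.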

Carl and others, including our interns, have been making good progress on introducing codemeta.json files to rOpenSci packages. Only a small share of packages have them so far, but it’s getting there. The work consists of adding entries to the DESCRIPTION file, like https://github.com/ropensci/RNeXML/blob/master/DESCRIPTION#L82-L84, and adding a codemeta.json file, like https://github.com/ropensci/RNeXML/blob/master/codemeta.json, with pretty rich metadata about the package. It’s possible we can use the DESCRIPTION files with the added metadata fields and/or the codemeta files instead of the custom JSON you see above.
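As a rough illustration of how a codemeta file could stand in for part of the custom JSON, here is a small Python sketch that maps a parsed codemeta.json to a few of the registry fields shown earlier. It assumes the codemeta file uses the `name`, `description`, `codeRepository` and `maintainer` properties, with `maintainer` being either one person object or a list of them carrying `givenName`, `familyName` and `email`. Real files may differ, and the function name `registry_entry_from_codemeta` is made up for this example.

```python
def _as_list(value):
    """Normalize a JSON-LD value that may be missing, single, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def registry_entry_from_codemeta(codemeta):
    """Derive a partial registry entry from a parsed codemeta.json dict."""
    maintainers = _as_list(codemeta.get("maintainer"))
    first = maintainers[0] if maintainers else {}
    name_parts = [first.get("givenName"), first.get("familyName")]
    return {
        "name": codemeta.get("name", ""),
        "maintainer": " ".join(part for part in name_parts if part),
        "email": first.get("email", ""),
        "url": codemeta.get("codeRepository", ""),
        "description": codemeta.get("description", ""),
    }


# Hand-written sample input reusing values from the registry entry above;
# it is not copied from an actual codemeta.json file.
sample = {
    "name": "rfishbase",
    "description": "Access any fish data from Fishbase.org",
    "codeRepository": "https://github.com/ropensci/rfishbase",
    "maintainer": {
        "givenName": "Carl",
        "familyName": "Boettiger",
        "email": "cboettig@gmail.com",
    },
}
entry = registry_entry_from_codemeta(sample)
```

Fields such as `status`, `ropensci_category` or `on_cran` have no obvious counterpart in this sketch, so they would have to come from somewhere else.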

Given

  1. the increasing chance of the registry being out of sync with reality, and
  2. the appeal of single files per package as opposed to one huge file, and
  3. the arrival of codemeta files,

perhaps we should change how we construct our registry.

Feedback

What do you think? Should I keep maintaining the registry myself? Should we make something that’s all automated? Should we get all rOpenSci packages to have codemeta files, then use those? Should the registry not be a single repo, but a so to speak decentralized registry made up of codemeta files in each repo? Do you have any other ideas that may help guide us here?


cc @cboettig since we talked about this earlier

What are all the uses of the rOpenSci package registry? I see ropkgs and rostats. Does it also power the rOpenSci Packages page?

This seems like a very solid idea on both fronts. Keeping rOpenSci package information more up to date seems like it would benefit maintainers and users. And keeping each package’s registration information in a separate JSON file makes more sense as rOpenSci continues to grow. I can’t think of a downside to moving to separate JSON files per package.

If the registry can get as much information as possible from the codemeta.json file in each package’s repo, then maintainers might find that highly preferable to submitting PRs to the registry. I’m all for that. This would also help increase adoption of codemeta, which I like. Without leveraging codemeta, you’d be asking package maintainers to duplicate information across two very similar JSON files, which seems like an anti-pattern.

It looks like what I’m lobbying for is decreasing but maintaining curation of the registry by increasing automation a few notches.

Should we get all rOpenSci packages to have codemeta files, then use those?

Yes

Should the registry not be a single repo, but a so to speak decentralized registry made up of codemeta files in each repo?

Yes, though a centralized registry can be maintained automatically for ease of querying.

I note there’s some useful information that may not currently be part of the codemeta.json files, which we should figure out how to incorporate or be able to generate by querying other sources. First, whether the package is hosted on CRAN or BioC. Second, the package’s peer-review status (we had discussed this being part of the codemeta schema; I’m not sure what the status of that is).

yes, those 3, it does indeed get used for our packages page

true, we do want to reduce any duplication

I’m also hoping for more automation :slight_smile:

Right, to automate this we could query Gabor’s crandb API - and maybe BioC has a web service?
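Here is a hedged Python sketch of what that automation could look like for the `on_cran` field. It assumes crandb serves package metadata as JSON at `https://crandb.r-pkg.org/<package>` and answers with HTTP 404 for names it does not know; both are assumptions to check against the crandb documentation. The helper names `fetch_crandb` and `with_cran_status` are made up, and this does not try to detect archived packages.

```python
import json
import urllib.error
import urllib.request

# Assumed endpoint pattern for crandb; verify before relying on it.
CRANDB_URL = "https://crandb.r-pkg.org/{name}"


def fetch_crandb(name, timeout=10):
    """Return the parsed crandb record for a package, or None on HTTP 404."""
    url = CRANDB_URL.format(name=name)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise  # other HTTP errors should not be read as "not on CRAN"


def with_cran_status(entry, fetch=fetch_crandb):
    """Return a copy of a registry entry with on_cran set from the lookup."""
    updated = dict(entry)
    updated["on_cran"] = fetch(entry["name"]) is not None
    return updated
```

Passing the lookup in as the `fetch` argument keeps the update logic testable without network access, and a BioC lookup could be plugged in the same way if a suitable web service exists.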

@cboettig thoughts on this?