rOpenSci onboarding: package fit

One of the criteria we use for packages submitted to our onboarding repo is how well they fit. See the guidelines on fit at https://github.com/ropensci/onboarding/blob/master/policies.md#package-fit

The policies linked above list a number of areas that we consider in scope, or a good fit. In brief bulleted form:

  • data retrieval (from APIs, data storage services, journals, and other remote servers). The data retrieved must have a scientific application; merely wrapping an API that serves data does not meet our criteria. (e.g., rplos)
  • data extraction tools that aid in retrieving data from unstructured sources such as text, images and PDFs. (e.g., pdftools)
  • data visualization (interactive graphics in R that extend beyond base and ggplot2). (e.g., plotly)
  • data deposition into research repositories, including metadata generation. (e.g., zenodo)
  • data munging (in the context of the tools described above; generic tools such as reshape2 and tidyr do not fit this criterion). Geospatial tools fall under this category. (e.g., geojsonio)
  • data packages that aggregate large, heterogeneous data sets of scientific value or provide R-specific formats for widely used data (e.g., shapefiles for geographic boundaries). (e.g., rnaturalearth)
  • reproducibility (tools that facilitate reproducible research, such as interfacing with git to track provenance or similar). (e.g., git2r)

These are rOpenSci’s onboarding policies for fit and scope at this time. This discussion is aimed at revising the scope and deciding which areas to broaden and which to narrow.

Let us know what you think. What should remain as is, and what should change?

Maybe add a few examples of existing repositories in the guidelines? For instance, I’m not sure that “data munging (In the context of the tools described above. Generic tools such as reshape2, tidyr do not fit this criterion)” is very clear. :slight_smile:

Another point would be to state how to make a “pre-submission enquiry” in case package authors are not sure whether their package fits. Or to say that opening an issue does not demand much effort, so that when in doubt they should submit the package?

will do! good idea.

Good point. Should we encourage people to simply submit, with fit being part of the discussion, or, if they aren’t sure their package is a good fit, to open an issue to discuss fit?

It seems we have two sorts of categories of packages: those that have to do with specific data types, repositories and scientific sub-fields, and then more general R tools. The specific ones have included taxonomic, geospatial, and scientific literature data, and more specialized data like odorant responses (DoOR). In the context of narrow data types, it’s easy to be inclusive of tools that do retrieval/extraction/manipulation/visualization for those specific things as long as they are related to a scientific field.

Then we have more general packages, such as git2r, our database clients, and general data things like assertr. I think these are great, and can all be captured under “reproducibility”. But it’s sort of a catch-all for everything. The question is how to define the boundaries of this category. I note that this conversation partly kicked off over analogsea, Scott’s Digital Ocean client. This package is definitely a win for reproducibility, and probably would be (is?) very widely used both by scientists and others.

Data visualization is also potentially very broad - pretty much everything going on in R graphics these days has to do with interactive web graphics somehow.

I concur on pre-submission. We could just add a note to onboarding docs to let people know to open an issue if they are unsure if their package is a good fit.

I think these categories are mostly straightforward except for (1) data munging, because that would seem to mean things like reshape2, tidyr, etc. but doesn’t mean those, so I’m not sure what it means, and (2) data visualization. Are all plotting packages in scope, or are they limited to those that connect to a web service (like plotly)?

Regardless of the final set of categories, it may make sense to align those categories with the ones used as headers on the packages page. For example, that list would seem to imply that geospatial tools are particularly important, but that’s not one of the categories described in “package fit”.

geospatial tools are particularly important, but that’s not one of the categories described in “package fit”.

Yeah, geospatial is sort of a cross-cutting category, and I think that’s OK. If we update our packages page to be more dynamic perhaps this will be a “tag” across categories, along with “text analysis”.

For “data munging” the prototypical package I imagine is something for parsing data from formats generated by scientific equipment (sort of like genbankr, which parses GenBank files rather than accessing the database). This isn’t too different from data extraction, except that the data may be structured, just in other format types. Similarly, there may be packages that generate specific output types, like those that would ultimately be used in data deposition or be used by other research software tools. assertr falls outside this, but I think it is well within the category of reproducibility.

I’m in the process of getting the ropensci API up, which includes categories for all of the packages. We should have a way of getting feedback on those once it’s up and everyone can see which packages are assigned to which areas. Then we can update as needed.