Statistical software peer review categories

mpadge · May 8, 2020, 5:00pm

rOpenSci’s new project to develop a system to peer-review statistical software has begun, and we are working towards initial submissions hopefully later this year. In the meantime, we would like to ask all those interested to help us in the following three ways:

Following this message is a list of our initial proposed categories which we will accept for peer-review under this system. Could you all please indicate in responses below which categories you think are good or viable, and which not, and also add any other comments you may desire.
We invite anybody interested in any particular categories to please comment below, clearly indicating the category being discussed. Note that at this early stage we are seeking as diverse a range of opinions as possible, rather than category-specific expertise, so please feel free to discuss any particular category regardless of your particular expertise.
Discussions of potential categories other than those on our current list will of course be welcome, for which we first ask you to read about the procedure we employed to derive the following list in the living document of the project at Chapter 4 Scope [SEEKING FEEDBACK] | rOpenSci Statistical Software Peer Review

We request as many contributions as possible over the following week ending Fri 15th May, and aim to reach broad consensus on the initial list of categories in the week after that. We are particularly aiming to cultivate discussions on this discussion forum, but also invite anybody interested to be part of our Slack group - please contact us privately to request an invitation.

Without further ado, the shortlist of categories is:

Bayesian and Monte Carlo algorithms
Dimensionality Reduction and Feature Selection
Machine Learning
Regression and Interpolation
Probability Distributions
Wrapper Packages
Networks
Exploratory Data Analysis
Workflow Software
Summary Statistics
Spatial Statistics

Looking forward to further discussions, and thanking all in advance for any and all forms of participation.

mpadge · May 8, 2020, 5:33pm

We are aware of the likely notable omission of Time Series Analyses as a distinct category, and welcome any thoughts regarding that, particularly as we did decide to include Spatial Statistics. Time Series is not in the current proposed list primarily because it did not arise sufficiently often in our empirical research, but that need not be interpreted as a definitive argument against, and we’d welcome any discussions in that regard.

dgkf · May 8, 2020, 6:35pm

Really good list. The only gap that jumps out to me is model generalizability / interpretability, but that seems to me like it could sit at the intersection of a few of these existing categories if you’re trying to keep a minimal set.

I had a note about adding time series analysis until I thought to read the rest of the thread. I don’t personally work in that area too much so I don’t feel equipped to weigh in, but as a second-hand observer I often hear it framed as a pretty distinct set of statistics.

Eknackstedt · May 9, 2020, 5:37am

Looks like a good list. Really enjoying the book around this as well. I’m particularly interested in EDA and workflow categories.

michaelweylandt · May 11, 2020, 2:48am

Hi @mpadge,

Some random thoughts on your list. I’m not really a member of the rOpenSci community - though I’m a huge fan of your work! - so please ignore everywhere I’m off-base. I’m writing my initial reactions before clicking through to your previous discussions to hopefully give an outsider’s POV; apologies if I’m restarting any bikeshedding.

Proposed Categories:

Bayesian and Monte Carlo Algorithms

Would “Simulation and Markov Chain Monte Carlo” be a better “bucket”?

MC and MCMC methods are both useful outside the Bayesian context: e.g., I’ve used a Gibbs sampler to sample from nasty graphical-model distributions that don’t have a direct sampler. For a peer review system, I’d probably also include samplers from non-standard distributions (e.g., CRAN - Package pgdraw) in the same review stream. Even though these aren’t MCMC methods, I’d guess you’d draw from the same pool of reviewers.
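For readers less familiar with the idea, here is a minimal sketch of a Gibbs sampler in Python (used only for illustration; it is not code from any package discussed here). The target is a standard bivariate normal with correlation `rho`, which of course does have a direct sampler; it is chosen as a toy case only because the result is easy to check. The function names are hypothetical.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each full conditional is univariate normal:
    x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x.
    """
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho ** 2)
    x, y = 0.0, 0.0
    draws = []
    for i in range(burn_in + n_samples):
        x = rng.gauss(rho * y, cond_sd)  # update x given the current y
        y = rng.gauss(rho * x, cond_sd)  # update y given the new x
        if i >= burn_in:                 # discard the warm-up iterations
            draws.append((x, y))
    return draws


def sample_correlation(draws):
    """Pearson correlation of the sampled (x, y) pairs."""
    n = len(draws)
    mean_x = sum(x for x, _ in draws) / n
    mean_y = sum(y for _, y in draws) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in draws) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in draws) / n
    var_y = sum((y - mean_y) ** 2 for _, y in draws) / n
    return cov / math.sqrt(var_x * var_y)


draws = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
estimated_rho = sample_correlation(draws)
```

Nothing in the sampler is Bayesian: it only needs the conditional distributions, so the estimated correlation should land close to the `rho` that was passed in.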

I don’t know if other Bayesian methods (Variational Inference or INLA) would make sense here as well. Pros: it’s under the same goal of “quickly approximate a distribution;” Cons: the math and effectiveness metrics are very different.

Dimensionality Reduction and Feature Selection

What are y’all thinking of in this category? I think of PCA as the canonical example of DR and the lasso as a canonical example of feature selection and it’s hard to square those. It seems like most DR work combines lots of existing features into new features and hence isn’t selective (sparse PCA and friends aside). Would “DR and Feature Engineering” be a better name?

Machine Learning

This seems really broad. Would supervised / unsupervised make for useful sub-categories?

Regression and Interpolation

Would classification go in here as well (essentially making this the supervised bucket from point 3) or are y’all thinking of classification as a subset of regression (a la logistic regression)?

Would both an lm replacement and a randomForest competitor go in here? (Both as “regression”) A useful split might be “linear models” vs “non-linear models”. (Linear being broadly construed to include penalized methods and hierarchical models and all sorts of related jazz)

Probability Distributions

This seems much narrower than the previous two categories, particularly if general simulation goes in the first category. Is this just providing d, p, q and (possibly) r functions for distributions or is there more?
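As a rough illustration of what the four function families do, here is a Python analogue using the standard library's `statistics.NormalDist` (not R itself; the mapping to R's `dnorm`, `pnorm`, `qnorm`, and `rnorm` in the comments is only an analogy). The round-trip check at the end is an assumption about the kind of consistency property a review might look at, not a stated standard.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

density = std_normal.pdf(0.0)            # "d": density at a point, like dnorm(0)
probability = std_normal.cdf(1.96)       # "p": cumulative probability, like pnorm(1.96)
quantile = std_normal.inv_cdf(0.975)     # "q": quantile function, like qnorm(0.975)
draws = std_normal.samples(5, seed=42)   # "r": random draws, like rnorm(5)

# The quantile function should invert the cumulative distribution function.
round_trip = std_normal.cdf(std_normal.inv_cdf(0.3))
```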

Wrapper Packages

A great category! Does this only include wrappers for external libraries or wrappers of other R packages (like parsnip or caret) as well?

Networks

No comment. Not my expertise.

EDA

Possibly combine with summary statistics? (That might just reflect how I do EDA.) Would it make sense to include graphics in here as well, if they’re not getting their own top-level category?

Workflow Software

How broadly are y’all defining “workflow”? Narrowly (drake and kin) or to include things like readxl, rmarkdown, and the like?

Summary Statistics

This feels narrow and ripe for being merged with something else.

Spatial Statistics

If you want to combine with time series analysis, I’ve heard “Dependent Data” as a catch-all, but the software tools for space and time are often quite different, so it probably doesn’t make sense to go that way.

mpadge · May 11, 2020, 1:16pm

Thanks for responses thus far, particularly given that you’ve all joined in for the first time just to respond to this particular discussion. That’s very encouraging - thanks!

@michaelweylandt Thank you for your particularly considered responses, to which a few thoughts:

Would “Simulation and Markov Chain Monte Carlo” be a better “bucket”? (Than current “Bayesian and Monte Carlo Algorithms”)

I am personally not satisfied with that categorical title, which arose through initially distinguishing those two in the background empirical research, and subsequently combining them due to them being very closely related (in statistical terms). I see the “Bayesian” bit being a key - and appropriate - term here, and yes that would mean packages like INLA would tick this categorical box. I actually have issues with “Monte Carlo”, and would personally prefer a term that connoted what is actually happening. The problem is that that quickly and ambiguously overlaps with general simulation approaches, so I remain unsure of a resolution. Further input greatly appreciated!

Dimensionality Reduction and Feature Selection → “DR and Feature Engineering”?

Yeah, maybe. PCA, lasso, MDS, and all those would tick this category. You are right that a lot or most work “isn’t selective”, so I’d be tempted to agree with your suggestion there. Feel free to PR that one into our main document if you like.

Machine Learning: Would supervised / unsupervised make for useful sub-categories?

Our subjective discussions have always assumed exactly that distinction. As the main doc explains, the current list merely reflects what we attempted to establish an empirical basis for, which is not to say that the categories are necessarily optimal in any way. A distinction between supervised and unsupervised will likely be necessary somewhere, and I can definitely see a utility within this category, but also a need to note that such a distinction might be useful in other categories as well. Maybe it might be better to leave this as a single category, and have another check box where appropriate defining whether methods are supervised or not? The categories, along with potential additional items like that, are ultimately intended to guide the development of standards and assessment procedures, so the question is really whether standards for and the assessment of supervised versus unsupervised ML algorithms differ? (yes!) But: Do standards for, and the assessment of, supervised versus unsupervised algorithms for feature engineering differ? For EDA? For …? Again: Further suggestions appreciated.

Regression and Interpolation: Would classification go in here as well?

Yes, classification would (frequently) go in here, but also likely in Feature Engineering, ML, and other categories. Again, the issue is really about standards and assessment. Whether or not the end point is classification or prediction or interpolation, algorithms based on regression techniques are - according to the hypothesis of this category - sufficiently similar to be subject to comparable standards and assessments. In this sense, while distinguishing linear from non-linear might be useful, I do not expect there to be much difference in terms of standards or assessment, and would be tempted to leave that distinction in mind as one to potentially develop in response to a direct or perceived need actually arising.

Probability Distributions - Is this just providing d , p , q … r functions or is there more?

There is more. This category emerges as quite a distinct cluster in our analyses, and is related to things like maximum likelihood techniques and density estimators. It is also a category for which distinct standards and assessments will likely be able to be developed with reference both to R’s very highly developed representations and techniques for probability densities and distributions, as well as things like the US National Institute of Standards and Technology’s collection of reference data sets, described in our document here. It’s more anticipated as a checkbox for whether a package handles or uses probability distributions at all, in which case we definitely anticipate distinct assessments being applicable.

Wrapper Packages - Does this include … other R packages … as well?

Yes, that is certainly the vision, even if we primarily anticipate it being used to describe packages which offer wrappers around software originally written in other languages.

EDA - combine with summary statistics? … include graphics in here as well?

Yes on both scores. EDA currently has no text in the main doc because we’re unsure what to do with that, or where it might rightly belong. Merging the two would likely make sense. As for graphics: We have an “unwritten rule”, which we have discussed extensively, that we will (initially) exclude packages the primary aim of which is graphical representations, yet obviously many packages implement graphical routines, particularly those in this category. And yes, distinguishing primary from secondary is unlikely to be trivial, but to the extent that we may presume such a distinction, graphical routines will generally be assessed as some kind of “optional extra”.

Workflow software - How broadly are y’all defining “workflow”?

Currently an admittedly ill-defined and, yes, broad notion along the lines of software that “support[s] tasks around the statistical workflow”. Workflow is indeed anticipated to be largely as general as possible, yet with the abiding restriction that the workflows which are supported must be predominantly and primarily statistical. See the above link for more detail.

Thanks for the great feedback, and please feel free to continue the discussion here.

elong0527 · May 12, 2020, 8:56pm

I would suggest considering two further categories:

study design
meta analysis

Some packages may belong to multiple categories. It might be helpful to discuss how to classify them (e.g. the survival package contains functions for regression, exploratory methods, and summary statistics), or whether it is OK for a package to be marked with multiple categories.

earowang · May 12, 2020, 11:18pm

Time Series Analyses or Temporal Data Analyses (including time series and longitudinal data) could be a useful category in itself. The packages listed in the CRAN Time Series task view hardly fit into the shortlist of categories.

noamross · May 15, 2020, 2:49pm

Some interesting feedback here (as well as on the Slack channel). My thoughts in response to it:

Most packages would be expected to check off more than one category. For instance, one might have a Bayesian Time Series Regression package or a Machine-Learning Clustering package. In each case the guidance and standards for all relevant categories would apply.
I think “Dimensionality Reduction and Feature Selection” should be “Dimensionality Reduction, Clustering, and Unsupervised Learning.” Feature selection or even some feature engineering might or might not be in this category. For instance, I’d put LASSO in Regression because it is a primarily supervised technique.
“Regression and Interpolation” should be “Regression and Supervised Learning”
It makes sense to have both “Time Series Analysis” and “Spatial Analysis”.
“Machine Learning” is a term that means different things to different people, so we should define how we are using it here. For us I think it can mean, “non-likelihood, predictive approaches to model fitting.” Most packages checking off ML would also check off the unsupervised or supervised categories, and standards in the ML category would relate to things like how objective functions are defined, how out-of-sample prediction and regularization / validation is handled, etc.
I like Study Design and Meta Analysis (I’ve heard a few comments on this, too). Many of these would have some overlap. For instance, a Meta Analysis might ultimately be a form of a hierarchical regression, or a power analysis for study design might be a simulation from a regression model. They might not be the first areas we tackle.
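To make the overlap concrete, here is a minimal sketch of a power analysis done as a simulation from a simple linear regression model, written in Python purely for illustration. Everything in it is an assumption rather than a proposed standard: the function names, the model (one standard-normal predictor, Gaussian noise), and the use of a normal critical value of 1.96 in place of the exact t quantile.

```python
import math
import random


def slope_is_significant(xs, ys, z_crit=1.96):
    """Fit y = a + b*x by least squares and test b against zero.

    Uses a normal approximation to the t distribution (assumption),
    which is reasonable only for moderately large samples.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    return abs(slope / std_err) > z_crit


def simulated_power(slope, n_obs, n_sims=2000, noise_sd=1.0, seed=1):
    """Share of simulated studies in which the slope is declared significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        ys = [slope * x + rng.gauss(0.0, noise_sd) for x in xs]
        hits += slope_is_significant(xs, ys)
    return hits / n_sims


power_under_null = simulated_power(slope=0.0, n_obs=100)  # false-positive rate
power_with_effect = simulated_power(slope=0.5, n_obs=100)
```

With no true effect the result estimates the false-positive rate, which should sit near the nominal 5%; with a true slope the same code estimates power. A package doing this would plausibly check both the study design and regression boxes.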

killick · May 15, 2020, 11:10pm

Do we need something for a primarily graphical package or would that come under ‘EDA’ (if renamed from summary statistics), or ‘wrapper package’?

noamross · May 17, 2020, 9:36pm

I would argue that primarily visualization packages would be out of scope for now (see Mark’s background on this). This is because a “good” visualization, while not totally subjective, would need to be governed by a whole other set of principles than what makes a statistical technique correct.

mpadge · May 19, 2020, 11:04am

Thanks to all for participating in discussions thus far, in response to which we propose the following initial categorisation to guide the development of category-specific standards and assessment procedures. The categories are intended as checklist items, and it is anticipated that software submissions will typically check multiple categories, each of which will trigger distinct aspects of assessment. The aim of the proposed categorisation is accordingly to capture important and ideally orthogonal aspects of statistical software in general that we should consider in our development of standards and assessment. Based on feedback over the past week, our revised list looks something like this:

Bayesian and Monte Carlo algorithms
Dimensionality Reduction, Clustering, and Unsupervised Learning
Machine Learning
Regression and Supervised Learning
Probability Distributions
Wrapper Packages
Networks
Exploratory Data Analysis (EDA) and Summary Statistics
Workflow Software
Spatial Statistics
Time Series
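The checklist idea can be sketched in a few lines of Python, purely as an illustration. The function name is hypothetical, and the rule that a "General Standards" item applies to every submission is an assumption made for this sketch, not something decided in this thread.

```python
# The revised category list from this post, treated as checklist items.
CATEGORIES = frozenset({
    "Bayesian and Monte Carlo algorithms",
    "Dimensionality Reduction, Clustering, and Unsupervised Learning",
    "Machine Learning",
    "Regression and Supervised Learning",
    "Probability Distributions",
    "Wrapper Packages",
    "Networks",
    "Exploratory Data Analysis (EDA) and Summary Statistics",
    "Workflow Software",
    "Spatial Statistics",
    "Time Series",
})


def assessments_for(checked):
    """Return the sets of standards a submission would be assessed against.

    Each checked category triggers its own assessment; unknown category
    names are rejected. "General Standards" is added for every submission
    (an assumption for this sketch).
    """
    checked = set(checked)
    unknown = checked - CATEGORIES
    if unknown:
        raise ValueError(f"Unknown categories: {sorted(unknown)}")
    return ["General Standards"] + sorted(checked)


# A Bayesian time-series regression package checks three boxes at once.
triggered = assessments_for({
    "Bayesian and Monte Carlo algorithms",
    "Time Series",
    "Regression and Supervised Learning",
})
```

Treating a submission as a set of categories, rather than forcing it into one, is what lets each checked box trigger its own part of the assessment.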

The “EDA and Summary Statistics” category could, at least initially, encompass aspects of study design and meta analysis as proposed by @elong0527 particularly as both study design and meta analysis can be interpreted as forms of summary statistics, along with other categories such as regression as mentioned by @noamross above, or workflow software for study design.

We welcome any additional comments here, or via the main project document.

mpadge · May 29, 2020, 9:16pm

We have decided to proceed with the development of prototype standards and assessment procedures for four of the above categories of statistical software. Each of these now has its own thread, and we invite and encourage contributions to each of these categories - please click directly on the links below to go to the individual threads:

Looking forward to category-specific contributions and insights in each of those threads.

privefl · June 19, 2020, 1:21pm

Good to know that there are all these new categories.
What about a section on “Statistical tools for Large-Scale Data”?

Topic	Category and tags	Replies	Views	Activity
Statistical Software: Exploratory Data Analysis and Summary Statistics	Statistical Software Peer Review; eda	7	1144	August 24, 2020
Statistical Software: General Standards	Statistical Software Peer Review	0	669	August 24, 2020
Statistical Software: Regression and Supervised Learning	Statistical Software Peer Review; regression, supervised-learning	18	1282	August 24, 2020
Statistical Software: Bayesian Analyses	Statistical Software Peer Review; bayesian	8	1487	August 24, 2020
Statistical Software: Time Series	Statistical Software Peer Review; time-series	4	1019	June 7, 2021
