Statistical Software: Regression and Supervised Learning

Tags: #<Tag:0x00007fe4f108a948> #<Tag:0x00007fe4f1088620>

Hi all! This thread is for discussion of standards for peer review of Regression or Supervised Learning packages, one of our initial statistical package categories. We’re interested in the community’s input as to what standards should apply specifically to such packages.

Regression and Supervised Learning includes all packages providing unique and statistical mappings between input and output data. This category includes interpolation, imputation, and inference. Note that categories are non-exclusive, and individual packages will commonly fit several categories; Regression packages may also be, for instance, Bayesian or Time-Series packages.

We’d like your answers to these questions:

  • What should a regression or supervised package do so as to pass a peer review process?
  • How should regression software do that?
  • What should be documented in regression packages?
  • How should regression packages be tested?

Don’t worry if your contributions might apply to other categories. We will look at suggestions across topics to see which should be elevated to more general standards. Please do comment on whether you think a standard should be required or recommended for a package to pass peer-review. We’ll use these comments as input for prioritizing standards in the future.

For some useful references, see our background chapter, and Alex Hayes’ post on statistical testing. Feel free to drop other references in the thread!

1 Like

Here are my initial and incomplete thoughts on some I’d like to see required and reviewed in regression packages:

  • Provide a method to retrieve convergence statistics and diagnostics from the fit model, and report poor convergence in appropriate warnings or errors.

  • Test and document expected behavior under the following conditions, and provide informative runtime warnings or errors when they occur:

    • Strong dependence amongst predictor variables
    • Complete dependence amongst predictor variables
    • Perfect, noiseless dependent/independent variable relationships
    • Violation of statistical assumptions (e.g., data distribution)*
    • Multiple random seeds or initial conditions (e.g., is the algorithm perfectly deterministic or is there expected non-determinism )*
      • Interaction of strong dependence and random variation is a good thing to test!
    • Perturbation of data, e.g., resampling or noise introduction. (This could get pretty elaborate, so I’d say testing the distributional properties expected with resampling is optional/best)*
    • A range of data sizes (again, testing performance/data-size relationships and expected asymptotic values may optional/best)*
    • Missing data: Are inputs with missing fields removed, included, imputed, or otherwise handled?*
    • Data less then required for the algorithm (e.g., more parameters than data)*
    • Zero-length data*
    • Data of unexpected type*
  • Export data sets used for testing in the package for user reuse*

    • Where possible, using standard data sets with known properties is encouraged.*
  • Document the size of data and other data properi under which the package has been tested and is expected to perform*

  • Document if and how data is transformed for estimation, document how and if outputs (parameters, predictions) are re-transformed, and provide a method to access the transformation and inverse transformation functions.*

  • Report and provide methods to extract appropriate goodness-of-fit statistics

  • Report and provide methods to extract parameters, tests of parameters, and effect sizes

  • Provide a mechanism to extract, examine, and visualize residuals and other diagnostic values and documentation to interpret them. It may do so via dedicated functions or appropriate examples, vignettes, or documentation demonstrating workflows from other packages.

Asterisks * designate standards that probably go beyond the “Regression” category to other model-fitting or ML, which is quite a few!

Max Kuhn also pointed us to Stephen Milborrow’s Guidelines for S3 Regression Models which is a good collection of best practices for API interfaces, which I think is not to prescriptive and largely adoptable (for S3, atleast). His own Conventions for R Modeling Packages from tidyworld is a little more prescriptive.

1 Like

Inspired by @ucfagls’s recent post I might add, “document and test expected behavior for extrapolation/out-of-sample prediction”.

A key one for me would be interface for modelling:

  1. Is the interface clearly documented, including the format for predictors, response variables (or labels for semi-supervised learning), and specialized functions (transformations, missing values, offsets, etc.)?
  2. Does the interface conform as closely as possible to other R regression packages? i.e. effectively using the formula interface, (X | Y) terms denoting some form of random effects, etc. Also, do common R regression functions, such as offset() or the na.omit()/na.exclude() etc. work in a manner consistent with their use in other R packages in the new package?
  3. Is it clear how to use the fitted model to predict new data, either within the range of the training data set or extrapolating to new data? Is it clear what is actually being predicted (mean of link function, mean of response variable, range of data, etc.)?
1 Like

Some quick initial thoughts on this based on what others have posted so far…

This quite spartan R developer page has been useful for me previously, in addition to the links provided above. I’ll echo Eric and the tidyworld page in saying that appropriate behaviour of na.action etc is really important, especially for the case where you’re running multiple models in comparison and expect their output to be the same shape.

Noam: in your “test and document” section it’s hard to know which of these should be warnings in the package vs. which should be tested or documented. For example, “strong dependence” seems context specific (which measure to use, what does how strong is “strong”) and warnings may not be appropriate (either too conservative or not conservative enough in a given application), so could be misleading for modellers. Extrapolation can be hard to define in a mutlivariate context (are you extrapolating in the middle of a donut?), so this is probably something that’s difficult to generalize outside of a given subject domain (compare its definition in time series vs species distribution modelling). So I think it’s important to avoid being too prescriptive on some of these.

I’m not sure about whether “violation of statistical assumptions” should be included as warnings/errors or as including model checks/diagnostics. I guess it’s both, e.g. fitting normal data with family=poisson() will yield and error if the response is negative but only a warning if the data are non-integer. Using some other distribution will just yield bad residual plots that you’ll need to diagnose yourself.

It’s probably also worth bearing in mind that for new (in the academic, decadal-level sense of “new”) methods, model checking procedures may not have been developed yet. I don’t think that should necessarily disqualify a package, but those deficiencies should be documented (what can you currently not test?). (With the obvious caveat that if you can’t tell if the model works at all then it’s not useful!)

Though the current set of functions used in popular R packages is not perfect, the predict, plot, summary, logLik, residuals etc methods are at least familiar, so offer a good entry point for new users. Avoiding naming schemes like mymodel_map/mymodel_prediction rather than predict should minimize the mental effort for new or infrequent users. To that end I think the Milborrow guide is a good start.

I’ll have a think about testing and maybe post again later if I can think of anything useful.

3 Likes

Thanks @eric-pedersen and @dill. Dave, I really like the thought of “What can you currently not test?” as a question to pose to authors to prompt clearer documentation.

I agree on the that there are limits of what we can prescribe, but I think at this point we should be expansive in what we request or suggest, and ask authors work through a list and note exceptions where they do not apply. It’s far more common that none of these things are tested.

1 Like

I think there should be model checks available, but failing those checks should not return a warning / error. Whether a specific assumption is violated will only matter for some analyses, and whether an assumption is violated or not will depend on the strength of evidence required to determine model violations. For some assumptions (like independence) there are an infinite number of ways of violating the assumption, so it’s not possible to test them all.

I’m also a bit leery about having specific functions to test model assumptions; while I don’t have hard data on it, I think these kinds of functions encourage superficial testing, where authors will assume their model is accurate if it passes a few pre-set tests. I’ve often seen regression papers with model checks that consist of noting that their residuals appear normal under a KS-test, for instance.

In general, I think a package should include documentation that:

a) clearly lays out what the model is assuming

b) suggests tools or approaches to include checks for assumption violations into a standard workflow, based on standard R functions and model-fitting functions.

Ideally, it would also be good to have a vignette that shows what a typical workflow would look like, including model checks in it.

2 Likes

Totally agree on avoiding tick box checks, it’s important to ensure that kind of behaviour is discouraged. At the same time, people need to be enabled do what they need to do to check models and either ignore checking or write them from scratch themselves (which is a recipe for a different kind of disaster).

1 Like

Very true. If a check is difficult to write, it should definitely have its own function associated with it, like k.check in mgcv.

Some misc comments:

  • I strongly, strongly encourage an official statement against following BDR’s model building advice, which focuses on getting meta-programming right at the expense of pretty much everything else. At the very least, there needs to be some strong hinting that following that advice is in general inadequate/insufficient for building useful software.

  • Some comments about separating hyperparameter tuning and model fitting would probably be helpful.

  • hardhat provides lots of infrastructure for type checking in supervised problems

  • Convergence diagnostics are not specific to supervised learning; they should be used everywhere.

  • A standard test for supervised models with high capacities is to see if the model can achieve zero training error (potentially on a suitably constructed artificial dataset). Failure to overfit (or achieve low risk on toy problems) is a really useful red flag.

  • Especially for complicated models, it is often valuable to separate classes for estimators and estimates. I wrote in more detail about this here. Further, it is absolutely essential that different estimates correspond to different classes. Again, more details in the blog post.

  • Models objects shouldn’t save training data in them. In general, people using models in production care a lot more about model object size than academics realize.

With respect to implementation in a standardised way that fosters automation, I’ve found the Scienceverse project and R package really interesting.

From the package docs:

The goal of scienceverse is to generate and process machine-readable study descriptions to facilitate archiving studies, pre-registering studies, finding variables and measures used in other research, meta-analyses of studies, and finding and re-using datasets in other ways.

…allow users to generate rich, standardized metadata describing experiments, materials, data, code, and any other research components that scholars want to share. Such standardization would facilitate reproducibility, cumulative science (e.g., meta-analysis) and reuse (e.g., finding datasets with specific measures). While many projects focus on making data FAIR, Scienceverse aims to make every aspect of research findable, accessible, interoperable and reusable.

Late to the party, trying to catch up. Lots to talk about.

  • “report poor convergence”: I agree that packages should report whatever they know about (optim()'s quietly reporting a non-zero convergence code has bitten me many times), but also agree that good convergence diagnostics aren’t always available.
  • “report strong dependence”: how and why? Are we reporting strong dependence among predictor variables (which might lead to poor computational/numerical behaviour in some cases) or among estimated parameters? It’s certainly useful to be able to report e.g. correlations among estimated parameters (where that’s applicable), but the degree of correlation that is interesting is highly case-specific (even within use of a particular package).
  • re diagnostics/tests of assumptions: agree with @eric-pedersen that null-hypothesis testing may encourage bad behaviour; graphical diagnostics are good. On the other hand, how opinionated are we aiming to be?
  • there is an entire ecosystem of post-processing packages available (off the top of my head broom, effects, emmeans, DHARMa, sjstats, sjplot, multcomp … how much do we encourage package authors to interface with (which parts of?) this ecosystem?
  • probably worth laying out the entire set of “default” accessor methods (not all of which will be applicable in a given case) and prioritizing them.
    • vcov() is important for post-processing (but how should models with multiple options, e.g. models with zero-inflation and conditional components, handle this?)
    • the simulate() method is extremely useful (and I have a personal opinion that allowing it to be extended with newdata and newparams arguments makes it much more useful
  • agree with @eric-pedersen that consistent handling of the offset parameter is important
  • agree with @alexpghayes that it’s really useful to be able to get an ‘unevaluated’ or ‘unfitted’ version of a model object back (Julia does this in a sensible way, IIRC)
  • also agree with @alexpghayes that careful thought about saving training data: save? don’t save? allow an option? allow an option to have training data shared among several model objects? is important (cf. the butcher package). On the other hand, users need to know what they can and can’t expect to do with models once they’re fitted if they haven’t stored the training data
  • update() is very useful but generally very fragile
  • allow options for scaling/centering predictors?
  • important to be clear about what scale parameters are estimated on (constrained vs. unconstrained), i.e. what is the “link function” for a parameter, and what does coef() give you by default?
  • I don’t dislike BDR’s development notes as much as @alexpghayes does: I agree that they could emphasize different aspects, but there’s a lot of valuable stuff in there - especially the little-known details about handling data-dependent bases (see ?makepredictcall) …
1 Like

I am coming to this quite late, but wanted to address the “how” a little bit more than the discussion has so far.

  • I would love to see encouragement for new modeling packages to adopt boring vanilla data formats for input and output (nothing fancy or unique) and S3 methods. This contributes both to correctness and to interoperability with the ecosystem of post-processing packages that @bbolker mentioned. Certainly hardhat offers scaffolding for this, if modelers are interested in using it.

  • Clarity about data validation and data transformation (how are factors handled? so differently by so many models! scaling/centering? :woman_shrugging: who even knows when this happens) is so important. I see this as something to consider in implementation (good practices around validation and transformation) and documentation, i.e. “factor predictors are converted to…” in docs.

  • I haven’t seen much discussion here of just really straightforward testing for correctness and how to do it, although I guess that is in @alexpghayes’s blog post. That is what I really would like to see in peer review standards, more even than (what seem to me) more subtle issues around assumptions. Some ideas to suggest there are using simulated datasets, resampling real ones, etc.

1 Like

What do people think about author choice to use an x and y numeric matrix API rather than a formula API?

From my perspective this is a reasonable choice for someone who wants to do a basic, minimalist implementation rather than do pre-processing on data frames and generate matrices. When transparently implemented, it should allow others to wrap it using other tools.

Agreed, although it will obviously make it harder to use the package in conjunction with other packages that use a formula interface … can’t think of a reason it should be more than one step away if you already have the formula and the raw data (i.e. X <- model.matrix(my_formula, my_data)). I’d be curious to find out more from package maintainers who have made this choice (glmnet is the only such package that springs to mind, and that implementation is pretty old-school …)

By the way there was a post on testing by Bob O’Hara on Andrew Gelman’s blog. In particular, he says “There are whole books written about numerical analysis, much of which centers around testing”. I’d love to have pointers to these books: from my very limited experience with numerical analysis, most of what you can actually feasibly do is derive rigorous error bounds for very particular sets of analyses, with lots of work - although see e.g. Chang et al 2020 …

Brian Ripley also complains a lot about arbitrarily chosen test tolerances, but I’ve never managed to figure out the procedure by which he thinks that be established rigorously across platforms for complex procedures …

Chang, Xiao-Wen, Peng Kang, and David Titley-Peloquin. “Error Bounds for Computed Least Squares Estimators.” Linear Algebra and Its Applications 586 (February 1, 2020): 28–42. https://doi.org/10.1016/j.laa.2019.10.014.

The tidymodels team has a package in large part providing tools for dealing with formula interface edge-cases that are less than than straightforward: http://hardhat.tidymodels.org. Maybe Max and @juliasilge might characterize it differently, but that’s in part why I think formula interface can be a non-trival problem issue.

I am not generally a fan of this choice within the R ecosystem, given how either base R users or tidyverse users tend to think about their data structures, data cleaning, and modeling. I would think that a formula method and a data frame method are typically the minimum.

There are some models that are good fits for sparse data, and having methods for specialized sparse matrix types can be helpful. Perhaps there are special cases (some specialized regression modeling package for sparse data, for example) where it really doesn’t make sense to support a formula method.

We now have preliminary drafts of both general standards for statistical software, and specific extensions into a few initial categories - Regression, Bayesian and Monte Carlo, Exploratory Data Analysis, and Time Series. We would very much appreciate any feedback, comments, suggestions, improvements to any and all of the current standards. Everybody is encouraged to peruse the single “master” document in bookdown form, and to provide feedback in increasing orders of formality in one or more of the following ways:

  1. The #stats-peer-review slack channel
  2. The relevant sections of the
    discussion forum
  3. The github repository for the “master” document, either via issues for general discussion, or pull requests for more concrete improvements.

Note that we anticipate some degree of “shuffling” between general and category-specific standards, much of which will be deferred until we have developed standards for all of our anticipated categories. There is thus one and only one form of comment for which we are currently not seeking feedback, which is comments regard whether category-specific standard X might be better considered general standard Y - that will be worked out later.

Looking forward to any and all feedback from anybody interested in helping to create our system for peer reviewing statistical software. Without further ado, the relevant standards follow immediately below.


Standards for Regression And Supervised Learning Software

This document details standards for Regression and Supervised Learning Software. Common purposes of Regression and Supervised Learning Software – referred to from here on for simplicity as “Regression Software” – are to fit models to estimate relationships or to make predictions between specified inputs and outputs. Regression Software includes tools with inferential or predictive foci, Bayesian, frequentist, or probability-free Machine Learning (ML) approaches, parametric or or non-parametric approaches, discrete outputs (such as in classification tasks) or continuous outputs, and models and algorithms specific to applications or data such as time series or spatial data. In many cases other standards specific to these subcategories may apply.

The following standards are divided among several sub-categories, with each standard prefixed with “RE”.

1. Input data structures and validation

  • RE1.0 Regression Software should enable models to be specified via a formula interface, unless reasons for not doing so are explicitly documented.
  • RE1.1 Regression Software should document how formula interfaces are converted to matrix representations of input data. See Max Kuhn’s RStudio blog post for examples.
  • RE1.2 Regression Software should document expected format (types or classes) for inputting predictor variables, including descriptions of types or classes which are not accepted; for example, specification that software accepts only numeric inputs in vector or matrix form, or that all inputs must be in data.frame form with both column and row names.
  • RE1.3 Regression Software should explicitly document any aspects of input data (such as row names) which are not used in model construction.
  • RE1.4 Regression Software should document any assumptions made with regard to input data; for example distributional assumptions, or assumptions that predictor data have mean values of zero. Implications of violations of these assumptions should be both documented and tested.

2. Pre-processing and Variable Transformation

  • RE2.0 Regression Software should document any transformations applied to input data, for example conversion of label-values to factor, and should provide ways to explicitly avoid any default transformations (with error or warning conditions where appropriate).
  • RE2.1 Regression Software should implement explicit parameters controlling the processing of missing values, ideally distinguishing NA or NaN values from Inf values (for example, through use of na.omit() and related functions from the stats package).
  • RE2.2 Regression Software should provide different options for processing missing values in predictor and response data. For example, it should be possible to fit a model with no missing predictor data in order to generate values for all associated response points, even where submitted response values may be missing.
  • RE2.3 Where applicable, Regression Software should enable data to be centred (for example, through converting to zero-mean equivalent values; or to z-scores) or offset (for example, to zero-intercept equivalent values) via additional parameters, with the effects of any such parameters clearly documented and tested.
  • RE2.4 Regression Software should implement pre-processing routines to identify whether aspects of input data are perfectly collinear, notably including:
    • RE2.4a Perfect collinearity among predictor variables
    • RE2.4b Perfect collinearity between independent and dependent variables

These pre-processing routines should also be tested as described below.

3. Algorithms

The following standards apply to the model fitting algorithms of Regression Software which implements or relies on iterative algorithms which are expected to converge to generate model statistics. Regression Software which implements or relies on iterative convergence algorithms should:

  • RE3.0 Issue appropriate warnings or other diagnostic messages for models which fail to converge.
  • RE3.1 Enable such messages to be optionally suppressed, yet should ensure that the resultant model object nevertheless includes sufficient data to identify lack of convergence.
  • RE3.2 Ensure that convergence thresholds have sensible default values, demonstrated through explicit documentation.
  • RE3.3 Allow explicit setting of convergence thresholds, unless reasons against doing so are explicitly documented.

4. Return Results

  • RE4.0 Regression Software should return some form of “model” object, generally through using or modifying existing class structures for model objects (such as lm, glm, or model objects from other packages), or creating a new class of model objects.
  • RE4.1 Regression Software may enable an ability to generate a model object without actually fitting values. This may be useful for controlling batch processing of computationally intensive fitting algorithms.

4.1 Accessor Methods

Regression Software should provide functions to access or extract as much of the following kinds of model data as possible or practicable. Access should ideally rely on class-specific methods which extend, or implement otherwise equivalent versions of, the methods from the stats package which are named in parentheses in each of the following standards.

The model objects should include, or otherwise enable effectively immediate access to,

  • RE4.2 Model coefficients (via coeff() / coefficients())
  • RE4.3 Confidence intervals on those coefficients (via confint())
  • RE4.4 The specification of the model, generally as a formula (via formula())
  • RE4.5 Numbers of observations submitted to model (via nobs())
  • RE4.6 The variance-covariance matrix of the model parameters (via vcov())
  • RE4.7 Where appropriate, convergence statistics

Regression Software should provide simple and direct methods to return or otherwise access the following form of data and metadata, where the latter includes information on any transformations which may have been applied to the data prior to submission to modelling routines.

  • RE4.8 Response variables, and associated “metadata” where applicable.
  • RE4.9 Modelled values of response variables.
  • RE4.10 Model Residuals, including sufficient documentation to enable interpretation of residuals, and to enable users to submit residuals to their own tests.
  • RE4.11 Goodness-of-fit and other statistics associated such as effect sizes with model coefficients.
  • RE4.12 Where appropriate, functions used to transform input data, and associated inverse transform functions.

Regression software may provide simple and direct methods to return or otherwise access the following:

  • RE4.13 Predictor variables, and associated “metadata” where applicable.

4.2 Extrapolation and Forecasting

  • RE4.14 Where Regression Software is intended to, or can, be used to extrapolate or forecast values, values should also be provided for extrapolation or forecast errors.
  • RE4.15 Sufficient documentation and/or testing should be provided to demonstrate that forecast errors, confidence intervals, or equivalent values increase with forecast horizons.

4.3 Reporting Return Results

  • RE4.16 Model objects returned by Regression Software should implement or appropriately extend a default print method which provides an on-screen summary of model (input) parameters and (output) coefficients.
  • RE4.17 Regression Software may also implement summary methods for model objects, and in particular should implement distinct summary methods for any cases in which calculation of summary statistics is computationally non-trivial (for example, for bootstrapped estimates of confidence intervals).

5. Documentation

Beyond the general standards for documentation, Regression Software should explicitly describe the following aspects, and ideally provide extended documentation including graphical output:

  • RE5.0 Scaling relationships between sizes of input data (numbers of observations, with potential extension to numbers of variables/columns) and speed of algorithm.

6. Visualization

  • RE6.0 Model objects returned by Regression Software (see RE3.0) should have default plot methods, either through explicit implementation, extension of methods for existing model objects, or through ensuring default methods work appropriately.
  • RE6.1 The default plot method should produce a plot of the fitted values of the model, with optional visualisation of confidence intervals or equivalent.
  • RE6.2 Where a model object is used to generate a forecast (for example, through a predict() method), the default plot method should provide clear visual distinction between modelled (interpolated) and forecast (extrapolated) values.

7. Testing

Tests for Regression Software should include the following conditions and cases:

  • RE7.0 Tests with noiseless, exact relationships between predictor (independent) data.
    • RE7.0a In particular, these tests should confirm that model fitting is at least as fast or (preferably) faster than testing with equivalent noisy data (see RE2.4a).
  • RE7.1 Tests with noiseless, exact relationships between predictor (independent) and response (dependent) data.
    • RE7.1a In particular, these tests should confirm that model fitting is at least as fast or (preferably) faster than testing with equivalent noisy data (see RE2.4b).