Statistical Software: Regression and Supervised Learning


Hi all! This thread is for discussion of standards for peer review of Regression or Supervised Learning packages, one of our initial statistical package categories. We’re interested in the community’s input as to what standards should apply specifically to such packages.

Regression and Supervised Learning includes all packages providing unique statistical mappings between input and output data. This category includes interpolation, imputation, and inference. Note that categories are non-exclusive, and individual packages will commonly fit several categories; Regression packages may also be, for instance, Bayesian or Time-Series packages.

We’d like your answers to these questions:

  • What should a regression or supervised learning package do to pass a peer review process?
  • How should regression software do that?
  • What should be documented in regression packages?
  • How should regression packages be tested?

Don’t worry if your contributions might apply to other categories. We will look at suggestions across topics to see which should be elevated to more general standards. Please do comment on whether you think a standard should be required or recommended for a package to pass peer-review. We’ll use these comments as input for prioritizing standards in the future.

For some useful references, see our background chapter, and Alex Hayes’ post on statistical testing. Feel free to drop other references in the thread!

1 Like

Here are my initial and incomplete thoughts on some standards I’d like to see required and reviewed in regression packages:

  • Provide a method to retrieve convergence statistics and diagnostics from the fit model, and report poor convergence in appropriate warnings or errors.

  • Test and document expected behavior under the following conditions, and provide informative runtime warnings or errors when they occur (a minimal testing sketch appears after this list):

    • Strong dependence amongst predictor variables
    • Complete dependence amongst predictor variables
    • Perfect, noiseless dependent/independent variable relationships
    • Violation of statistical assumptions (e.g., data distribution)*
    • Multiple random seeds or initial conditions (e.g., is the algorithm perfectly deterministic or is there expected non-determinism?)*
      • Interaction of strong dependence and random variation is a good thing to test!
    • Perturbation of data, e.g., resampling or noise introduction. (This could get pretty elaborate, so I’d say testing the distributional properties expected with resampling is optional/best practice.)*
    • A range of data sizes (again, testing performance/data-size relationships and expected asymptotic values may be optional/best practice)*
    • Missing data: Are inputs with missing fields removed, included, imputed, or otherwise handled?*
    • Data less than required for the algorithm (e.g., more parameters than data)*
    • Zero-length data*
    • Data of unexpected type*
  • Export data sets used for testing in the package for user reuse*

    • Where possible, using standard data sets with known properties is encouraged.*
  • Document the size of data and other data properties under which the package has been tested and is expected to perform*

  • Document if and how data is transformed for estimation, document how and if outputs (parameters, predictions) are re-transformed, and provide a method to access the transformation and inverse transformation functions.*

  • Report and provide methods to extract appropriate goodness-of-fit statistics

  • Report and provide methods to extract parameters, tests of parameters, and effect sizes

  • Provide a mechanism to extract, examine, and visualize residuals and other diagnostic values and documentation to interpret them. It may do so via dedicated functions or appropriate examples, vignettes, or documentation demonstrating workflows from other packages.

Asterisks (*) designate standards that probably go beyond the “Regression” category to other model-fitting or ML categories, and that’s quite a few of them!
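
To make the “test and document expected behavior” items above concrete, here is a minimal sketch of what such tests could look like, using testthat with base lm() standing in for a package’s own fitting function (the expected messages and behaviours would of course be the package’s own):

```r
# Minimal sketch of edge-case tests; lm() stands in for the package's own
# fitting function, and the expected messages would be the package's own.
library(testthat)

test_that("zero-length data is rejected with an informative error", {
  dat <- data.frame(x = numeric(0), y = numeric(0))
  expect_error(lm(y ~ x, data = dat), "0 \\(non-NA\\) cases")
})

test_that("complete dependence amongst predictors is handled predictably", {
  dat <- data.frame(x1 = 1:10, y = rnorm(10))
  dat$x2 <- 2 * dat$x1                 # perfectly collinear predictor
  fit <- lm(y ~ x1 + x2, data = dat)
  expect_true(anyNA(coef(fit)))        # lm() silently drops the aliased term;
                                       # a package might warn instead
})

test_that("rows with missing values are dropped under na.omit", {
  dat <- data.frame(x = c(1:9, NA), y = rnorm(10))
  fit <- lm(y ~ x, data = dat, na.action = na.omit)
  expect_equal(nobs(fit), 9)
})
```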

Max Kuhn also pointed us to Stephen Milborrow’s Guidelines for S3 Regression Models, which is a good collection of best practices for API interfaces; I think it is not too prescriptive and largely adoptable (for S3, at least). His own Conventions for R Modeling Packages from tidyworld is a little more prescriptive.

1 Like

Inspired by @ucfagls’s recent post I might add, “document and test expected behavior for extrapolation/out-of-sample prediction”.

A key one for me would be the interface for modelling:

  1. Is the interface clearly documented, including the format for predictors, response variables (or labels for semi-supervised learning), and specialized functions (transformations, missing values, offsets, etc.)?
  2. Does the interface conform as closely as possible to other R regression packages? i.e. effectively using the formula interface, (X | Y) terms denoting some form of random effects, etc. Also, do common R regression functions, such as offset() or na.omit()/na.exclude(), work in the new package in a manner consistent with their use in other R packages?
  3. Is it clear how to use the fitted model to predict new data, either within the range of the training data set or extrapolating to new data? Is it clear what is actually being predicted (mean of link function, mean of response variable, range of data, etc.)?
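
A minimal sketch of these conventions, using base glm() as the reference point (a new package would ideally mirror this: formula + data, standard na.action handling, offset(), and a predict() method whose type argument makes explicit what is being predicted):

```r
# Conventional interface: formula + data, na.action, offset(), and a
# predict() method with an explicit `type` argument (base glm() shown here).
set.seed(1)
dat <- data.frame(x = rnorm(100), exposure = runif(100, 1, 10))
dat$y <- rpois(100, lambda = exp(0.5 + 0.3 * dat$x) * dat$exposure)

fit <- glm(y ~ x + offset(log(exposure)),
           family = poisson(), data = dat, na.action = na.omit)

newdat <- data.frame(x = c(-1, 0, 1), exposure = 1)
predict(fit, newdata = newdat, type = "link")      # linear predictor scale
predict(fit, newdata = newdat, type = "response")  # expected count
```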
1 Like

Some quick initial thoughts on this based on what others have posted so far…

This quite spartan R developer page has been useful for me previously, in addition to the links provided above. I’ll echo Eric and the tidyworld page in saying that appropriate behaviour of na.action etc is really important, especially for the case where you’re running multiple models in comparison and expect their output to be the same shape.

Noam: in your “test and document” section it’s hard to know which of these should be warnings in the package vs. which should be tested or documented. For example, “strong dependence” seems context specific (which measure to use, how strong is “strong”?) and warnings may not be appropriate (either too conservative or not conservative enough in a given application), so could be misleading for modellers. Extrapolation can be hard to define in a multivariate context (are you extrapolating in the middle of a donut?), so this is probably something that’s difficult to generalize outside of a given subject domain (compare its definition in time series vs species distribution modelling). So I think it’s important to avoid being too prescriptive on some of these.

I’m not sure whether “violation of statistical assumptions” should be included as warnings/errors or as including model checks/diagnostics. I guess it’s both, e.g. fitting normal data with family=poisson() will yield an error if the response is negative but only a warning if the data are non-integer. Using some other distribution will just yield bad residual plots that you’ll need to diagnose yourself.
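
A quick illustration of that asymmetry with base glm() (nothing package-specific here):

```r
# poisson() refuses negative responses outright, but only warns (via dpois)
# when the response is non-integer.
x <- 1:3
glm(c(-1, 2, 3) ~ x, family = poisson())    # error: negative values not allowed
glm(c(1.5, 2, 3) ~ x, family = poisson())   # fits, with non-integer warnings
```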

It’s probably also worth bearing in mind that for new (in the academic, decadal-level sense of “new”) methods, model checking procedures may not have been developed yet. I don’t think that should necessarily disqualify a package, but those deficiencies should be documented (what can you currently not test?). (With the obvious caveat that if you can’t tell if the model works at all then it’s not useful!)

Though the current set of functions used in popular R packages is not perfect, the predict, plot, summary, logLik, residuals etc methods are at least familiar, so offer a good entry point for new users. Avoiding naming schemes like mymodel_map/mymodel_prediction rather than predict should minimize the mental effort for new or infrequent users. To that end I think the Milborrow guide is a good start.

I’ll have a think about testing and maybe post again later if I can think of anything useful.

3 Likes

Thanks @eric-pedersen and @dill. Dave, I really like the thought of “What can you currently not test?” as a question to pose to authors to prompt clearer documentation.

I agree that there are limits to what we can prescribe, but I think at this point we should be expansive in what we request or suggest, and ask authors to work through a list and note exceptions where items do not apply. It’s far more common that none of these things are tested.

1 Like

I think there should be model checks available, but failing those checks should not return a warning / error. Whether a specific assumption is violated will only matter for some analyses, and whether an assumption is violated or not will depend on the strength of evidence required to determine model violations. For some assumptions (like independence) there are an infinite number of ways of violating the assumption, so it’s not possible to test them all.

I’m also a bit leery about having specific functions to test model assumptions; while I don’t have hard data on it, I think these kinds of functions encourage superficial testing, where authors will assume their model is accurate if it passes a few pre-set tests. I’ve often seen regression papers with model checks that consist of noting that their residuals appear normal under a KS-test, for instance.

In general, I think a package should include documentation that:

a) clearly lays out what the model is assuming

b) suggests tools or approaches to include checks for assumption violations into a standard workflow, based on standard R functions and model-fitting functions.

Ideally, it would also be good to have a vignette that shows what a typical workflow would look like, including model checks in it.
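
For example, such a vignette might walk through graphical checks with standard tools rather than a single pass/fail test (sketched here with base functions; the package’s own fit would replace lm()):

```r
# Fit, then inspect assumptions graphically as part of the workflow.
fit <- lm(dist ~ speed, data = cars)

op <- par(mfrow = c(2, 2))
plot(fit)   # residuals vs fitted, QQ plot, scale-location, residuals vs leverage
par(op)
```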

2 Likes

Totally agree on avoiding tick-box checks; it’s important to ensure that kind of behaviour is discouraged. At the same time, people need to be enabled to do what they need to do to check models, rather than either ignoring checking or writing checks from scratch themselves (which is a recipe for a different kind of disaster).

1 Like

Very true. If a check is difficult to write, it should definitely have its own function associated with it, like k.check in mgcv.
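
For reference, that check is a one-liner for the user precisely because mgcv wraps the hard part in a dedicated function:

```r
# mgcv::k.check() assesses whether the basis dimension k of each smooth
# was set large enough; the fitted GAM below is just a toy example.
library(mgcv)

set.seed(2)
dat <- gamSim(1, n = 400)                    # simulated example data from mgcv
b   <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)
k.check(b)
```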

Some misc comments:

  • I strongly, strongly encourage an official statement against following BDR’s model building advice, which focuses on getting meta-programming right at the expense of pretty much everything else. At the very least, there needs to be some strong hinting that following that advice is in general inadequate/insufficient for building useful software.

  • Some comments about separating hyperparameter tuning and model fitting would probably be helpful.

  • hardhat provides lots of infrastructure for type checking in supervised problems

  • Convergence diagnostics are not specific to supervised learning; they should be used everywhere.

  • A standard test for supervised models with high capacities is to see if the model can achieve zero training error (potentially on a suitably constructed artificial dataset). Failure to overfit (or achieve low risk on toy problems) is a really useful red flag (see the sketch after this list).

  • Especially for complicated models, it is often valuable to separate classes for estimators and estimates. I wrote in more detail about this here. Further, it is absolutely essential that different estimates correspond to different classes. Again, more details in the blog post.

  • Model objects shouldn’t save training data in them. In general, people using models in production care a lot more about model object size than academics realize.
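
A sketch of the “can it overfit?” check mentioned above, using rpart purely as an illustrative high-capacity learner (the package’s own model would take its place):

```r
# A high-capacity model should drive training error to ~0 on a small,
# noiseless, perfectly separable problem; failure to do so is a red flag.
library(rpart)

set.seed(1)
dat <- data.frame(x = runif(200))
dat$y <- factor(dat$x > 0.5)               # perfectly separable labels

fit  <- rpart(y ~ x, data = dat,
              control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))
pred <- predict(fit, dat, type = "class")

mean(pred != dat$y)                        # training error; expect 0 here
```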

With respect to implementation in a standardised way that fosters automation, I’ve found the Scienceverse project and R package really interesting.

From the package docs:

The goal of scienceverse is to generate and process machine-readable study descriptions to facilitate archiving studies, pre-registering studies, finding variables and measures used in other research, meta-analyses of studies, and finding and re-using datasets in other ways.

…allow users to generate rich, standardized metadata describing experiments, materials, data, code, and any other research components that scholars want to share. Such standardization would facilitate reproducibility, cumulative science (e.g., meta-analysis) and reuse (e.g., finding datasets with specific measures). While many projects focus on making data FAIR, Scienceverse aims to make every aspect of research findable, accessible, interoperable and reusable.

Late to the party, trying to catch up. Lots to talk about.

  • “report poor convergence”: I agree that packages should report whatever they know about (optim()'s quietly reporting a non-zero convergence code has bitten me many times), but also agree that good convergence diagnostics aren’t always available.
  • “report strong dependence”: how and why? Are we reporting strong dependence among predictor variables (which might lead to poor computational/numerical behaviour in some cases) or among estimated parameters? It’s certainly useful to be able to report e.g. correlations among estimated parameters (where that’s applicable), but the degree of correlation that is interesting is highly case-specific (even within use of a particular package).
  • re diagnostics/tests of assumptions: agree with @eric-pedersen that null-hypothesis testing may encourage bad behaviour; graphical diagnostics are good. On the other hand, how opinionated are we aiming to be?
  • there is an entire ecosystem of post-processing packages available (off the top of my head: broom, effects, emmeans, DHARMa, sjstats, sjplot, multcomp …); how much do we encourage package authors to interface with (which parts of?) this ecosystem?
  • probably worth laying out the entire set of “default” accessor methods (not all of which will be applicable in a given case) and prioritizing them (a skeleton sketch follows this list).
    • vcov() is important for post-processing (but how should models with multiple options, e.g. models with zero-inflation and conditional components, handle this?)
    • the simulate() method is extremely useful (and I have a personal opinion that allowing it to be extended with newdata and newparams arguments makes it much more useful)
  • agree with @eric-pedersen that consistent handling of the offset parameter is important
  • agree with @alexpghayes that it’s really useful to be able to get an ‘unevaluated’ or ‘unfitted’ version of a model object back (Julia does this in a sensible way, IIRC)
  • also agree with @alexpghayes that careful thought about saving training data (save? don’t save? allow an option? allow an option to have training data shared among several model objects?) is important (cf. the butcher package). On the other hand, users need to know what they can and can’t expect to do with models once they’re fitted if they haven’t stored the training data
  • update() is very useful but generally very fragile
  • allow options for scaling/centering predictors?
  • important to be clear about what scale parameters are estimated on (constrained vs. unconstrained), i.e. what is the “link function” for a parameter, and what does coef() give you by default?
  • I don’t dislike BDR’s development notes as much as @alexpghayes does: I agree that they could emphasize different aspects, but there’s a lot of valuable stuff in there - especially the little-known details about handling data-dependent bases (see ?makepredictcall) …
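
A skeleton of what that “default” accessor set might look like for a hypothetical class "mymod" (all names here are illustrative, not from any existing package; each method simply exposes what the fitting function stored):

```r
# Hypothetical accessor methods for a fitted object of class "mymod".
coef.mymod      <- function(object, ...) object$coefficients
vcov.mymod      <- function(object, ...) object$vcov
nobs.mymod      <- function(object, ...) object$nobs
logLik.mymod    <- function(object, ...) {
  structure(object$loglik, df = length(object$coefficients),
            nobs = object$nobs, class = "logLik")
}
residuals.mymod <- function(object, type = c("response", "pearson"), ...) {
  type <- match.arg(type)
  # ... return residuals on the requested scale ...
  object$residuals
}
simulate.mymod  <- function(object, nsim = 1, seed = NULL,
                            newdata = NULL, newparams = NULL, ...) {
  # optionally extended with newdata/newparams, as suggested above
  # ... draw nsim response vectors from the fitted model ...
}
```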
1 Like

I am coming to this quite late, but wanted to address the “how” a little bit more than the discussion has so far.

  • I would love to see encouragement for new modeling packages to adopt boring vanilla data formats for input and output (nothing fancy or unique) and S3 methods. This contributes both to correctness and to interoperability with the ecosystem of post-processing packages that @bbolker mentioned. Certainly hardhat offers scaffolding for this, if modelers are interested in using it.

  • Clarity about data validation and data transformation (how are factors handled? so differently by so many models! scaling/centering? :woman_shrugging: who even knows when this happens) is so important. I see this as something to consider in implementation (good practices around validation and transformation) and documentation, i.e. “factor predictors are converted to…” in docs.

  • I haven’t seen much discussion here of just really straightforward testing for correctness and how to do it, although I guess that is in @alexpghayes’s blog post. That is what I really would like to see in peer review standards, more even than (what seem to me) more subtle issues around assumptions. Some ideas to suggest there are using simulated datasets, resampling real ones, etc.
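
One concrete pattern for that kind of correctness testing is parameter recovery on simulated data (sketched with lm() standing in for the package’s own fitting function):

```r
# Simulate data with known coefficients, fit, and check the estimates
# land close to the truth.
library(testthat)

test_that("known coefficients are recovered from simulated data", {
  set.seed(42)
  n    <- 5000
  beta <- c(1, -2, 0.5)
  x1   <- rnorm(n)
  x2   <- rnorm(n)
  y    <- beta[1] + beta[2] * x1 + beta[3] * x2 + rnorm(n, sd = 0.1)

  fit <- lm(y ~ x1 + x2)
  expect_equal(unname(coef(fit)), beta, tolerance = 0.05)  # loose tolerance
})
```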