Statistical Software: Bayesian Analyses

Tags: #<Tag:0x00007fc004948628>

Hi all! This thread is for discussion of standards for peer review of Bayesian packages, one of our initial statistical package categories. We’re interested in the community’s input as to what standards should apply specifically to such packages.

Bayesian packages focus on Bayesian methods of statistical analyses, via input featuring coherent and unified representations of prior distributions and output representations of posterior distributions, often but not exclusively fit by MCMC approaches. Note that categories or non-exclusive; Bayesian packages may also be, for instance, Regression or Time Series packages.

We’d like your answers to these questions:

  • What should a Bayesian packages do so as to pass a peer review process?
  • How should Bayesian software do that?
  • What should be documented in Bayesian packages?
  • How should Bayesian packages be tested?

Don’t worry if your contributions might apply to other categories. We will look at suggestions across topics to see which should be elevated to more general standards. Please do comment on whether you think a standard should be required or recommended for a package to pass peer-review. We’ll use these comments as input for prioritizing standards in the future.

For some useful reference, see our background chapter. See also the CRAN Task View on Bayesian Inference, the BayesValidate package or the Stan team’s benchmarks. Feel free to drop other references in the thread!

1 Like

Side note: the approach in BayesValidate is fairly old and was found to have some problems. Some Stan affiliated people improved the methodology into something more robust and useful under the name “simulation based calibration” ([1804.06788] Validating Bayesian Inference Algorithms with Simulation-Based Calibration).


Some further considerations on approaches to assessing software for Bayesian analysis:

The core of a Bayesian software package is the estimation of posterior distributions. These are generally sampled from a range of possible distributional estimates, and so this procedure is also commonly referred to as sampling. Bayesian software can accordingly be distinguished between the two primary sub-categories of software which implements its own sampling procedures, and software which relies on external, pre-existing software to generate samples. The latter software often implements Bayesian procedures in order to develop domain-specific models, while the former tend to be more general in application.

Bayesian Software with Internal Sampling Procedures

Numerous software packages for sampling posterior distributions have been developed in most computer languages, and standard algorithms for doing so are both stable and well developed. The implementation of new sampling procedures may thus be presumed to reflect some advance in the underlying sampling procedures, the algorithmic implementation thereof, or both. Given this presumption, the following standards may be considered to apply:

  1. The sampling procedure should have some form of support, typically either through published literature or previous implementations in other languages.
  2. Evidence should be provided, either through tests or perhaps separate vignettes, of how the implemented sampling procedure compares with other procedures.

Other general points for standards might usefully include:

  • The use of conjugate prior relationships for common distributional forms?

Bayesian Software with External Sampling Procedures

Most packages which rely on other software to generate posterior samples are designed to construct specific classes of models, for which the following aspects may be considered in software assessment:

  • Documentation of sampling procedure(s) and associated packages used.
  • Documentation of how to use or access different sampling procedures (where appropriate)?
  • Are methods implemented, or is guidance provided, for hyperparameter tuning?
  • Are improper prior distributions admitted? If so, are or can appropriate diagnostic messages be provided?
  • Is guidance provided for selecting appropriate prior distributions?
  • Are messages or warnings issued for low effective sample sizes for posterior distributions?
  • Do tests examine sensitivity to different forms of prior distributions?
  • Are any restrictions on admissible forms of prior distribution clearly explained? Do non-admissible forms generate appropriate messages, warnings, or errors?
  • Do tests consider and compare multiple sets of input data?

Many of these points may also be considered for the first category, and perhaps be best considered general points, with the preceding category detailing only points applying specifically to software which implements its own sampling procedures.

General Standards

Regardless of the above categories, Bayesian software may be generally expected to:

  • Implement output structures or classes which are as compatible as possible with other packages for Bayesian analyses (obviously rstan is a special case here, and so other packages might also rightly be expected to only loosely adhere to any such requirements or suggestions)
  • Write tests and/or vignettes to demonstrate conversion between output classes from various packages
  • Report and provide methods to extract prior distributional (hyper-)parameters from output class
  • Report and provide methods to extract sampling parameters from output class
  • Clearly describe how missing values are handled
  • Clearly describe acceptable types of input data
  • Clearly describe the structure of output data (including Class implementations where appropriate)
  • Either implement generic plot and summary methods for return objects, or ensure object class structures work with external plot and summary methods.

Diagnostic procedures

The performance of Bayesian algorithms and models may also be assessed via a number of diagnostic routines. It may be useful to recommend the use of one or more of the following:

  • The use of coda diagnostics
  • The use of Simulation-Based Calibration for rstan models, through the rstan::sbc() function, or equivalent procedures for other packages.
  • Messages or warnings for extreme scale reduction factors (R_hat << 1 | R_hat >> 1)
  • Presence of additional diagnostic procedures such as for convergence of estimates
  • Convergence of posterior distributions with maximum likelihood estimates for flat prior distributions.
1 Like

Should there also be an option to obtain samples from the prior distribution? Or to get a complete “pre-data” version of whatever the posterior object looks like?

Diagnostic procedures

Perhaps is samples are returned they should be returned in such a way that samples from separate chains (if used) are labeled as such?

How should Bayesian packages be tested?

We (Stan devs) are developing a testing framework (bayesbench). We’ve started by creating a posteriordb repository and tools. The posteriordb repository contains data and models to produce posteriors based on different probabilistic programming languages (PPL). The planned use cases are

  • Testing implementations of inference algorithms with asymptotic bias (such as Variational inference)
  • Software testing
    • Testing implementations of inference algorithms with asymptotically decreasing bias and variance (such as MCMC)
    • Unit testing
    • System testing
    • Performance testing
    • Regression testing
  • Efficiency comparisons of inference algorithms with asymptotically decreasing bias and variance
  • Explorative analysis of algorithms
  • Developing new algorithms for interesting models
  • Code examples
  • Models and data for teaching

See more at GitHub - MansMeg/posteriordb: Database with posteriors of interest for Bayesian inference and especially the use cases document.

We have currently support for R and Python. Currently most models are in Stan, but we have examples how to include models using other frameworks like PyMC3 and TF Probability.

EDIT: We are very happy to get more people involved or to discuss how we can support more frameworks.


A small point on what evidence could include here because it comes up pretty often when people compare against Stan. I’ve been dawdling about a larger writeup about this, but the things we care about wrt mcmc performance benchmarks are pretty different than other styles of algorithms

Comparing Procedures/Implementations

ESS/Second: Taking N samples after MCMC chains have “converged” (big air quotes there) how many effective samples do you get a second?

ESS/warmup: The hurdle rate, i.e. given M warmup steps and N samples, how many effective samples do I get back? I first saw this in a nimble vignette and am suprised it never caught on!

SBC: Over a handful of models you should be able to recover parameters and if not be able to explain why.

Raw Wall Time: Almost never. The only time you can ever really use this is if you are using the exact same software but changing mechanical things like memory structure or hardware.

Comparing Different Procedure Implementations

gradient evaluations a second: For sampling procedures that need to take a gradient I’ve found this useful (see figure 6 here) Take the number of gradient evaluations the model performs over the total time the chains ran. The usefulness of this measure is conditional on the implementation performing with the first two metrics as well. This can be nice when you are comparing the same procedure/algorithms across different PPLs

Anyway, I’m pretty sure posteriordb handles a lot of these (@avehtari?)

1 Like

Thanks @avehtari and @SteveBronder, especially for drawing our attention to the posteriordb package. That is very much the kind of thing we’ve had in mind for this project in general - reference sets of tried-and-tested data to use for algorithmic calibration, verification, and testing. We will very likely adopt and adapt those data for our project, including adapting for application beyond stan alone. We’ll likely proceed by combining that kind of test specificity with the kind of general structure exemplified by the test suite of the python botorch package for Bayesian optimization.

Efficiency concerns might not be a high priority in our initial phase, because our focus will remain on scientific statistical software, and we anticipate efforts to develop and implement new algorithms which may not initially be coded for efficiency alone. What we may pursue instead are some kinds of structured standards for comparing and contrasting efficiency.

Here is a provisional draft of standards for Bayesian software - please comment! All those interested are also invited to edit these standards on this document. Note that the following list is intended to illustrate the nature of standards we envision implementing, and to serve as a concrete starting point for further discussion, rather than to be an exhaustive list. We are particularly interested in hearing opinions regarding aspects which we may have missed here.

Bayesian Software


Many of the following standards are written with reference to how software should function. Such aspects can and should often also be tested. Where testing of described functionality is expected, a “(TEST)” is added to the description.

Standards regarding documentation imply doing so at appropriate places within the software; either within functions themselves, within extended Vignettes, within the main README document of a package, or elsewhere.

Class Systems

  • Bayesian Software should use or implement explicit class systems
  • Bayesian Software may extend common class systems used in other packages for Bayesian analysis; see the CRAN Task view on Bayesian Inference. CRAN Task View: Bayesian Inference) for reference to other packages and class systems.
  • Class systems should offer default summary and plot methods


Bayesian software should:

  • Return appropriate metadata documenting types (or classes) and dimensions of input data
  • Either return the input function with results, or enable direct access to such via additional functions which accept the return object as single argument.
  • Implement convergence checkers and enable computations to be stopped on convergence
  • Ensure appropriate mechanisms are provided for models which do not converge. This is often achieved by having a default value for numbers of iterations, and stopping there if not converged (TEST).
  • Return convergence statistics or equivalent (such as full samples before and after convergence), to allow statistical comparison between different sampling algorithms (TEST)
  • Return appropriate values sufficient to indicate absence of convergence (TEST)
  • Allow convergence checkers to be over-ridden in order to generate chains of pre-specified lengths (TEST)
  • Allow fine control of output verbosity, including an option to suppress all output messages. This should generally be achieved through either at least two parameters, or one parameter specifying multiple levels of output. One minimal level of output should determine whether normal output (for example of computational progress) is produced or not, and a second should determine whether useful diagnostic messages (for example, regarding convergence, or missing-value substitution) are produced or not.


Bayesian software should document:

  • How to specify and modify prior distributions
  • How to extract and compare convergence statistics
  • How to use results of function to restart chain(s) at previous point
  • How to use different sampling algorithms
  • How to use stopping criteria


Bayesian software should

  • Only use the term “hyperparameter” to refer to parameters of prior distributions, and not to other parameters controlling the computational algorithm (which are described further below).

Bayesian software should implement the following pre-processing checks on hyperparameters:

  • Ensure that hyperparameter vectors are commensurate with expected model input (TEST).
  • Implement relevant pre-processing checks on hyperparameters, particularly on all second-order hyperparameters (for example, variance or shape hyperparameters) to ensure non-negativity (TEST).

Consider the following example of a log-likelihood estimator for a linear regression, controlled via a vector of three parameters, p:

ll <- function (x, y, p) dnorm (y - (p[1] + x * p[2]), sd = p[3], log = TRUE)

Pre-processing stages should be used to determine:

  1. Numbers of parameters represented by any of the input arguments to the main log-likelihood function (see example below) (TEST).
  2. That the length of the vector of hyperparameters matches this length (TEST).
  3. That dimensions of input data are commensurate; typically by ensuring that inputs and outputs have same numbers of rows or same lengths; non-commensurate inputs should error by default (TEST).

For example, the rstan package only allows named model parameters, and so largely avoids problems of mismatched lengths of parameter vectors, yet does not ensure the existence of named parameters prior to starting the computational chains. This ultimately results in each chain generating an error when non-existent parameters are named. Such controls should be part of a single pre-processing stage, and so should only generate a single error.

Determining numbers of parameters represented by input arguments requires parsing the underlying function objects passed to internal algorithms. Although not a trivial operation, the following code illustrates how this can be achieved. Presume the above line defines ll as an input function defining the log-likelihood of a linear regression model. R has very good utilities for parsing function calls, primarily through the getParseData function from the utils package. The parse data for a function can be extracted with the following lines:

p <- parse (deparse (ll))
x <- getParseData (p)

The object x is a data.frame of every R token (such as an expression, symbol, or operator) parsed from the function ll. Input arguments used to define parameter vectors in any R software are accessed through R’s standard vector access syntax of vec[i], for some element i of a vector vec. The parse data for such begins with the SYMBOL of vec, the [, a NUM_CONST for the value of i, and a closing ]. The following code can be used to extract elements of the parse data which match this pattern, and ultimately to extract the various values of i used to access members of vec.

vector_length <- function (x, i) {
    xn <- x [which (x$token %in% c ("SYMBOL", "NUM_CONST", "'['", "']'")), ]
    # split resultant data.frame at first "SYMBOL" entry
    xn <- split (xn, cumsum (xn$token == "SYMBOL"))
    # reduce to only those matching the above pattern
    xn <- xn [which (vapply (xn, function (j)
                             j$text [1] == i & nrow (j) > 3,
                             logical (1)))]
    ret <- NA_integer_ # default return value
    if (length (xn) > 0) {
        # get all values of NUM_CONST as integers
        n <- vapply (xn, function (j)
                         as.integer (j$text [j$token == "NUM_CONST"] [1]),
                         integer (1), USE.NAMES = FALSE)
        # and return max of these
        ret <- max (n)
    return (ret)

That function can then be used to determine the length of any inputs which are used as hyperparameter vectors:

ll <- function (p, x, y) dnorm (y - (p[1] + x * p[2]), sd = p[3], log = TRUE)
p <- parse (text = deparse (ll))
x <- utils::getParseData (p)

# extract the names of the parameters:
params <- unique (x$text [x$token == "SYMBOL"])
lens <- vapply (params, function (i) vector_length (x, i), integer (1))
#>  y  p  x 
#> NA  3 NA

And the vector p is used as a hyperparameter vector containing three parameters. Any initial value vectors can then be examined to ensure that they have this same length.

Computational Parameters

Computational parameters are considered here as those passed to Bayesian functions other than hyperparameters determining the forms of prior distributions. They typically include parameters controlling lengths of runs, lengths of burn-in periods, numbers of parallel computations, other parameters controlling how samples are to be generated, or convergence criteria.

All Computational Parameters should be checked for general “sanity” prior to calling primary computational algorithms. Standard sanity checks include:

  • Checks that values are positive (for all length-type parameters) (TEST)
  • Checks that arguments are of expected classes or types (for example, check that integer-type arguments are indeed integer, with explicit conversion via as.integer where not) (TEST).
  • Automatic rejection of parameters of inappropriate type (for example character values passed for integer-type parameters).
  • Either automatic rejection or appropriate diagnostic messaging for parameters of inappropriate length; for example passing a vector of length > 1 to a parameter presumed to define a single value.

Data Structure

Bayesian software should:

  • Implement pre-processing routines to diagnose perfect collinearity, and provide appropriate diagnostic messages or warnings (TEST).
  • Provide distinct routines for processing perfectly collinear data, potentially bypassing sampling algorithms (TEST).
    • An appropriate test would confirm that system.time() or equivalent timing expressions for perfectly collinear data should be less than equivalent routines called with non-collinear data.
    • Alternatively, a test could ensure that perfectly collinear data passed to a function with a stopping criteria generated no results, while specifying a fixed number of iterations may generate results.
  • Provide options for users to specify how to handle missing data, with options minimally including:
    • warn or error on missing data (TEST), in which case appropriate documentation should be provided.
    • either remove or replace missing data, with the latter option further enabling specification of how values may be imputed (TEST).

Seed control

Bayesian software should:

  • Enable seeds to be passed as a parameter, either explicitly (a direct seed argument or similar) or implicitly (via starting values for chain) (TEST)
  • Enable results of previous runs to be used as starting points for subsequent runs
  • Include seed(s) or starting value(s) within return object (TEST)

Software which implements parallel processing should:

  • Ensure each chain is started with a different seed by default (TEST)
  • Issue diagnostic messages when same seeds are passed to distinct computational chains (TEST)
  • Explicitly document advice not to use set.seed()
  • Provide the parameter with a plural name: for example, “starting_values” and not “starting_value”
  • Include the seeds or starting values for each chain within return object (TEST)

To avoid potential confusion between separate parameters to control random seeds and starting values, we recommended a single “starting values” rather than “seeds” argument, which appropriate translation of these parameters into seeds where necessary.


Bayesian software should implement the following tests (where appropriate):

Sensitivity to hyperparameters

As an example of testing sensitivity to hyperparameters, consider the linear regression function from above,

ll <- function (x, y, p) dnorm (y - (p[1] + x * p[2]), sd = p[3], log = TRUE)

The hyperparameters are expressed as a single vector, p, with three elements. The first two values of p are the intercept and slope, the scales of which are entirely arbitrary within that function. Posterior distributions for these parameters should therefore be statistically independent of starting values. Such independence can be tested by writing tests to statistically compare posterior distributions using different initial values, and expecting no significant difference (for example, via a simple t.test).


Convergence should be tested in the following two ways:

  1. Convergence checkers should be tested by ensuring that algorithms with convergence checkers stop at some point less than an otherwise nominated number of steps, or less than a result run without a convergence checker.
  2. Convergence should then be tested by using the results from the above test as the starting point for an additional run of the same fixed length (that is, without automatic convergence checking), and ensuring that the two posterior distributions are statistically equivalent.

Where convergence checkers are themselves parameterised, the effects of such parameters should also be tested. For threshold parameters, for example, lower values should taken result in longer chains needed for convergence, and that should be tested.

Internal Sampling Algorithms

Packages which implement their own sampling algorithms should document

  • Support for algorithm (via citation, or reference to other software)
  • Comparisons with external samplers which demonstrate intended advantage of implementation