Here is a provisional draft of standards for Bayesian software - please comment! All those interested are also invited to edit these standards on this hackmd.io document. Note that the following list is intended to illustrate the nature of standards we envision implementing, and to serve as a concrete starting point for further discussion, rather than to be an exhaustive list. We are particularly interested in hearing opinions regarding aspects which we may have missed here.

# Bayesian Software

## Standards

Many of the following standards are written with reference to how software should function. Such aspects can and should often also be *tested*. Where testing of described functionality is expected, a “(TEST)” is added to the description.

Standards regarding *documentation* imply documenting at appropriate places within the software; either within functions themselves, within extended vignettes, within the main `README` document of a package, or elsewhere.

### Class Systems

- Bayesian software should use or implement explicit class systems.
- Bayesian software may extend common class systems used in other packages for Bayesian analysis; see the CRAN Task View on Bayesian Inference for references to other packages and class systems.
- Class systems should offer default `summary` and `plot` methods.

### General

Bayesian software should:

- Return appropriate metadata documenting types (or classes) and dimensions of input data
- Either return the input function with results, or enable direct access to such via additional functions which accept the return object as their single argument.
- Implement convergence checkers and enable computations to be stopped on convergence
- Ensure appropriate mechanisms are provided for models which do not converge. This is often achieved by having a default value for numbers of iterations, and stopping there if not converged (TEST).
- Return convergence statistics or equivalent (such as full samples before and after convergence), to allow statistical comparison between different sampling algorithms (TEST)
- Return appropriate values sufficient to indicate absence of convergence (TEST)
- Allow convergence checkers to be over-ridden in order to generate chains of pre-specified lengths (TEST)
- Allow fine control of output verbosity, including an option to suppress all output messages. This should generally be achieved either through at least two parameters, or through one parameter specifying multiple levels of output. One minimal level of output should determine whether normal output (for example, of computational progress) is produced, and a second should determine whether useful diagnostic messages (for example, regarding convergence or missing-value substitution) are produced.
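As a minimal sketch of such verbosity control, the following function uses two parameters; the names `verbose` and `diagnostics`, and the function itself, are illustrative assumptions rather than prescribed interfaces:

```
# Sketch only: function and parameter names are hypothetical.
my_sampler <- function (n_samples = 1000, verbose = TRUE, diagnostics = TRUE) {
    if (verbose) {
        message ("sampling ", n_samples, " values ...") # normal progress output
    }
    samples <- rnorm (n_samples)
    if (diagnostics) {
        message ("note: no convergence check performed") # diagnostic output
    }
    invisible (samples)
}

# suppress all output:
s <- my_sampler (verbose = FALSE, diagnostics = FALSE)
```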

### Documentation

Bayesian software should *document*:

- How to specify and modify prior distributions
- How to extract and compare convergence statistics
- How to use results of function to restart chain(s) at previous point
- How to use different sampling algorithms
- How to use stopping criteria

### Hyperparameters

Bayesian software should:

- Only use the term “hyperparameter” to refer to parameters of prior distributions, and not to other parameters controlling the computational algorithm (which are described further below).

Bayesian software should implement the following pre-processing checks on hyperparameters:

- Ensure that hyperparameter vectors are commensurate with expected model input (TEST).
- Implement relevant pre-processing checks on hyperparameters, particularly on all second-order hyperparameters (for example, variance or shape hyperparameters) to ensure non-negativity (TEST).
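A minimal sketch of such checks, assuming (as in the example below) that hyperparameters are passed as a single numeric vector whose final element is a standard deviation; the function name is hypothetical:

```
# Sketch only: the function name and expected structure are assumptions.
check_hyperpars <- function (p, expected_length = 3L) {
    if (!is.numeric (p)) {
        stop ("hyperparameters must be numeric") # type check
    }
    if (length (p) != expected_length) {
        stop ("hyperparameter vector must have length ", expected_length)
    }
    if (p [3] < 0) {
        # second-order (scale) hyperparameters must be non-negative
        stop ("standard deviation must be non-negative")
    }
    invisible (p)
}

check_hyperpars (c (0, 1, 1)) # passes silently
```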

Consider the following example of a log-likelihood estimator for a linear regression, controlled via a vector of three parameters, `p`:

```
ll <- function (x, y, p) dnorm (y - (p[1] + x * p[2]), sd = p[3], log = TRUE)
```

Pre-processing stages should be used to determine:

- Numbers of parameters represented by any of the input arguments to the main log-likelihood function (see example below) (TEST).
- That the length of the vector of hyperparameters matches this number (TEST).
- That dimensions of input data are commensurate; typically by ensuring that inputs and outputs have the same numbers of rows or the same lengths; non-commensurate inputs should error by default (TEST).

For example, the `rstan` package only allows named model parameters, and so largely avoids problems of mismatched lengths of parameter vectors, yet it does not ensure the existence of named parameters prior to starting the computational chains. This ultimately results in each chain generating an error when non-existent parameters are named. Such controls should be part of a single pre-processing stage, and so should generate only a single error.

Determining numbers of parameters represented by input arguments requires parsing the underlying `function` objects passed to internal algorithms. Although not a trivial operation, the following code illustrates how this can be achieved. Presume the above line defines `ll` as an input function defining the log-likelihood of a linear regression model. R has very good utilities for parsing function calls, primarily through the `getParseData` function from the `utils` package. The parse data for a function can be extracted with the following lines:

```
p <- parse (text = deparse (ll))
x <- getParseData (p)
```

The object `x` is a `data.frame` of every R token (such as an expression, symbol, or operator) parsed from the function `ll`. Input arguments used to define parameter vectors in any R software are accessed through R's standard vector access syntax of `vec[i]`, for some element `i` of a vector `vec`. The parse data for such begins with the `SYMBOL` of `vec`, the opening `[`, a `NUM_CONST` for the value of `i`, and a closing `]`. The following code can be used to extract elements of the parse data which match this pattern, and ultimately to extract the various values of `i` used to access members of `vec`.

```
vector_length <- function (x, i) {

    xn <- x [which (x$token %in% c ("SYMBOL", "NUM_CONST", "'['", "']'")), ]
    # split resultant data.frame at each "SYMBOL" entry
    xn <- split (xn, cumsum (xn$token == "SYMBOL"))
    # reduce to only those matching the above pattern
    xn <- xn [which (vapply (xn, function (j)
                             j$text [1] == i & nrow (j) > 3,
                             logical (1)))]

    ret <- NA_integer_ # default return value
    if (length (xn) > 0) {
        # get all values of NUM_CONST as integers
        n <- vapply (xn, function (j)
                     as.integer (j$text [j$token == "NUM_CONST"] [1]),
                     integer (1), USE.NAMES = FALSE)
        # and return max of these
        ret <- max (n)
    }

    return (ret)
}
```

That function can then be used to determine the length of any inputs which are used as hyperparameter vectors:

```
ll <- function (p, x, y) dnorm (y - (p[1] + x * p[2]), sd = p[3], log = TRUE)
p <- parse (text = deparse (ll))
x <- utils::getParseData (p)
# extract the names of the parameters:
params <- unique (x$text [x$token == "SYMBOL"])
lens <- vapply (params, function (i) vector_length (x, i), integer (1))
lens
#> y p x
#> NA 3 NA
```

The vector `p` is thus identified as a hyperparameter vector containing three parameters. Any initial value vectors can then be examined to ensure that they have this same length.
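Such a check on a supplied vector of initial values might be sketched as follows, using the inferred length of 3 for `p`; the function and argument names are hypothetical:

```
# Sketch only: validate user-supplied initial values against inferred length.
check_inits <- function (inits, len_expected) {
    if (length (inits) != len_expected) {
        stop ("initial values must be of length ", len_expected,
              "; length ", length (inits), " was supplied")
    }
    invisible (inits)
}

check_inits (c (0, 1, 1), len_expected = 3) # passes silently
```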

### Computational Parameters

Computational parameters are considered here as those passed to Bayesian functions other than hyperparameters determining the forms of prior distributions. They typically include parameters controlling lengths of runs, lengths of burn-in periods, numbers of parallel computations, other parameters controlling how samples are to be generated, or convergence criteria.

All Computational Parameters should be checked for general “sanity” prior to calling primary computational algorithms. Standard sanity checks include:

- Checks that values are positive (for all length-type parameters) (TEST)
- Checks that arguments are of expected classes or types (for example, check that `integer`-type arguments are indeed `integer`, with explicit conversion via `as.integer` where not) (TEST).
- Automatic rejection of parameters of inappropriate type (for example, `character` values passed for `integer`-type parameters).
- Either automatic rejection or appropriate diagnostic messaging for parameters of inappropriate length; for example, passing a vector of length > 1 to a parameter presumed to define a single value.
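These checks might be sketched as follows for a single parameter controlling run length; the parameter name `n_iter` is illustrative only:

```
# Sketch only: parameter name is hypothetical.
check_n_iter <- function (n_iter) {
    if (is.character (n_iter)) {
        stop ("n_iter must be numeric, not character") # reject wrong type
    }
    if (length (n_iter) > 1) {
        warning ("n_iter has length > 1; only the first value will be used")
        n_iter <- n_iter [1]
    }
    if (n_iter <= 0) {
        stop ("n_iter must be positive") # length-type parameters are positive
    }
    as.integer (n_iter) # explicit conversion to expected type
}

check_n_iter (1000.0) # returns 1000L
```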

### Data Structure

Bayesian software should:

- Implement pre-processing routines to diagnose perfect collinearity, and provide appropriate diagnostic messages or warnings (TEST).
- Provide distinct routines for processing perfectly collinear data, potentially bypassing sampling algorithms (TEST).
    - An appropriate test would confirm that `system.time()` or equivalent timing expressions for perfectly collinear data are *less* than for equivalent routines called with non-collinear data.
    - Alternatively, a test could ensure that perfectly collinear data passed to a function with a stopping criterion generate no results, while specifying a fixed number of iterations may generate results.
- Provide options for users to specify how to handle missing data, with options minimally including:
    - warn or error on missing data (TEST), in which case appropriate documentation should be provided;
    - either remove or replace missing data, with the latter option further enabling specification of how values may be imputed (TEST).
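A perfect-collinearity diagnosis might, for example, compare the rank of the design matrix with its number of columns; the following is a sketch only, with a hypothetical function name:

```
# Sketch only: diagnose perfect collinearity via the matrix rank.
check_collinearity <- function (X) {
    r <- qr (X)$rank
    if (r < ncol (X)) {
        warning ("design matrix is perfectly collinear (rank ", r,
                 " < ", ncol (X), " columns)")
    }
    invisible (r)
}

X <- cbind (1, 1:10, 2 * (1:10)) # third column is twice the second
check_collinearity (X) # warns: rank 2 < 3 columns
```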

### Seed control

Bayesian software should:

- Enable seeds to be passed as a parameter, either explicitly (via a direct `seed` argument or similar) or implicitly (via starting values for chains) (TEST)
- Enable results of previous runs to be used as starting points for subsequent runs
- Include seed(s) or starting value(s) within the return object (TEST)

Software which implements parallel processing should:

- Ensure each chain is started with a different seed by default (TEST)
- Issue diagnostic messages when the same seeds are passed to distinct computational chains (TEST)
- Explicitly document advice *not* to use `set.seed()`
- Provide the parameter with a *plural* name: for example, “starting_values” and not “starting_value”
- Include the seeds or starting values for each chain within the return object (TEST)

To avoid potential confusion between separate parameters controlling random seeds and starting values, we recommend a single “starting values” argument rather than a “seeds” argument, with appropriate translation of these values into seeds where necessary.
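The translation of plural starting values into distinct per-chain seeds might be sketched as follows; the function and parameter names are hypothetical, and `set.seed()` is called only internally, so users need not call it themselves:

```
# Sketch only: derive distinct per-chain seeds from plural starting values.
run_chains <- function (starting_values, n_samples = 100) {
    if (anyDuplicated (starting_values)) {
        message ("identical starting values passed to distinct chains")
    }
    lapply (seq_along (starting_values), function (chain) {
        set.seed (chain) # distinct seed for each chain by default
        list (start = starting_values [[chain]],
              seed = chain, # seed returned within result object
              samples = starting_values [[chain]] + cumsum (rnorm (n_samples)))
    })
}

res <- run_chains (starting_values = list (0, 1, 2))
```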

### Testing

Bayesian software should implement the following tests (where appropriate):

**Sensitivity to hyperparameters**

As an example of testing sensitivity to hyperparameters, consider the linear regression function from above,

```
ll <- function (x, y, p) dnorm (y - (p[1] + x * p[2]), sd = p[3], log = TRUE)
```

The hyperparameters are expressed as a single vector, `p`, with three elements. The first two values of `p` are the intercept and slope, the scales of which are entirely arbitrary within that function. Posterior distributions for these parameters should therefore be statistically independent of starting values. Such independence can be tested by writing tests which statistically compare posterior distributions using different initial values, and expecting no significant difference (for example, via a simple `t.test`).
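Such a test might be sketched as follows, with the posterior samples from two runs mocked here by independent draws from the same distribution:

```
# Sketch only: posterior samples from two runs with different starting
# values are mocked here by draws from the same distribution.
set.seed (1)
post_a <- rnorm (1000, mean = 2, sd = 1) # run with one set of initial values
post_b <- rnorm (1000, mean = 2, sd = 1) # run with different initial values
tt <- t.test (post_a, post_b)
# the posterior means should not differ significantly, so tt$p.value
# is expected to exceed any conventional significance threshold
```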

**Convergence**

Convergence should be tested in the following two ways:

- Convergence checkers should be tested by ensuring that algorithms with convergence checkers stop at some number of steps less than an otherwise nominated maximum, or less than the length of an equivalent run without a convergence checker.
- Convergence should then be tested by using the results from the above test as the starting point for an additional run of the same fixed length (that is, without automatic convergence checking), and ensuring that the two posterior distributions are statistically equivalent.

Where convergence checkers are themselves parameterised, the effects of such parameters should also be tested. For threshold parameters, for example, lower (stricter) values should result in longer chains being needed for convergence, and that effect should be tested.
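This behaviour can be illustrated with a toy iterative algorithm, in which a stricter tolerance requires more iterations before the stopping criterion is met; this is a sketch only, not a sampler:

```
# Sketch only: a toy algorithm whose convergence checker is parameterised
# by a threshold `tol`; stricter thresholds require longer runs.
run_until_converged <- function (tol, max_iter = 1000L) {
    val <- 1
    for (i in seq_len (max_iter)) {
        val <- val / 2 # toy update, converging towards zero
        if (val < tol) {
            return (i) # converged: return number of iterations used
        }
    }
    max_iter # stopped at nominated maximum without converging
}

# a stricter (lower) threshold requires a longer chain:
run_until_converged (tol = 1e-6) > run_until_converged (tol = 1e-2) # TRUE
```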

### Internal Sampling Algorithms

Packages which implement their own sampling algorithms should document:

- Support for the algorithm (via citation, or reference to other software)
- Comparisons with external samplers which demonstrate the intended advantages of the implementation