Statistical Software: General Standards

mpadge · August 24, 2020, 10:57am

We now have preliminary drafts of both general standards for statistical software, and specific extensions into a few initial categories - Regression, Bayesian and Monte Carlo, Exploratory Data Analysis, and Time Series. We would very much appreciate any feedback, comments, suggestions, improvements to any and all of the current standards. Everybody is encouraged to peruse the single “master” document in bookdown form, and to provide feedback in increasing orders of formality in one or more of the following ways:

The #stats-peer-review slack channel
The relevant sections of the discussion forum
- General Standards (here)
- Regression and Supervised Learning Standards
- Exploratory Data Analysis and Summary Statistics Standards
- Bayesian Analysis Standards
- Time Series Standards
The github repository for the “master” document, either via issues for general discussion, or pull requests for more concrete improvements.

Note that we anticipate some degree of “shuffling” between general and category-specific standards, much of which will be deferred until we have developed standards for all of our anticipated categories. There is thus one and only one form of comment for which we are currently not seeking feedback, which is comments regarding whether category-specific standard X might be better considered general standard Y - that will be worked out later.

Looking forward to any and all feedback from anybody interested in helping to create our system for peer reviewing statistical software. Without further ado, the relevant standards follow immediately below.

General Standards for Statistical Software

These standards refer to Data Types as the fundamental types defined by the R language itself, namely the following:

Continuous (numeric)
Integer
String / character
Date/Time
Factor
Ordered Factor

1. Documentation

The following standards describe several forms of what might be considered “Supplementary Material”. While there are many places within an R package where such material may be included, common locations include vignettes, or in additional directories (such as data-raw) listed in .Rbuildignore to prevent inclusion within installed packages.

Where software supports a publication, the following standard applies to all claims made in the publication with regard to software performance (for example, claims of algorithmic scaling or efficiency; or claims of accuracy):

G1.0 Software should include all code necessary to reproduce results which form the basis of performance claims made in associated publications.

Where claims regarding aspects of software performance are made with respect to other extant R packages, the following standard applies:

G1.1 Software should include code necessary to compare performance claims with alternative implementations in other R packages.

2. Input Structures

This section considers general standards for Input Structures. These standards may often effectively be addressed through implementing class structures, although this is not a general requirement. Developers are nevertheless encouraged to examine the guide to S3 vectors in the vctrs package as an example of the kind of assurances and validation checks that are possible with regard to input data. Systems like those demonstrated in that vignette provide a very effective way to ensure that software remains robust to diverse and unexpected classes and types of input data.

2.1 Uni-variate (Vector) Input

It is important to note for univariate data that single values in R are vectors with a length of one, and that 1L is of exactly the same data type as 1:n. Given this, inputs expected to be univariate should:

G2.0 Provide explicit secondary documentation of any expectations on lengths of inputs (generally implying identifying whether an input is expected to be single- or multi-valued)
G2.1 Provide explicit secondary documentation of expectations on data types of all vector inputs (see the above list).
G2.2 Appropriately prohibit or restrict submission of multivariate input to parameters expected to be univariate.
G2.3 Provide appropriate mechanisms to convert between different data types, potentially including:
- G2.3a explicit conversion to integer via as.integer()
- G2.3b explicit conversion to continuous via as.numeric()
- G2.3c explicit conversion to character via as.character() (and not paste or paste0)
- G2.3d explicit conversion to factor via as.factor()
- G2.3e explicit conversion from factor via as...() functions
G2.4 Where inputs are expected to be of factor type, secondary documentation should explicitly state whether these should be ordered or not, and those inputs should provide appropriate error or other routines to ensure inputs follow these expectations.
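
As a minimal sketch of how G2.0 to G2.4 might look in practice, the following base R helpers check the length, type and factor ordering of a univariate input. The names `check_scalar_numeric()` and `check_ordered_factor()` are hypothetical and used for illustration only; they are not part of any existing package.

```r
# Hypothetical helper: validate an input expected to be a single numeric value
check_scalar_numeric <- function (x, name = "x") {
    # G2.2: reject tabular or list input to a univariate parameter
    if (is.data.frame (x) || is.matrix (x) || is.list (x)) {
        stop (name, " must be a vector, not a ", class (x) [1], call. = FALSE)
    }
    # G2.0: enforce the documented expectation on length
    if (length (x) != 1L) {
        stop (name, " must have length 1, not ", length (x), call. = FALSE)
    }
    # G2.1: enforce the documented expectation on data type
    if (!is.numeric (x)) {
        stop (name, " must be numeric, not ", class (x) [1], call. = FALSE)
    }
    # G2.3b: explicit conversion, so integer input becomes continuous
    as.numeric (x)
}

# Hypothetical helper for G2.4: require an ordered factor
check_ordered_factor <- function (x, name = "x") {
    if (!is.factor (x)) {
        stop (name, " must be a factor", call. = FALSE)
    }
    if (!is.ordered (x)) {
        stop (name, " must be an ordered factor", call. = FALSE)
    }
    x
}

# G2.3e: conversion from factor needs care, because direct conversion
# returns the internal level codes rather than the labels
f <- factor (c ("10", "20", "10"))
as.integer (f)                 # level codes
#> [1] 1 2 1
as.integer (as.character (f))  # values of the labels
#> [1] 10 20 10
```

Stopping with an informative error is only one possible design; a package could equally well issue a warning and convert, provided the behaviour is documented.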

2.2 Tabular Input

This sub-section concerns input in “tabular data” forms, implying the two primary distinctions within R itself between array or matrix representations, and data.frame and associated representations. Among important differences between these two forms are that array/matrix classes are restricted to storing data of a single uniform type (for example, all integer or all character values), whereas data.frame and associated representations store each column as a list item, allowing different columns to hold values of different types. Further noting that a matrix may, as of R version 4.0, be considered as a strictly two-dimensional array, tabular inputs for the purposes of these standards are considered to imply data represented in one or more of the following forms:

matrix form when referring to specifically two-dimensional data of one uniform type
array form as a more general expression, or when referring to data that are not necessarily or strictly two-dimensional
data.frame
Extensions such as
- tibble
- data.table
- domain-specific classes such as
  tsibble for time series, or
  sf for spatial data.

The term "data.frame and associated forms" is assumed to refer to data represented in either the base::data.frame format, or any of the classes listed in the final of the above points.

General Standards applicable to software which is intended to accept any one or more of these tabular inputs are then that:

G2.5 Software should accept as input as many of the above standard tabular forms as possible, including extension to domain-specific forms
G2.6 Software should provide appropriate conversion routines as part of initial pre-processing to ensure that all other sub-functions of a package receive inputs of a single defined class or type.
G2.7 Software should issue diagnostic messages for type conversion in which information is lost (such as conversion of variables from factor to character; standardisation of variable names; or removal of meta-data such as those associated with sf-format data) or added (such as insertion of variable or column names where none were provided).
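
One way to approach G2.6 and G2.7 is a single pre-processing function that every exported function calls first. The sketch below uses base R only and converts matrix or data.frame-like input to a plain data.frame, with a message whenever information is added or may be lost. The name `standardise_tabular()` is hypothetical, and the sketch assumes that the input class provides (or inherits) an `as.data.frame()` method.

```r
# Hypothetical pre-processing step: convert accepted tabular input to a
# plain base data.frame, reporting information that is added or lost
standardise_tabular <- function (x) {
    if (is.matrix (x)) {
        if (is.null (colnames (x))) {
            # G2.7: information added
            message ("Input has no column names; names 'V1', 'V2', ... inserted.")
        }
        return (as.data.frame (x))
    }
    if (!is.data.frame (x)) {
        stop ("x must be a matrix or data.frame-like object, not ",
              class (x) [1], call. = FALSE)
    }
    extra <- setdiff (class (x), "data.frame")
    if (length (extra) > 0L) {
        # G2.7: information potentially lost
        message ("Input converted from class '", extra [1],
                 "' to 'data.frame'; class-specific meta-data may be lost.")
    }
    as.data.frame (x)
}

m <- matrix (1:4, nrow = 2)
class (standardise_tabular (m))
#> Input has no column names; names 'V1', 'V2', ... inserted.
#> [1] "data.frame"
```

All internal functions can then assume a base data.frame, which also removes the need to handle class-specific behaviour anywhere else in the package.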

The next standard concerns the following inconsistencies between three common tabular classes in regard to the column extraction operator, [.

class (x) # x is any kind of `data.frame` object
#> [1] "data.frame"
class (x [, 1])
#> [1] "integer"
class (x [, 1, drop = TRUE]) # default
#> [1] "integer"
class (x [, 1, drop = FALSE])
#> [1] "data.frame"

x <- tibble::tibble (x)
class (x [, 1])
#> [1] "tbl_df"     "tbl"        "data.frame"
class (x [, 1, drop = TRUE])
#> [1] "integer"
class (x [, 1, drop = FALSE]) # default
#> [1] "tbl_df"     "tbl"        "data.frame"

x <- data.table::data.table (x)
class (x [, 1])
#> [1] "data.table" "data.frame"
class (x [, 1, drop = TRUE]) # no effect
#> [1] "data.table" "data.frame"
class (x [, 1, drop = FALSE]) # default
#> [1] "data.table" "data.frame"

Extracting a single column from a data.frame returns a vector by default, and a data.frame if drop = FALSE.
Extracting a single column from a tibble returns a single-column tibble by default, and a vector if drop = TRUE.
Extracting a single column from a data.table always returns a data.table, and the drop argument has no effect.

Given such inconsistencies,

G2.8 Software should ensure that extraction or filtering of single columns from tabular inputs does not presume any particular default behaviour, and that all column-extraction operations behave consistently regardless of the class of tabular data used as input.

Adherence to the above standard G2.6 will ensure that any implicitly or explicitly assumed default behaviour will yield consistent results regardless of input classes.
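
A sketch of column extraction that does not depend on the default behaviour of [ is given below. It relies on the [[ operator, on the assumption that this returns the underlying column vector for base data.frame, tibble and data.table inputs alike, and on an explicit drop = FALSE after conversion where a single-column table is wanted. Both function names are hypothetical.

```r
# Hypothetical helper: extract one column as a plain vector
get_column <- function (x, i) {
    stopifnot (is.data.frame (x), length (i) == 1L)
    x [[i]]
}

# Hypothetical helper: extract one column as a single-column base data.frame
get_column_df <- function (x, i) {
    stopifnot (is.data.frame (x), length (i) == 1L)
    as.data.frame (x) [, i, drop = FALSE]
}

x <- data.frame (a = 1:3, b = c ("u", "v", "w"), stringsAsFactors = FALSE)
class (get_column (x, "a"))
#> [1] "integer"
class (get_column_df (x, "a"))
#> [1] "data.frame"
```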

2.3 Missing or Undefined Values

G2.9 Statistical Software should implement appropriate checks for missing data as part of initial pre-processing prior to passing data to analytic algorithms.
G2.10 Where possible, all functions should provide options for users to specify how to handle missing (NA) data, with options minimally including:
- G2.10a error on missing data
- G2.10b ignore missing data with default warnings or messages issued
- G2.10c replace missing data with appropriately imputed values
G2.11 Functions should never assume non-missingness, and should never pass data with potential missing values to any base routines with default na.rm = FALSE-type parameters (such as mean(), sd() or cor()).
G2.12 All functions should also appropriately handle undefined values (e.g., NaN, Inf and -Inf), including potentially providing options for ignoring or removing such values.
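
The options in G2.10 could be exposed through a single argument, as in the following sketch. The function name `handle_missing()` and the argument name `na_action` are hypothetical; treating NaN, Inf and -Inf as missing, and imputing with the mean, are simplifying assumptions made purely for illustration.

```r
# Hypothetical pre-processing step applied before any analytic routine
handle_missing <- function (x, na_action = c ("error", "ignore", "impute")) {
    na_action <- match.arg (na_action)
    stopifnot (is.numeric (x))
    # G2.12: in this sketch, undefined values are treated like missing values
    x [!is.finite (x)] <- NA
    if (!anyNA (x)) {
        return (x)
    }
    n_miss <- sum (is.na (x))
    switch (na_action,
        # G2.10a: error on missing data
        error = stop (n_miss, " missing or undefined value(s) in input",
                      call. = FALSE),
        # G2.10b: ignore missing data, with a warning
        ignore = {
            warning (n_miss, " missing or undefined value(s) removed",
                     call. = FALSE)
            x [!is.na (x)]
        },
        # G2.10c: replace missing data (mean imputation for illustration only)
        impute = {
            x [is.na (x)] <- mean (x, na.rm = TRUE)
            x
        }
    )
}

handle_missing (c (1, NA, 3, Inf), na_action = "impute")
#> [1] 1 2 3 2
```

Defaulting to an error is a deliberately conservative choice here, so that missing values are never silently dropped.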

3. Output Structures

G3.1 Statistical Software which enables outputs to be written to local files should parse parameters specifying file names to ensure appropriate file suffixes are automatically generated where not provided.
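
A minimal sketch of such parsing, assuming output in a single known format, might look as follows. The function name `ensure_suffix()` is hypothetical, and the decision to stop on a mismatched suffix (rather than append or replace it) is one possible design among several.

```r
# Hypothetical helper: ensure that a file name carries the expected suffix
ensure_suffix <- function (filename, suffix = "csv") {
    stopifnot (is.character (filename), length (filename) == 1L)
    ext <- tools::file_ext (filename)
    if (!nzchar (ext)) {
        # no suffix provided, so generate one automatically
        filename <- paste0 (filename, ".", suffix)
    } else if (tolower (ext) != tolower (suffix)) {
        stop ("File suffix '.", ext, "' does not match expected '.",
              suffix, "'", call. = FALSE)
    }
    filename
}

ensure_suffix ("results")
#> [1] "results.csv"
ensure_suffix ("results.csv")
#> [1] "results.csv"
```

Note that a name such as "my.results" would be read as having the suffix "results", so file names containing dots need a documented convention.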

4. Testing

All packages should follow rOpenSci standards on testing and continuous integration, including aiming for high test coverage. Extant R packages which may be useful for testing include testthat, tinytest, roxytest, and xpectr.

G4.0 Where applicable or practicable, tests should use standard data sets with known properties (for example, the NIST Standard Reference Datasets, or data sets provided by other widely-used R packages).
G4.1 Data sets created within, and used to test, a package should be exported (or otherwise made generally available) so that users can confirm tests and run examples.

For testing statistical algorithms, tests should include tests of the following types:

G4.2 Correctness tests to test that statistical algorithms produce expected results for some fixed test data sets (potentially through comparisons using binding frameworks such as RStata).
- G4.2a For new methods, it can be difficult to separate out correctness of the method from the correctness of the implementation, as there may not be a reference for comparison. In this case, testing may be implemented against simple, trivial cases or against multiple implementations such as an initial R implementation compared with results from a C/C++ implementation.
- G4.2b For new implementations of existing methods, correctness tests should include tests against previous implementations. Such testing may explicitly call those implementations in testing, preferably from fixed versions of other software, or use stored outputs from those where that is not possible.
- G4.2c Where applicable, stored values may be drawn from published paper outputs where code from original implementations is not available
G4.3 Correctness tests should be run with a fixed random seed
G4.4 Parameter recovery tests to test that the implementation produces expected results given data with known properties. For instance, a linear regression algorithm should return expected coefficient values for a simulated data set generated from a linear model.
- G4.4a Parameter recovery tests should generally be expected to succeed within a defined tolerance rather than recovering exact values.
- G4.4b Parameter recovery tests should be run with multiple random seeds when either data simulation or the algorithm contains a random component
G4.5 Algorithm performance tests to test that the implementation performs as expected as properties of data change. For instance, a test may show that parameters approach correct estimates within tolerance as data size increases, or that convergence times decrease for higher convergence thresholds.
G4.6 Edge condition tests to test that these conditions produce expected behaviour such as clear warnings or errors when confronted with data with extreme properties including but not limited to:
- G4.6a Zero-length data,
- G4.6b Data of unsupported types (e.g., character or complex numbers in functions designed only for numeric data)
- G4.6c Data with all-NA fields or columns or all identical fields or columns
- G4.6d Data outside the scope of the algorithm (for example, data with more fields (columns) than observations (rows) for some regression algorithms)
G4.7 Noise susceptibility tests: Packages should test for expected stochastic behaviour, such as through the following conditions:
- G4.7a Adding trivial noise (for example, at the scale of .Machine$double.eps) to data does not meaningfully change results
- G4.7b Running under different random seeds or initial conditions does not meaningfully change results
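
Several of the above test types can be sketched with base R alone, using stats::lm() as a stand-in for the algorithm under test. The example below illustrates parameter recovery within a tolerance under several seeds (G4.4), one edge condition (G4.6a), and noise susceptibility (G4.7a). The helper `simulate_lm_data()`, the model y = 2 + 3x, and the tolerances are all assumptions chosen for illustration; in a package these checks would normally sit inside a testing framework such as testthat.

```r
# Hypothetical helper: simulate data from a known linear model,
# y = 2 + 3 * x + noise
simulate_lm_data <- function (n, seed) {
    set.seed (seed)
    x <- runif (n)
    data.frame (x = x, y = 2 + 3 * x + rnorm (n, sd = 0.1))
}

# G4.4, G4.4a, G4.4b: recover the known coefficients within a tolerance,
# repeated under several random seeds
for (seed in 1:5) {
    dat <- simulate_lm_data (n = 1000, seed = seed)
    est <- unname (coef (lm (y ~ x, data = dat)))
    stopifnot (all (abs (est - c (2, 3)) < 0.1))
}

# G4.6a: zero-length data should produce an error, not a silent result
dat <- simulate_lm_data (n = 1000, seed = 1)
res <- try (lm (y ~ x, data = dat [0, ]), silent = TRUE)
stopifnot (inherits (res, "try-error"))

# G4.7a: trivial noise at the scale of machine precision should not
# meaningfully change results
dat_noisy <- dat
dat_noisy$y <- dat$y + runif (nrow (dat), -1, 1) * .Machine$double.eps
est0 <- coef (lm (y ~ x, data = dat))
est1 <- coef (lm (y ~ x, data = dat_noisy))
stopifnot (isTRUE (all.equal (est0, est1, tolerance = 1e-8)))
```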

4.1 Extended tests

Thorough testing of statistical software may require tests on large data sets, tests with many permutations, or other conditions leading to long-running tests. In such cases it may be neither possible nor advisable to execute tests continuously, or with every code change. Software should nevertheless test any and all conditions regardless of how long tests may take, and in doing so should adhere to the following standards:

G4.8 Extended tests should be included and run under a common framework with other tests, but be switched on by flags such as a <MYPKG>_EXTENDED_TESTS=1 environment variable.
G4.9 Where extended tests require large data sets or other assets, these should be provided for downloading and fetched as part of the testing workflow.
- G4.9a When any downloads of additional data necessary for extended tests fail, the tests themselves should not fail, but should rather be skipped and implicitly succeed with an appropriate diagnostic message.
G4.10 Any conditions necessary to run extended tests such as platform requirements, memory, expected runtime, and artefacts produced that may need manual inspection, should be described in developer documentation such as a CONTRIBUTING.md file.
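
The switching and skipping behaviour described in G4.8 and G4.9a could be sketched as below. The environment variable name `MYPKG_EXTENDED_TESTS` follows the pattern given above with a placeholder package name, and the functions `run_extended()`, `fetch_asset()` and `run_extended_test()` are hypothetical.

```r
# G4.8: extended tests are switched on by an environment variable
run_extended <- function () {
    identical (Sys.getenv ("MYPKG_EXTENDED_TESTS"), "1")
}

# G4.9: fetch an asset needed for testing; return NULL rather than
# raising an error if the download fails
fetch_asset <- function (url, destfile = tempfile ()) {
    ok <- tryCatch (
        suppressWarnings (
            utils::download.file (url, destfile, quiet = TRUE)
        ) == 0L,
        error = function (e) FALSE
    )
    if (isTRUE (ok)) destfile else NULL
}

run_extended_test <- function (url) {
    if (!run_extended ()) {
        message ("Extended tests switched off; skipping.")
        return (invisible ("skipped"))
    }
    f <- fetch_asset (url)
    if (is.null (f)) {
        # G4.9a: a failed download skips the test instead of failing it
        message ("Download failed; skipping extended test.")
        return (invisible ("skipped"))
    }
    # ... long-running tests on the downloaded data would go here ...
    invisible ("run")
}
```

Packages using testthat may be able to achieve the same effect with its skip functions, such as testthat::skip_if_not(), so that skipped tests are reported as skipped rather than passed.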
