We now have preliminary drafts of both general standards for statistical software, and specific extensions into a few initial categories - Regression, Bayesian and Monte Carlo, Exploratory Data Analysis, and Time Series. We would very much appreciate any feedback, comments, suggestions, improvements to any and all of the current standards. Everybody is encouraged to peruse the single “master” document in bookdown form, and to provide feedback in increasing orders of formality in one or more of the following ways:
- The #stats-peer-review slack channel
- The relevant sections of the
- The github repository for the “master” document, either via issues for general discussion, or pull requests for more concrete improvements.
Note that we anticipate some degree of “shuffling” between general and category-specific standards, much of which will be deferred until we have developed standards for all of our anticipated categories. There is thus one and only one form of comment which we are currently not seeking: comments regarding whether category-specific standard X might be better considered general standard Y. That will be worked out later.
Looking forward to any and all feedback from anybody interested in helping to create our system for peer reviewing statistical software. Without further ado, the relevant standards follow immediately below.
General Standards for Statistical Software
These standards refer to Data Types as the following fundamental types defined by the R language itself:
- Continuous (numeric)
- String / character
- Ordered Factor
The following standards describe several forms of what might be considered “Supplementary Material”. While there are many places within an R package where such material may be included, common locations include vignettes, or additional directories (such as `data-raw`) listed in `.Rbuildignore` to prevent inclusion within installed packages.
Where software supports a publication, the following standard applies to all claims made in the publication with regard to software performance (for example, claims of algorithmic scaling or efficiency, or claims of accuracy):
- G1.0 Software should include all code necessary to reproduce results which form the basis of performance claims made in associated publications.
Where claims regarding aspects of software performance are made with respect to other extant R packages, the following standard applies:
- G1.1 Software should include code necessary to compare performance claims with alternative implementations in other R packages.
2. Input Structures
This section considers general standards for Input Structures. These standards may often effectively be addressed through implementing class structures, although this is not a general requirement. Developers are nevertheless encouraged to examine the guide to S3 vectors in the `vctrs` package as an example of the kind of assurances and validation checks that are possible with regard to input data. Systems like those demonstrated in that vignette provide a very effective way to ensure that software remains robust to diverse and unexpected classes and types of input data.
2.1 Uni-variate (Vector) Input
It is important to note for univariate data that single values in R are vectors with a length of one, and that `1` is of exactly the same data type as `1:n`. Given this, inputs expected to be univariate should:
- G2.0 Provide explicit secondary documentation of any expectations on lengths of inputs (generally implying identifying whether an input is expected to be single- or multi-valued)
- G2.1 Provide explicit secondary documentation of expectations on data types of all vector inputs (see the above list).
- G2.2 Appropriately prohibit or restrict submission of multivariate input to parameters expected to be univariate.
- G2.3 Provide appropriate mechanisms to convert between different data types, potentially including:
- G2.3a explicit conversion to
- G2.3b explicit conversion to continuous via
- G2.3c explicit conversion to character via
- G2.3d explicit conversion to factor via
- G2.3e explicit conversion from factor via
- G2.4 Where inputs are expected to be of `factor` type, secondary documentation should explicitly state whether these should be `ordered` or not, and those inputs should provide appropriate error or other routines to ensure inputs follow these expectations.
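The checks implied by G2.0, G2.1, G2.2 and G2.4 can be sketched in a few lines of base R. The helper `check_univariate()` below, including its name and arguments, is a hypothetical illustration rather than part of any package, and real software would likely need more nuanced rules.

```r
# Hypothetical validator; the name and arguments are illustrative only.
check_univariate <- function(x, len = 1L, type = "numeric", ordered = NA) {
    # G2.2: reject multivariate input such as matrices, data frames, or lists
    if (!is.atomic(x) || !is.null(dim(x))) {
        stop("input must be a plain vector, not a multivariate object")
    }
    # G2.0: enforce the documented length (len = NA permits any length)
    if (!is.na(len) && length(x) != len) {
        stop("input must have length ", len, ", not ", length(x))
    }
    # G2.1: enforce the documented data type
    type_ok <- switch(type,
        numeric = is.numeric(x),
        character = is.character(x),
        factor = is.factor(x),
        stop("unsupported type: ", type)
    )
    if (!type_ok) {
        stop("input must be of type ", type)
    }
    # G2.4: where required, enforce ordered or unordered factors
    if (is.factor(x) && !is.na(ordered) && is.ordered(x) != ordered) {
        stop("factor input must ", if (ordered) "" else "not ", "be ordered")
    }
    invisible(x)
}

# A single value is simply a vector of length one
is.vector(1) && length(1) == 1L
check_univariate(0.05) # passes silently
res <- try(check_univariate(c(0.05, 0.1)), silent = TRUE)
inherits(res, "try-error") # the two-valued input is rejected
```

Calling such a validator at the top of every exported function keeps the documented expectations and the enforced expectations in one place.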
2.2 Tabular Input
This sub-section concerns input in “tabular data” forms, implying the two primary distinctions within R itself between `matrix` representations, and `data.frame` and associated representations. Among important differences between these two forms are that `matrix` classes are restricted to storing data of a single uniform type (for example, all `integer` or all `character` values), whereas `data.frame` and associated representations store each column as a list item, allowing different columns to hold values of different types. Further noting that a `matrix` may, as of R version 4.0, be considered as a strictly two-dimensional array, tabular inputs for the purposes of these standards are considered to imply data represented in one or more of the following forms:
- `matrix` form when referring to specifically two-dimensional data of one uniform type
- `array` form as a more general expression, or when referring to data that are not necessarily or strictly two-dimensional
- Extensions such as
The term "`data.frame` and associated forms" is assumed to refer to data represented in either the `base::data.frame` format, and/or any of the classes listed in the final of the above points.
General Standards applicable to software which is intended to accept any one or more of these tabular inputs are then that:
- G2.5 Software should accept as input as many of the above standard tabular forms as possible, including extension to domain-specific forms
- G2.6 Software should provide appropriate conversion routines as part of initial pre-processing to ensure that all other sub-functions of a package receive inputs of a single defined class or type.
- G2.7 Software should issue diagnostic messages for type conversion in which information is lost (such as conversion of variables from factor to character; standardisation of variable names; or removal of meta-data such as those associated with `sf`-format data) or added (such as insertion of variable or column names where none were provided).
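One way to approach G2.6 and G2.7 together is a single pre-processing function that every exported function calls first. The following is a minimal sketch using only base R; the function name `standardise_tabular()` and the choice of a plain `data.frame` as the internal class are assumptions made for illustration.

```r
# Hypothetical pre-processing routine: convert supported tabular input to a
# plain base data.frame, issuing messages whenever information changes.
standardise_tabular <- function(x) {
    if (is.matrix(x)) {
        if (is.null(colnames(x))) {
            # G2.7: information is added
            message("No column names provided; inserting default names")
            colnames(x) <- paste0("V", seq_len(ncol(x)))
        }
        x <- as.data.frame(x, stringsAsFactors = FALSE)
    } else if (is.data.frame(x)) {
        extra <- setdiff(class(x), "data.frame")
        if (length(extra) > 0L) {
            # G2.7: class-specific meta-data may be lost
            message("Dropping additional classes: ",
                    paste(extra, collapse = ", "))
        }
        x <- as.data.frame(x, stringsAsFactors = FALSE)
    } else {
        stop("Unsupported input class: ", class(x)[1])
    }
    is_fct <- vapply(x, is.factor, logical(1))
    if (any(is_fct)) {
        # G2.7: factor levels and their ordering are lost
        message("Converting factor columns to character: ",
                paste(names(x)[is_fct], collapse = ", "))
        x[is_fct] <- lapply(x[is_fct], as.character)
    }
    x
}

out <- standardise_tabular(matrix(1:4, nrow = 2))
class(out)
names(out)
```

All downstream sub-functions can then assume a single input class, which is the intent of G2.6.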
The next standard concerns the following inconsistencies between three common tabular classes with regard to the column extraction operator,

    class (x) # x is any kind of `data.frame` object
    #> [1] "data.frame"
    class (x [, 1])
    #> [1] "integer"
    class (x [, 1, drop = TRUE]) # default
    #> [1] "integer"
    class (x [, 1, drop = FALSE])
    #> [1] "data.frame"

    x <- tibble::tibble (x)
    class (x [, 1])
    #> [1] "tbl_df" "tbl" "data.frame"
    class (x [, 1, drop = TRUE])
    #> [1] "integer"
    class (x [, 1, drop = FALSE]) # default
    #> [1] "tbl_df" "tbl" "data.frame"

    x <- data.table::data.table (x)
    class (x [, 1])
    #> [1] "data.table" "data.frame"
    class (x [, 1, drop = TRUE]) # no effect
    #> [1] "data.table" "data.frame"
    class (x [, 1, drop = FALSE]) # default
    #> [1] "data.table" "data.frame"

- Extracting a single column from a `data.frame` returns a `vector` by default, and a `data.frame` if `drop = FALSE`.
- Extracting a single column from a `tibble` returns a single-column `tibble` by default, and a `vector` if `drop = TRUE`.
- Extracting a single column from a `data.table` always returns a `data.table`, and the `drop` argument has no effect.
Given such inconsistencies,
- G2.8 Software should ensure that extraction or filtering of single columns from tabular inputs does not presume any particular default behaviour, and should ensure all column-extraction operations behave consistently regardless of the class of tabular data used as input.
Adherence to the above standard G2.6 will ensure that any implicitly or explicitly assumed default behaviour will yield consistent results regardless of input classes.
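Where conversion to a single class is not desired, G2.8 can also be approached with a small extraction helper that never relies on `drop` defaults. The sketch below uses a hypothetical function name, `extract_column()`, and relies on `[[`, which returns the underlying column vector for `data.frame` objects and for classes that extend them.

```r
# Hypothetical helper: always return a single column as a plain vector,
# regardless of the class of tabular input.
extract_column <- function(x, i) {
    if (is.matrix(x)) {
        # State `drop` explicitly rather than relying on the default
        return(x[, i, drop = TRUE])
    }
    # `[[` does not depend on any class-specific `drop` behaviour
    x[[i]]
}

df <- data.frame(a = 1:3, b = c("x", "y", "z"), stringsAsFactors = FALSE)
m <- cbind(a = 1:3, b = 4:6)
extract_column(df, "a")
extract_column(m, "a")
```

Both calls return the same integer vector, so code further down the pipeline does not need to know which class was supplied.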
2.3 Missing or Undefined Values
- G2.9 Statistical Software should implement appropriate checks for missing data as part of initial pre-processing prior to passing data to analytic algorithms.
- G2.10 Where possible, all functions should provide options for users to specify how to handle missing (`NA`) data, with options minimally including:
- G2.10a error on missing data
- G2.10b ignore missing data with default warnings or messages issued
- G2.10c replace missing data with appropriately imputed values
- G2.11 Functions should never assume non-missingness, and should never pass data with potential missing values to any base routines with default `na.rm = FALSE`-type parameters (such as
- G2.12 All functions should also appropriately handle undefined values (e.g., `-Inf`), including potentially providing options for ignoring or removing such values.
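The three options listed under G2.10, together with the handling of undefined values in G2.12, might look as follows in a single function. This is only a sketch: the function `robust_mean()`, its `na_action` argument, and the use of median imputation are all assumptions chosen for brevity, not recommendations.

```r
# Hypothetical function illustrating explicit handling of missing and
# undefined values before any computation is done.
robust_mean <- function(x, na_action = c("error", "omit", "impute")) {
    na_action <- match.arg(na_action)
    stopifnot(is.numeric(x))
    # G2.12: treat NaN and +/-Inf as undefined, and say so
    undefined <- is.nan(x) | is.infinite(x)
    if (any(undefined)) {
        warning(sum(undefined), " undefined value(s) treated as missing")
        x[undefined] <- NA
    }
    # G2.9: check for missing data before passing to the algorithm
    n_missing <- sum(is.na(x))
    if (n_missing > 0L) {
        if (na_action == "error") {
            # G2.10a
            stop(n_missing, " missing value(s) found in input")
        } else if (na_action == "omit") {
            # G2.10b
            warning(n_missing, " missing value(s) removed")
            x <- x[!is.na(x)]
        } else {
            # G2.10c: median imputation, purely for illustration
            x[is.na(x)] <- stats::median(x, na.rm = TRUE)
        }
    }
    # G2.11: by this point `x` is known to contain no missing values
    mean(x)
}

robust_mean(c(1, NA, 3, 4), na_action = "impute")
```

Defaulting to `"error"` means that missing data are never silently dropped unless the user asks for that.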
3. Output Structures
- G3.1 Statistical Software which enables outputs to be written to local files should parse parameters specifying file names to ensure appropriate file suffixes are automatically generated where not provided.
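A minimal sketch of G3.1 follows, using `tools::file_ext()` from the base R distribution. The helper name `ensure_suffix()` and the decision to warn rather than fail on an unexpected suffix are assumptions for illustration.

```r
# Hypothetical helper: append the expected suffix where none is provided.
ensure_suffix <- function(filename, suffix = "csv") {
    ext <- tools::file_ext(filename)
    if (!nzchar(ext)) {
        # No suffix given, so generate one automatically
        filename <- paste0(filename, ".", suffix)
    } else if (tolower(ext) != tolower(suffix)) {
        warning("file suffix '.", ext, "' differs from expected '.",
                suffix, "'")
    }
    filename
}

ensure_suffix("results")
ensure_suffix("results.csv")
```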
All packages should follow rOpenSci standards on testing and continuous integration, including aiming for high test coverage. Extant R packages which may be useful for testing include
- G4.0 Where applicable or practicable, tests should use standard data sets with known properties (for example, the NIST Standard Reference Datasets, or data sets provided by other widely-used R packages).
- G4.1 Data sets created within, and used to test, a package should be exported (or otherwise made generally available) so that users can confirm tests and run examples.
For testing statistical algorithms, tests should include tests of the following types:
- G4.2 Correctness tests to test that statistical algorithms produce expected results on some fixed test data sets (potentially through comparisons using binding frameworks such as RStata).
- G4.2a For new methods, it can be difficult to separate out correctness of the method from the correctness of the implementation, as there may not be a reference for comparison. In this case, testing may be implemented against simple, trivial cases or against multiple implementations, such as an initial R implementation compared with results from a C/C++ implementation.
- G4.2b For new implementations of existing methods, correctness tests should include tests against previous implementations. Such testing may explicitly call those implementations in testing, preferably from fixed-versions of other software, or use stored outputs from those where that is not possible.
- G4.2c Where applicable, stored values may be drawn from published paper outputs where code from original implementations is not available.
- G4.3 Correctness tests should be run with a fixed random seed
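As a sketch of G4.2b, a new implementation can be compared against an established one on a standard data set. Here the "new" implementation is a hypothetical `ols_fit()` that solves the normal equations, the reference is `stats::lm()`, and the data are the `cars` data set shipped with R. In a package these comparisons would normally live in the test suite rather than in plain `stopifnot()` calls.

```r
# Hypothetical "new" implementation: ordinary least squares via the
# normal equations (X'X) b = X'y.
ols_fit <- function(x, y) {
    X <- cbind(1, x)
    unname(drop(solve(crossprod(X), crossprod(X, y))))
}

# G4.3: fix the seed even where no randomness is currently expected, so
# that later additions of stochastic components stay reproducible
set.seed(1)

beta_new <- ols_fit(cars$speed, cars$dist)
beta_ref <- unname(coef(lm(dist ~ speed, data = cars)))

# Compare within a numerical tolerance rather than for exact equality
stopifnot(isTRUE(all.equal(beta_new, beta_ref, tolerance = 1e-8)))
```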
- G4.4 Parameter recovery tests to test that the implementation produces expected results given data with known properties. For instance, a linear regression algorithm should return expected coefficient values for a simulated data set generated from a linear model.
- G4.4a Parameter recovery tests should generally be expected to succeed within a defined tolerance rather than recovering exact values.
- G4.4b Parameter recovery tests should be run with multiple random seeds when either data simulation or the algorithm contains a random component
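The linear regression case mentioned in G4.4 can be sketched as follows. The function name `recover_slope()`, the simulation settings, and the tolerance of 0.1 are illustrative assumptions; appropriate tolerances depend on sample size and noise level.

```r
# Simulate data from a known linear model and return the estimated slope.
recover_slope <- function(seed, n = 1000, slope = 2, intercept = -1) {
    set.seed(seed)
    x <- rnorm(n)
    y <- intercept + slope * x + rnorm(n, sd = 0.5)
    unname(coef(lm(y ~ x))["x"])
}

# G4.4b: repeat over multiple random seeds
estimates <- vapply(1:10, recover_slope, numeric(1))

# G4.4a: succeed within a defined tolerance, not for exact recovery
stopifnot(all(abs(estimates - 2) < 0.1))
```

With these settings the standard error of the slope is roughly 0.016, so a tolerance of 0.1 leaves a wide margin against spurious failures.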
- G4.5 Algorithm performance tests to test that an implementation performs as expected as properties of data change. For instance, a test may show that parameters approach correct estimates within tolerance as data size increases, or that convergence times decrease for higher convergence thresholds.
- G4.6 Edge condition tests to test that software produces expected behaviour, such as clear warnings or errors, when confronted with data with extreme properties including but not limited to:
- G4.6a Zero-length data
- G4.6b Data of unsupported types (e.g., character or complex numbers for functions designed only for numeric data)
- G4.6c Data with all-`NA` fields or columns or all identical fields or columns
- G4.6d Data outside the scope of the algorithm (for example, data with more fields (columns) than observations (rows) for some regression algorithms)
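Edge condition tests of this kind can be sketched in base R as below. Both `fit_line()` and the small assertion helper `fails_with()` are hypothetical names introduced for this example; within a package test suite, an expectation function from a testing framework would normally replace `fails_with()`.

```r
# Hypothetical function under test, with explicit edge-condition checks.
fit_line <- function(x, y) {
    if (length(x) == 0L || length(y) == 0L) stop("input has zero length")
    if (!is.numeric(x) || !is.numeric(y)) stop("input must be numeric")
    if (length(x) != length(y)) stop("x and y must have the same length")
    if (all(is.na(x)) || all(is.na(y))) stop("input is entirely NA")
    if (length(unique(x[!is.na(x)])) < 2L) stop("x has no variation")
    coef(lm(y ~ x))
}

# TRUE only if evaluating `expr` raises an error matching `pattern`
fails_with <- function(expr, pattern) {
    msg <- tryCatch({
        expr
        NA_character_
    }, error = function(e) conditionMessage(e))
    !is.na(msg) && grepl(pattern, msg)
}

stopifnot(
    fails_with(fit_line(numeric(0), numeric(0)), "zero length"), # G4.6a
    fails_with(fit_line(letters[1:3], 1:3), "numeric"),          # G4.6b
    fails_with(fit_line(rep(NA_real_, 3), 1:3), "entirely NA"),  # G4.6c
    fails_with(fit_line(rep(1, 3), 1:3), "no variation")         # G4.6c
)
```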
- G4.7 Noise susceptibility tests: packages should test for expected stochastic behaviour, such as through the following conditions:
- G4.7a Adding trivial noise (for example, at the scale of `.Machine$double.eps`) to data does not meaningfully change results
- G4.7b Running under different random seeds or initial conditions does not meaningfully change results
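Both noise susceptibility conditions can be sketched with base R alone. The functions `fit()` and `boot_mean_se()`, the number of bootstrap replicates, and the tolerances used are assumptions for illustration; what counts as a "meaningful" change will differ between algorithms.

```r
set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)

fit <- function(x, y) unname(coef(lm(y ~ x)))

# G4.7a: noise at the scale of machine precision should not matter
noise <- runif(100, -1, 1) * .Machine$double.eps
stopifnot(isTRUE(all.equal(fit(x, y), fit(x + noise, y),
                           tolerance = 1e-10)))

# G4.7b: a stochastic routine (here a bootstrap standard error of the
# mean) should give similar answers under different random seeds
boot_mean_se <- function(x, seed, B = 2000) {
    set.seed(seed)
    sd(replicate(B, mean(sample(x, replace = TRUE))))
}
ses <- vapply(1:5, function(s) boot_mean_se(x, s), numeric(1))

# Relative spread across seeds stays below an (assumed) 10% tolerance
stopifnot(diff(range(ses)) / mean(ses) < 0.1)
```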
4.1 Extended tests
Thorough testing of statistical software may require tests on large data sets, tests with many permutations, or other conditions leading to long-running tests. In such cases it may be neither possible nor advisable to execute tests continuously, or with every code change. Software should nevertheless test any and all conditions regardless of how long tests may take, and in doing so should adhere to the following standards:
- G4.8 Extended tests should be included and run under a common framework with other tests, but be switched on by flags such as a
- G4.9 Where extended tests require large data sets or other assets, these should be provided for downloading and fetched as part of the testing workflow.
- G4.9a When any downloads of additional data necessary for extended tests fail, the tests themselves should not fail, but rather be skipped and implicitly succeed with an appropriate diagnostic message.
- G4.10 Any conditions necessary to run extended tests such as platform requirements, memory, expected runtime, and artefacts produced that may need manual inspection, should be described in developer documentation such as a
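The switching and skipping behaviour described in G4.8 and G4.9a might be sketched as follows. The environment variable name `MYPKG_EXTENDED_TESTS`, the function names, and the convention of passing the download step as a function are all hypothetical; a package using a testing framework would more likely rely on that framework's own skip mechanism.

```r
# Hypothetical flag: extended tests run only when this environment
# variable is set to "true".
run_extended <- function() {
    identical(Sys.getenv("MYPKG_EXTENDED_TESTS"), "true")
}

# `fetch` is a function that downloads and returns the test data;
# `test` is a function that runs assertions on those data.
extended_test <- function(fetch, test) {
    if (!run_extended()) {
        message("extended tests switched off; skipping")
        return(invisible("skipped"))
    }
    # G4.9a: a failed download skips the test rather than failing it
    data <- tryCatch(fetch(), error = function(e) NULL)
    if (is.null(data)) {
        message("download failed; skipping extended test")
        return(invisible("skipped"))
    }
    test(data)
    invisible("passed")
}

# With the flag unset, the test is skipped with a diagnostic message
Sys.unsetenv("MYPKG_EXTENDED_TESTS")
extended_test(
    fetch = function() stop("no network"),
    test = function(d) stopifnot(nrow(d) > 0)
)
```

Passing the download step as a function keeps the skip logic testable without network access.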