We now have preliminary drafts of both general standards for statistical software, and specific extensions into a few initial categories - Regression, Bayesian and Monte Carlo, Exploratory Data Analysis, and Time Series. We would very much appreciate any feedback, comments, suggestions, improvements to any and all of the current standards. Everybody is encouraged to peruse the single “master” document in bookdown form, and to provide feedback in increasing orders of formality in one or more of the following ways:
- The #stats-peer-review slack channel
- The relevant sections of the
discussion forum
- The github repository for the “master” document, either via issues for general discussion, or pull requests for more concrete improvements.
Note that we anticipate some degree of “shuffling” between general and category-specific standards, much of which will be deferred until we have developed standards for all of our anticipated categories. There is thus one and only one form of comment for which we are currently not seeking feedback, which is comments regard whether category-specific standard X might be better considered general standard Y - that will be worked out later.
Looking forward to any and all feedback from anybody interested in helping to create our system for peer reviewing statistical software. Without further ado, the relevant standards follow immediately below.
Exploratory Data Analysis and Summary Statistics
Exploration is a part of all data analyses, and Exploratory Data Analysis (EDA) is not something that is entered into and exited from at some point prior to “real” analysis. Exploratory Analyses are also not strictly limited to Data, but may extend to exploration of Models of those data. The category could thus equally be termed, “Exploratory Data and Model Analysis”, yet we opt to utilise the standard acronym of EDA in this document.
EDA is nevertheless somewhat different to many other categories included within rOpenSci’s program for peer-reviewing statistical software. Primary differences include:
- EDA software often has a strong focus upon visualization, which is a category which we have otherwise explicitly excluded from the scope of the project at the present stage.
- The assessment of EDA software requires addressing more general questions than software in most other categories, notably including the important question of intended audience(s).
The following standards are accordingly somewhat differently structured than equivalent standards developed to date for other categories, particularly through being more qualitative and abstract. In particular, while documentation is an important component of standards for all categories, clear and instructive documentation is of paramount importance for EDA Software, and so warrants its own sub-section within this document.
1. Documentation Standards
The following refer to Primary Documentation, implying in main package README
or vignette(s), and Secondary Documentation, implying function-level documentation.
The Primary Documentation (README
and/or vignette(s)) of EDA software should:
- EA1.0 Identify one or more target audiences for whom the software is intended
- EA1.1 Identify the kinds of data the software is capable of analysing (see Kinds of Data below).
- EA1.2 Identify the kinds of questions the software is intended to help explore; for example, are these questions:
- inferential?
- predictive?
- associative?
- causal?
- (or other modes of statistical enquiry?)
The Secondary Documentation (within individual functions) of EDA software should:
- EA1.3 Identify the kinds of data each function is intended to accept as input
2. Input Data
A further primary difference of EDA software from that of our other categories is that input data for statistical software may be generally presumed of one or more specific types, whereas EDA software often accepts data of more general and varied types. EDA software should aim to accept and appropriately transform as many diverse kinds of input data as possible, through addressing the following standards, considered in terms of the two cases of input data in uni- and multi-variate form. All of the general standards for kinds of input (G2.0 - G2.7) apply to input data for EDA Software.
2.1 Index Columns
The following standards refer to an index column, which is understood to imply an explicitly named or identified column which can be used to provide a unique index index into any and all rows of that table. Index columns ensure the universal applicability of standard table join operations, such as those implemented via the dplyr
package.
- EA2.1 EDA Software which accepts standard rectangular data and implements or relies upon extensive table filter and join operations should utilise an index column system
- EA2.2 All values in an index column must be unique, and this uniqueness should be affirmed as a pre-processing step for all input data.
- EA2.3 Index columns should be explicitly identified, either:
- EA2.3a by using an appropriate class system, or
- EA2.3b through setting an
attribute
on a table, x
, of attr(x, "index") <- <index_col_name>
.
For EDA software which either implements custom classes or explicitly sets attributes specifying index columns, these attributes should be used as the basis of all table join operations, and in particular:
- EA2.4 Table join operations should not be based on any assumed variable or column names
2.2 Multi-tabular input
EDA software designed to accept multi-tabular input should:
- EA2.5 Use and demand an explicit class system for such input (for example, via the
DM
package).
- EA2.6 Ensure all individual tables follow the above standards for Index Columns
2.3 Classes and Sub-Classes
Classes are understood here to be the classes define single input objects, while Sub-Classes refer to the class definitions of components of input objects (for example, of columns of an input data.frame
). EDA software which is intended to receive input in general vector formats (see Uni-variate Input section of General Standards) should ensure:
- EA2.7 Routines appropriately process vector input of custom classes, including those which do not inherit from the
vector
class
- EA2.8 Routines should appropriately process vector data regardless of additional attributes
The following code illustrates some ways by which “metadata” defining classes and additional attributes associated with a standard vector object may by modified.
x <- 1:10
class (x) <- "notvector"
attr (x, "extra_attribute") <- "another attribute"
attr (x, "vector attribute") <- runif (5)
attributes (x)
#> $class
#> [1] "notvector"
#>
#> $extra_attribute
#> [1] "another attribute"
#>
#> $`vector attribute`
#> [1] 0.03521663 0.49418081 0.60129563 0.75804346 0.16073301
All statistical software should appropriately deal with such input data, as exemplified by the storage.mode()
, length()
, and sum()
functions of the base
package, which return the appropriate values regardless of redefinition of class or additional attributes.
storage.mode (x)
#> [1] "integer"
length (x)
#> [1] 10
sum (x)
#> [1] 55
storage.mode (sum (x))
#> [1] "integer"
Rectangular inputs in data.frame
class may contain columns which are themselves defined by custom classes, and which possess additional attributes. EDA Software which accepts rectangular inputs should accordingly ensure:
- EA2.9 EDA routines appropriately process rectangular input of custom classes, ideally by means of a single pre-processing routine which converts rectangular input to some standard form subsequently passed to all analytic routines.
- EA2.10 EDA routines accept and appropriately process rectangular input in which individual columns may be of custom sub-classes including additional attributes.
3. Analytic Algorithms
(There are no specific standards for analytic algorithms in EDA Software.)
4. Return Results / Output Data
- EA4.1 EDA Software should ensure all return results have types which are consistent with input types. For example,
sum
, min
, or max
values applied to integer
-type vectors should return integer
values, while mean
or var
will generally return numeric
types.
- EA4.2 EDA Software should implement parameters to enable explicit control of numeric precision
- EA4.3 The primary routines of EDA Software should return objects for which default
print
and plot
methods give sensible results. Default summary
methods may also be implemented.
5. Visualization and Summary Output
Visualization commonly represents one of the primary functions of EDA Software, and thus visualization output is given greater consideration in this category than in other categories in which visualization may nevertheless play an important role. In particular, one component of this sub-category is Summary Output, taken to refer to all forms of screen-based output beyond conventional graphical output, including tabular and other text-based forms. Standards for visualization itself are considered in the two primary sub-categories of static and dynamic visualization, where the latter includes interactive visualization.
Prior to these individual sub-categories, we consider a few standards applicable to visualization in general, whether static or dynamic.
- EA5.1 Graphical presentation in EDA software should be as accessible as possible or practicable. In particular, EDA software should consider accessibility in terms of:
- EA5.1a Typeface sizes should default to sizes which explicitly enhance accessibility
- EA5.1b Default colour schemes should be carefully constructed to ensure accessibility.
- EA5.2 Any explicit specifications of typefaces which override default values should consider accessibility
5.1 Summary and Screen-based Output
- EA5.3 Screen-based output should never rely on default print formatting of
numeric
types, rather should also use some version of round(., digits)
, formatC
, sprintf
, or similar functions for numeric formatting according the parameter described in EDA4.2.
- EA5.4 Column-based summary statistics should always indicate the
storage.mode
, class
, or equivalent defining attribute of each column (as, for example, implemented in the default print.tibble
method).
5.2 General Standards for Visualization (Static and Dynamic)
- EA5.5 All visualisations should include units on all axes, with sensibly rounded values (for example, as produced by the
pretty()
function).
5.3 Dynamic Visualization
Dynamic visualization routines are commonly implemented as interfaces to javascript
routines. Unless routines have been explicitly developed as an internal part of an R package, standards shall not be considered to apply to the code itself, rather only to decisions present as user-controlled parameters exposed within the R environment. That said, one standard may nevertheless be applied, with an aim to minimise
- EA5.6 Any packages which internally bundle libraries used for dynamic visualization and which are also bundled in other, pre-existing R packages, should explain the necessity and advantage of re-bundling that library.