Statistical Software: Exploratory Data Analysis and Summary Statistics

Hi all! This thread is for discussion of standards for peer review of Exploratory Data Analysis (EDA) and Summary Statistics packages, one of our initial statistical package categories. We’re interested in the community’s input as to what standards should apply specifically to such packages.

EDA/Summary software captures and presents statistical representations of data and inter-relationships. This may involve transforming input data into summary output data in some novel form, including standard summary values specific to a field. EDA/Summary software also generally aids the understanding of data properties or distributions via quantitative procedures, commonly aided by visualisation tools. Note that categories are non-exclusive; EDA/Summary packages may also be, for instance, Time Series packages. We do, however, distinguish this category from unsupervised learning, clustering, or dimensionality-reducing packages, which we will address later.

We’d like your answers to these questions:

  • What should an EDA/Summary package do so as to pass a peer review process?
  • How should EDA/Summary software do that?
  • What should be documented in EDA/Summary packages?
  • How should EDA/Summary software be tested?

Don’t worry if your contributions might apply to other categories. We will look at suggestions across topics to see which should be elevated to more general standards. Please do comment on whether you think a standard should be required or recommended for a package to pass peer-review. We’ll use these comments as input for prioritizing standards in the future.

For some useful reference, see our background chapter. Feel free to drop other references in the thread!


Thinking about this in terms of data used to test out EDA packages, here are some brief thoughts on what data to use for EDA, to test out EDA software:

Perhaps requiring that datasets contain some mix of these types of data, to understand their strengths/weaknesses?

I think this will likely be one of the hardest categories, but here are a few thoughts that I had.

Let’s imagine a 4 dimensional space with the following dimensions:

  1. The type of EDA software roughly categorized into: (i) visualization only, (ii) non-visualization only, or (iii) both viz and non-viz software.
  2. The type of question being asked about the data being explored (predictive, inferential, associative, causal, etc)
  3. The data type (e.g. count data, time series, continuous data, genomic data, etc)
  4. The audience for whom the EDA software is being designed (e.g. a financial analyst without much R training, students in primary school, or an experienced R developer, etc.)

I think the answers to your questions @mpadge from your post on May 29 are going to depend on answering the questions above first.

For example, EDA software that has no visualization component (e.g. only identifies if there is missing data in your dataset) will likely need a different set of standards and/or tests than EDA software that only contains data viz functionality.

I would suggest a questionnaire (with the 4 questions above) be provided to the developer of the EDA software package being submitted for peer review. This would be submitted as part of the peer review process. Then, the person doing the peer review might have a tailored list of standards/tests for each position along this 4 dimensional space. I’m probably forgetting other important dimensions and would welcome feedback on this though.

Finally, I’ll say that one important set of standards / tests that I think can be automated for EDA software is checking the accessibility of the EDA software. For example, if the EDA software produces data visualizations, a standard should be to make sure that the colors are accessible for individuals with a color vision deficiency. Here are some other things we might think about too: Tools | 18F Accessibility Guide

Here is a post about how to simulate color vision deficiency using ggplots, which would be helpful for developers:

Thanks for your insights, Stephanie!

Regarding this:

I’m a little wary of over-complicating the process in this way, but I think the idea could be met by making this part of the documentation standards. This has the benefit of making the role of the EDA package clear to users as well as reviewers. Also, by putting it into the standard, hopefully it drives authors to consider these ideas during development rather than at submission.

For instance, we could include the following in the standard:

  • Package-level documentation (README, website, package help file, etc.) should clearly state:
    • The type of question addressed by the EDA method: prediction, inference, association, causation, etc.
    • The data types to which the method is designed to apply. This includes both structural aspects (e.g., count data, time series, continuous) and topical ones (e.g., genomic, limnological).
    • The expected audience for the package and package outputs (e.g. a financial analyst without much R training, students in primary school, or an experienced R developer, etc.)
  • If these vary across different functions in the package this information may be partially contained in function-level documentation.

Assuming that it is simple to determine whether a package has visualization, we might include the following:

  • If the package includes visualization as part of EDA
    • Documentation should make clear whether visualization is the primary EDA method (e.g. grand tour animations, trelliscope.js), or one output of a numerical EDA routine (e.g., bar charts of group-level summary statistics).
    • Visualizations should conform to the following accessibility standards…

Sounds good, thanks @noamross!

Taking into consideration that I’m new and still learning, there seems to be a lot of useful information for me here. I appreciate it a lot. I was also wondering if I could ask you some more questions, if you don’t mind, noamross? Thanks

@sangeval Absolutely - that is what these threads are intended for, so please feel free to ask all the questions you like.

We now have preliminary drafts of both general standards for statistical software, and specific extensions into a few initial categories - Regression, Bayesian and Monte Carlo, Exploratory Data Analysis, and Time Series. We would very much appreciate any feedback, comments, suggestions, improvements to any and all of the current standards. Everybody is encouraged to peruse the single “master” document in bookdown form, and to provide feedback in increasing orders of formality in one or more of the following ways:

  1. The #stats-peer-review slack channel
  2. The relevant sections of the discussion forum
  3. The github repository for the “master” document, either via issues for general discussion, or pull requests for more concrete improvements.

Note that we anticipate some degree of “shuffling” between general and category-specific standards, much of which will be deferred until we have developed standards for all of our anticipated categories. There is thus one and only one form of comment for which we are currently not seeking feedback, which is comments regarding whether category-specific standard X might be better considered general standard Y - that will be worked out later.

Looking forward to any and all feedback from anybody interested in helping to create our system for peer reviewing statistical software. Without further ado, the relevant standards follow immediately below.

Exploratory Data Analysis and Summary Statistics

Exploration is a part of all data analyses, and Exploratory Data Analysis (EDA) is not something that is entered into and exited from at some point prior to “real” analysis. Exploratory Analyses are also not strictly limited to Data, but may extend to exploration of Models of those data. The category could thus equally be termed, “Exploratory Data and Model Analysis”, yet we opt to utilise the standard acronym of EDA in this document.

EDA is nevertheless somewhat different to many other categories included within rOpenSci’s program for peer-reviewing statistical software. Primary differences include:

  • EDA software often has a strong focus upon visualization, which is a category which we have otherwise explicitly excluded from the scope of the project at the present stage.
  • The assessment of EDA software requires addressing more general questions than software in most other categories, notably including the important question of intended audience(s).

The following standards are accordingly somewhat differently structured than equivalent standards developed to date for other categories, particularly through being more qualitative and abstract. In particular, while documentation is an important component of standards for all categories, clear and instructive documentation is of paramount importance for EDA Software, and so warrants its own sub-section within this document.

1. Documentation Standards

The following refer to Primary Documentation, meaning the main package README or vignette(s), and Secondary Documentation, meaning function-level documentation.

The Primary Documentation (README and/or vignette(s)) of EDA software should:

  • EA1.0 Identify one or more target audiences for whom the software is intended
  • EA1.1 Identify the kinds of data the software is capable of analysing (see Kinds of Data below).
  • EA1.2 Identify the kinds of questions the software is intended to help explore; for example, are these questions:
    • inferential?
    • predictive?
    • associative?
    • causal?
    • (or other modes of statistical enquiry?)

The Secondary Documentation (within individual functions) of EDA software should:

  • EA1.3 Identify the kinds of data each function is intended to accept as input
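
As a rough illustration of EA1.3, function-level documentation written with roxygen2-style comments might state the accepted input explicitly. The function count_missing() below is hypothetical and exists only to carry the documentation; it is a sketch, not a required template.

```r
#' Count missing values in each column of a table
#'
#' @param x A `data.frame`, or any rectangular object which can be
#'   converted via `as.data.frame()`. Columns may be of any atomic type
#'   (integer, double, character, logical, or factor).
#' @return A named integer vector with one element per column of `x`,
#'   giving the number of `NA` values in that column.
#' @export
count_missing <- function (x) {
    x <- as.data.frame (x)
    # sum() of a logical vector returns an integer count
    vapply (x, function (col) sum (is.na (col)), integer (1))
}

count_missing (data.frame (a = c (1, NA), b = c ("x", "y")))
```

The `@param` entry is where the kinds of input data are identified; the `@return` entry describes the type of the result.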

2. Input Data

A further primary difference of EDA software from that of our other categories is that input data for statistical software may be generally presumed of one or more specific types, whereas EDA software often accepts data of more general and varied types. EDA software should aim to accept and appropriately transform as many diverse kinds of input data as possible, through addressing the following standards, considered in terms of the two cases of input data in uni- and multi-variate form. All of the general standards for kinds of input (G2.0 - G2.7) apply to input data for EDA Software.

2.1 Index Columns

The following standards refer to an index column, which is understood to imply an explicitly named or identified column which can be used to provide a unique index into any and all rows of that table. Index columns ensure the universal applicability of standard table join operations, such as those implemented via the dplyr package.

  • EA2.1 EDA Software which accepts standard rectangular data and implements or relies upon extensive table filter and join operations should utilise an index column system
  • EA2.2 All values in an index column must be unique, and this uniqueness should be affirmed as a pre-processing step for all input data.
  • EA2.3 Index columns should be explicitly identified, either:
    • EA2.3a by using an appropriate class system, or
    • EA2.3b through setting an attribute on a table, x, of attr(x, "index") <- <index_col_name>.

For EDA software which either implements custom classes or explicitly sets attributes specifying index columns, these attributes should be used as the basis of all table join operations, and in particular:

  • EA2.4 Table join operations should not be based on any assumed variable or column names
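
The following is a minimal sketch of how EA2.2, EA2.3b and EA2.4 might fit together using base R only. The helper names set_index() and join_by_index() are hypothetical and not taken from any package; a real implementation may well prefer a class system (EA2.3a) instead.

```r
# Hypothetical helpers; names are illustrative only.

# Declare `col` as the index column of `x`, after affirming that all
# of its values are unique (EA2.2, EA2.3b).
set_index <- function (x, col) {
    stopifnot (is.data.frame (x), col %in% names (x))
    if (anyDuplicated (x [[col]]) > 0L) {
        stop ("Index column '", col, "' contains duplicated values")
    }
    attr (x, "index") <- col
    x
}

# Join two tables using only their declared index columns (EA2.4),
# never any assumed column names.
join_by_index <- function (x, y) {
    ix <- attr (x, "index")
    iy <- attr (y, "index")
    if (is.null (ix) || is.null (iy)) {
        stop ("Both tables must declare an index column")
    }
    merge (x, y, by.x = ix, by.y = iy)
}

a <- set_index (data.frame (id = 1:3, height = c (1.2, 3.4, 5.6)), "id")
b <- set_index (
    data.frame (key = c (3L, 1L, 2L), label = c ("c", "a", "b"),
                stringsAsFactors = FALSE),
    "key"
)
joined <- join_by_index (a, b)
```

Note that the two tables are joined even though their index columns have different names. Attributes set in this way may not be preserved by all table operations, so they may need to be re-affirmed after sub-setting or joining.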

2.2 Multi-tabular input

EDA software designed to accept multi-tabular input should:

  • EA2.5 Use and demand an explicit class system for such input (for example, via the dm package).
  • EA2.6 Ensure all individual tables follow the above standards for Index Columns

2.3 Classes and Sub-Classes

Classes are understood here to be the classes which define single input objects, while Sub-Classes refer to the class definitions of components of input objects (for example, of columns of an input data.frame). EDA software which is intended to receive input in general vector formats (see Uni-variate Input section of General Standards) should ensure:

  • EA2.7 Routines appropriately process vector input of custom classes, including those which do not inherit from the vector class
  • EA2.8 Routines should appropriately process vector data regardless of additional attributes

The following code illustrates some ways by which “metadata” defining classes and additional attributes associated with a standard vector object may be modified.

x <- 1:10
class (x) <- "notvector"
attr (x, "extra_attribute") <- "another attribute"
attr (x, "vector attribute") <- runif (5)
attributes (x)
#> $class
#> [1] "notvector"
#> $extra_attribute
#> [1] "another attribute"
#> $`vector attribute`
#> [1] 0.03521663 0.49418081 0.60129563 0.75804346 0.16073301

All statistical software should appropriately deal with such input data, as exemplified by the storage.mode(), length(), and sum() functions of the base package, which return the appropriate values regardless of redefinition of class or additional attributes.

storage.mode (x)
#> [1] "integer"
length (x)
#> [1] 10
sum (x)
#> [1] 55
storage.mode (sum (x))
#> [1] "integer"

Rectangular inputs in data.frame class may contain columns which are themselves defined by custom classes, and which possess additional attributes. EDA Software which accepts rectangular inputs should accordingly ensure:

  • EA2.9 EDA routines appropriately process rectangular input of custom classes, ideally by means of a single pre-processing routine which converts rectangular input to some standard form subsequently passed to all analytic routines.
  • EA2.10 EDA routines accept and appropriately process rectangular input in which individual columns may be of custom sub-classes including additional attributes.
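
One possible shape for the single pre-processing routine suggested in EA2.9 is sketched below. The function standardise_input() is hypothetical, and the choice to strip attributes only from numeric columns is an assumption made for illustration; other software may need to retain some attributes (such as units).

```r
# Hypothetical pre-processing routine; a sketch only.
standardise_input <- function (x) {
    # as.data.frame() dispatches on class, so classes which extend
    # data.frame can be reduced to a standard form here
    x <- as.data.frame (x)
    is_num <- vapply (x, is.numeric, logical (1))
    # Remove custom sub-classes and additional attributes from numeric
    # columns; the storage mode of the values is unchanged
    x [is_num] <- lapply (x [is_num], function (col) {
        attributes (col) <- NULL
        col
    })
    x
}

dat <- data.frame (a = 1:5, b = letters [1:5], stringsAsFactors = FALSE)
class (dat$a) <- "notvector"
attr (dat$a, "extra_attribute") <- "another attribute"

out <- standardise_input (dat)
storage.mode (out$a)
attributes (out$a)
```

All subsequent analytic routines can then presume plain columns, regardless of the classes or attributes of the original input.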

3. Analytic Algorithms

(There are no specific standards for analytic algorithms in EDA Software.)

4. Return Results / Output Data

  • EA4.1 EDA Software should ensure all return results have types which are consistent with input types. For example, sum, min, or max values applied to integer-type vectors should return integer values, while mean or var will generally return numeric types.
  • EA4.2 EDA Software should implement parameters to enable explicit control of numeric precision
  • EA4.3 The primary routines of EDA Software should return objects for which default print and plot methods give sensible results. Default summary methods may also be implemented.
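
These three standards might be combined roughly as follows. The function eda_summary(), its "eda_summary" class, and the digits parameter are all hypothetical names chosen for illustration.

```r
# Hypothetical summary routine; names are illustrative only.
eda_summary <- function (x, digits = 3L) {
    stopifnot (is.numeric (x))
    res <- list (
        min = min (x),   # integer for integer input (EA4.1)
        max = max (x),
        sum = sum (x),
        mean = mean (x), # double regardless of input type
        digits = digits  # explicit control of precision (EA4.2)
    )
    class (res) <- "eda_summary"
    res
}

# Default print method giving a sensible result (EA4.3)
print.eda_summary <- function (x, ...) {
    vals <- unlist (x [c ("min", "max", "sum", "mean")])
    out <- formatC (vals, digits = x$digits, format = "g")
    cat (sprintf ("%-5s %s\n", names (vals), out), sep = "")
    invisible (x)
}

s <- eda_summary (1:10, digits = 4L)
print (s)
```

The returned object stores values in their original types, and numeric precision is applied only at the point of printing.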

5. Visualization and Summary Output

Visualization commonly represents one of the primary functions of EDA Software, and thus visualization output is given greater consideration in this category than in other categories in which visualization may nevertheless play an important role. In particular, one component of this sub-category is Summary Output, taken to refer to all forms of screen-based output beyond conventional graphical output, including tabular and other text-based forms. Standards for visualization itself are considered in the two primary sub-categories of static and dynamic visualization, where the latter includes interactive visualization.

Prior to these individual sub-categories, we consider a few standards applicable to visualization in general, whether static or dynamic.

  • EA5.1 Graphical presentation in EDA software should be as accessible as possible or practicable. In particular, EDA software should consider accessibility in terms of:
    • EA5.1a Typeface sizes should default to sizes which explicitly enhance accessibility
    • EA5.1b Default colour schemes should be carefully constructed to ensure accessibility.
  • EA5.2 Any explicit specifications of typefaces which override default values should consider accessibility
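
As one possible approach to EA5.1b, default colours could be drawn from a palette designed to remain distinguishable under common colour vision deficiencies. The sketch below assumes R >= 4.0.0, in which palette.colors() from the grDevices package offers the "Okabe-Ito" palette; the wrapper default_colours() is a hypothetical name.

```r
# Hypothetical helper returning `n` default colours from the
# Okabe-Ito palette (assumes R >= 4.0.0).
default_colours <- function (n) {
    pal <- grDevices::palette.colors (palette = "Okabe-Ito")
    if (n > length (pal)) {
        stop ("Only ", length (pal), " colours are available")
    }
    unname (pal [seq_len (n)])
}

cols <- default_colours (4)
cols
```

Choice of palette alone does not guarantee accessibility, so this should be read as a starting point rather than a sufficient check.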

5.1 Summary and Screen-based Output

  • EA5.3 Screen-based output should never rely on default print formatting of numeric types, but should rather use some version of round(., digits), formatC, sprintf, or similar functions for numeric formatting according to the parameter described in EA4.2.
  • EA5.4 Column-based summary statistics should always indicate the storage.mode, class, or equivalent defining attribute of each column (as, for example, implemented in the default print.tibble method).
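
A minimal sketch of screen-based column summaries following EA5.3 and EA5.4 is given below. The function summarise_columns() is hypothetical, and its digits parameter stands in for the precision parameter of EA4.2.

```r
# Hypothetical column summary; names are illustrative only.
summarise_columns <- function (x, digits = 2L) {
    stopifnot (is.data.frame (x))
    is_num <- vapply (x, is.numeric, logical (1))
    means <- rep (NA_character_, ncol (x))
    # Explicit numeric formatting rather than default printing (EA5.3)
    means [is_num] <- vapply (x [is_num], function (col) {
        formatC (mean (col, na.rm = TRUE), digits = digits, format = "f")
    }, character (1), USE.NAMES = FALSE)
    data.frame (
        column = names (x),
        # Class and storage mode of each column (EA5.4)
        class = vapply (x, function (col) class (col) [1],
                        character (1), USE.NAMES = FALSE),
        storage = vapply (x, storage.mode, character (1),
                          USE.NAMES = FALSE),
        mean = means,
        row.names = NULL,
        stringsAsFactors = FALSE
    )
}

dat <- data.frame (
    n = 1:4,
    x = c (0.5, 1.25, 2, 3.14159),
    g = letters [1:4],
    stringsAsFactors = FALSE
)
res <- summarise_columns (dat)
print (res)
```

Means are returned as formatted strings because the result is intended for display; a routine intended for further computation would instead return unformatted values.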

5.2 General Standards for Visualization (Static and Dynamic)

  • EA5.5 All visualisations should include units on all axes, with sensibly rounded values (for example, as produced by the pretty() function).
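
For static base graphics, EA5.5 might be met along the following lines. The wrapper plot_with_units() and its arguments are hypothetical; the sketch simply takes axis breaks from pretty() and pastes units into the axis labels.

```r
# Hypothetical plotting wrapper; names are illustrative only.
plot_with_units <- function (x, y, xname, xunit, yname, yunit) {
    # Sensibly rounded axis breaks covering the range of the data
    xb <- pretty (range (x, na.rm = TRUE))
    yb <- pretty (range (y, na.rm = TRUE))
    plot (x, y,
        xlim = range (xb), ylim = range (yb), axes = FALSE,
        xlab = sprintf ("%s (%s)", xname, xunit),
        ylab = sprintf ("%s (%s)", yname, yunit)
    )
    axis (1, at = xb)
    axis (2, at = yb)
    box ()
    invisible (list (x = xb, y = yb))
}
```

A call such as plot_with_units (dist, time, "Distance", "km", "Time", "min"), where dist and time are numeric vectors of equal length, would then label both axes with their units.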

5.3 Dynamic Visualization

Dynamic visualization routines are commonly implemented as interfaces to javascript routines. Unless routines have been explicitly developed as an internal part of an R package, standards shall not be considered to apply to the code itself, rather only to decisions present as user-controlled parameters exposed within the R environment. That said, one standard may nevertheless be applied, with an aim to minimise

  • EA5.6 Any packages which internally bundle libraries used for dynamic visualization and which are also bundled in other, pre-existing R packages, should explain the necessity and advantage of re-bundling that library.