Statistical Software: Exploratory Data Analysis and Summary Statistics

Hi all! This thread is for discussion of standards for peer review of Exploratory Data Analysis (EDA) and Summary Statistics packages, one of our initial statistical package categories. We’re interested in the community’s input as to what standards should apply specifically to such packages.

EDA/Summary software captures and presents statistical representations of data and inter-relationships. This may involve transforming input data into summary output data in some novel form, including standard summary values specific to a field. EDA/Summary software also generally aids the understanding of data properties or distributions via quantitative procedures, commonly supported by visualisation tools. Note that categories are non-exclusive; EDA/Summary packages may also be, for instance, Time Series packages. We do, however, distinguish this category from unsupervised learning, clustering, or dimensionality-reducing packages, which we will address later.

We’d like your answers to these questions:

  • What should an EDA/Summary package do to pass a peer review process?
  • How should EDA/Summary software do that?
  • What should be documented in EDA/Summary packages?
  • How should EDA/Summary software be tested?

Don’t worry if your contributions might apply to other categories. We will look at suggestions across topics to see which should be elevated to more general standards. Please do comment on whether you think a standard should be required or recommended for a package to pass peer-review. We’ll use these comments as input for prioritizing standards in the future.

For some useful reference, see our background chapter. Feel free to drop other references in the thread!


Thinking about this in terms of the data used to test EDA packages, here are some brief thoughts on what data to use:

Perhaps we could require that test datasets contain some mix of these types of data, to understand the software’s strengths and weaknesses?

I think this will likely be one of the hardest categories, but here are a few thoughts that I had.

Let’s imagine a four-dimensional space with the following dimensions:

  1. The type of EDA software, roughly categorized into: (i) visualization only, (ii) non-visualization only, or (iii) both viz and non-viz software.
  2. The type of question being asked about the data being explored (predictive, inferential, associative, causal, etc.)
  3. The data type (e.g. count data, time series, continuous data, genomic data, etc.)
  4. The audience for whom the EDA software is being designed (e.g. a financial analyst without much R training, students in primary school, an experienced R developer, etc.)

I think the answers to your questions @mpadge from your post on May 29 will depend on first answering the questions above.

For example, EDA software that has no visualization component (e.g. it only identifies whether there is missing data in your dataset) will likely need a different set of standards and/or tests than EDA software that only contains data viz functionality.

I would suggest a questionnaire (with the 4 questions above) be provided to the developer of the EDA software package being submitted for peer review. This would be submitted as part of the peer review process. Then, the person doing the peer review might have a tailored list of standards/tests for each position along this four-dimensional space. I’m probably forgetting other important dimensions and would welcome feedback on this though.

Finally, I’ll say that one important set of standards/tests that I think can be automated for EDA software is checking the accessibility of the software. For example, if the EDA software produces data visualizations, a standard should be to make sure that the colors are accessible for individuals with a color vision deficiency. Here are some other things we might think about too: Tools | 18F Accessibility Guide
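
As a rough illustration of what an automated check of this kind could look like, here is a minimal Python sketch that flags pairs of palette colors with low luminance contrast, using the WCAG 2.x relative luminance and contrast ratio formulas. This is an assumption-laden proxy rather than a color vision deficiency simulation: it only tests whether colors differ in lightness, which is one cue that does not depend on hue discrimination. The function names and the 3:1 threshold (borrowed from the WCAG non-text contrast criterion) are illustrative choices, not part of any proposed standard.

```python
from itertools import combinations


def _linearize(channel_8bit):
    """Convert an 8-bit sRGB channel to its linear-light value (WCAG 2.x formula)."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance of a '#RRGGBB' color, from 0 (black) to 1 (white)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(color_a, color_b):
    """WCAG contrast ratio between two colors, from 1 (identical) to 21 (black/white)."""
    lum_a, lum_b = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def low_contrast_pairs(palette, threshold=3.0):
    """Return every pair of palette colors whose contrast ratio is below the threshold.

    The default threshold of 3.0 is an assumption chosen for illustration.
    """
    return [
        (a, b)
        for a, b in combinations(palette, 2)
        if contrast_ratio(a, b) < threshold
    ]
```

A package test suite could call `low_contrast_pairs()` on its default palette and fail if the returned list is non-empty. A fuller check would also simulate specific color vision deficiencies before comparing colors.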

Here is a post about how to simulate color vision deficiency using ggplots, which would be helpful for developers:

Thanks for your insights, Stephanie!

Regarding this:

I’m a little wary of over-complicating the process in this way, but I think the idea could be met by making this part of the documentation standards. This has the benefit of making these answers, and the role of the EDA package, clear to users as well as reviewers. Also, by putting it into the standard, hopefully it drives authors to consider these ideas during development rather than at submission.

For instance, we could include the following in the standard:

  • Package-level documentation (README, website, package help file, etc.) should clearly state:
    • The type of question addressed by the EDA method: prediction, inference, association, causation, etc.
    • The data types to which the method is designed to apply. This includes both structural aspects (e.g., count data, time series, continuous) and topical ones (e.g., genomic, limnological).
    • The expected audience for the package and package outputs (e.g. a financial analyst without much R training, students in primary school, an experienced R developer, etc.)
  • If these vary across different functions in the package, this information may be partially contained in function-level documentation.
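
To make the idea concrete, the three package-level statements above could be captured as a small structured record that a reviewer or an automated pre-check reads. The following is only a sketch in Python: the field names (`question_type`, `data_types`, `audience`) and the helper function are hypothetical and are not part of any existing review tooling.

```python
# Hypothetical field names mirroring the three documentation items listed above.
REQUIRED_FIELDS = ("question_type", "data_types", "audience")


def missing_documentation_fields(declared):
    """Return the required fields that are absent or empty in a package's declaration.

    `declared` is a dict such as one parsed from a package's metadata file.
    """
    missing = []
    for field in REQUIRED_FIELDS:
        value = declared.get(field)
        if isinstance(value, str):
            value = value.strip()  # treat whitespace-only strings as empty
        if not value:
            missing.append(field)
    return missing


# Example declaration for an imaginary package; all values are illustrative.
example = {
    "question_type": "association",
    "data_types": ["count data", "time series"],
    "audience": "",
}
print(missing_documentation_fields(example))
```

In this example only `audience` is reported as missing, since it was left blank.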

Assuming that it is simple to determine whether a package has visualization, we might include the following:

  • If the package includes visualization as part of EDA:
    • Documentation should make clear whether visualization is the primary EDA method (e.g. grand tour animations, trelliscope.js), or one output of a numerical EDA routine (e.g., bar charts of group-level summary statistics).
    • Visualizations should conform to the following accessibility standards…

Sounds good, thanks @noamross!

Taking into consideration that I’m new and still learning, there seems to be a lot of useful information here for me, and I appreciate it a lot. I was also wondering if I could ask you some more questions, if you don’t mind, noamross? Thanks.

@sangeval Absolutely - that is what these threads are intended for, so please feel free to ask all the questions you like.
