Statistical Software: Time Series

mpadge · May 29, 2020, 8:55pm

Hi all! This thread is for discussion of standards for peer review of Time Series packages, one of our initial statistical package categories. We’re interested in the community’s input as to what standards should apply specifically to such packages.

Time Series packages include or use explicit classes for representing and processing time-series data, and/or implement algorithms for manipulating and modeling such data. Note that categories are non-exclusive; time-series packages may also be, for instance, Bayesian or Exploratory Data Analysis packages.

We’d like your answers to these questions:

What should a time series package do so as to pass a peer review process?
How should time series software do that?
What should be documented in time series packages?
How should time series packages be tested?

Don’t worry if your contributions might apply to other categories. We will look at suggestions across topics to see which should be elevated to more general standards. Please do comment on whether you think a standard should be required or recommended for a package to pass peer-review. We’ll use these comments as input for prioritizing standards in the future.

For some useful reference, see our background chapter. For examples see also the CRAN Task View on Time Series Analysis and the Tidy Forecasting Principles guide. Feel free to drop other references in the thread!

mpadge · June 17, 2020, 8:06am

Some links shared by Rebecca Killick for data useful for testing & assessing time-series software:

Time Series Data - International Institute of Forecasters

Mcompetitions (M Forecasting Competitions) · GitHub

mpadge · June 30, 2020, 11:19am

Provisional draft of standards for time-series software - please comment! All those interested are also invited to edit these standards on this hackmd.io document. Note that the following list is intended to illustrate the nature of the standards we envision implementing, and is not exhaustive. We are particularly interested in hearing opinions regarding aspects which we may have missed here.

Time Series Software

Standards

Many of the following standards are written with reference to how software should function. Such aspects can and should often also be tested. Where testing of described functionality is expected, a “(TEST)” is added to the description.

Standards regarding documentation imply doing so at appropriate places within the software; either within functions themselves, within extended Vignettes, within the main README document of a package, or elsewhere.

Class Systems

Time Series Software should use or implement explicit class systems.
Time Series Software may extend common class systems for time series; see the section “Time Series Classes” in the CRAN Task View on Time Series Analysis.
Class systems should require units (unless justified otherwise), such as those offered by the units, anytime, or lubridate packages. (Note that the stats::ts class does not directly support specification of units.) (TEST).
Where units are used, class systems should work with units provided by as many of the above packages and unit systems as possible (TEST).
Where time intervals or periods are admitted, and where these may be months or years, software should be explicit about the system used to represent such, particularly regarding whether a calendar system is used, or whether a year is presumed to have 365 days, 365.2422 days, or some other value (TEST).
Where covariance matrices are returned from functions, these should also use a class system, and potentially also include specification of appropriate units (TEST).

A Class System should:

Ensure strict ordering of the time, frequency, or equivalent ordering index variable (TEST).
Catch any violations of ordering in the pre-processing stages of all functions (TEST).
Where covariance matrices are generated or used, ensure that ordering of rows and columns is maintained and/or not able to be violated (TEST).

Stationarity

Time Series Software should explicitly document assumptions or requirements made with respect to the stationarity or otherwise of all input data. In particular, any (sub-)functions which assume or rely on stationarity should:

Consider stationarity of all relevant moments - typically first (mean) and second (variance) order (TEST), or otherwise document why such consideration may be restricted to lower orders only.
Implement appropriate checks for stationarity (TEST).
Either:
- issue diagnostic messages or warnings (TEST); or
- enable or advise on appropriate transformations to ensure stationarity (TEST).

Missing Values

Time Series Software should deal appropriately with missing values.

All functions which accept time series as input data should perform appropriate checks and associated steps as part of initial pre-processing prior to passing data to analytic algorithms.
Where possible, all functions should provide options for users to specify how to handle missing data, with options minimally including:
- error on missing data (TEST).
- warn or ignore missing data, and proceed to analyse irregular data, ensuring that results from function calls with regular yet missing data return identical values to submitting equivalent irregular data with no missing values (TEST).
- replace missing data with appropriately imputed values (TEST).

Forecasting

Where Time Series Software implements or otherwise enables forecasting abilities, it should:

Permit limits on forecasting horizon to be specified in terms of maximal threshold or divergence criteria (such as in terms of standard errors), either as:
- additional parameters to algorithmic routines alongside input data (TEST); or
- additional post-processing functions to trim output data to only those within specified threshold.
Always return either:
- A distribution object, for example via one of the many packages described in the CRAN Task View on Probability Distributions (or the new distributional package as used in the fable package for time-series forecasting) (TEST).
- At least twice as many output variables as there are variables being forecast (one variable for mean or first-order predictions, and a second for variance or second-order predictions) (TEST).

Visualization

Time Series Software should:

Implement default plot methods for any implemented class system (TEST).
When representing results in temporal domain(s), ensure that one axis is clearly labelled “time” (or equivalent), with continuous units.
Default to placing the “time” (or equivalent) variable on the horizontal axis.
Ensure that units of the time, frequency, or index variable are printed by default on the axis.
For frequency visualization, abscissa spanning $[-\pi, \pi]$ should be avoided in favour of positive units of $[0, 2\pi]$ or $[0, 0.5]$, in all cases with appropriate additional explanation of units.
Provide options to determine whether plots of data with missing values should generate continuous or broken lines.
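
As a rough illustration of the frequency-axis point, base R's spec.pgram() reports frequencies on the positive range up to 0.5 cycles per sampling interval for a series with unit sampling frequency. The series below is simulated purely for illustration, and the axis label is only one possible way of explaining the units.

```r
# Simulated series with a known period of 10 samples (illustrative only).
set.seed(1)
n <- 200
x <- sin(2 * pi * (1:n) / 10) + rnorm(n, sd = 0.3)

# Periodogram without plotting; `freq` is in cycles per sampling interval.
pg <- spec.pgram(x, plot = FALSE)

range(pg$freq)                            # positive, with an upper limit of 0.5
peak_freq <- pg$freq[which.max(pg$spec)]
peak_freq                                 # close to 1 / 10

# Axis label states the units explicitly.
plot(pg$freq, pg$spec, type = "l", log = "y",
     xlab = "Frequency (cycles per sampling interval)",
     ylab = "Spectral density")
```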

For the results of forecast operations, Time Series Software should

By default indicate distributional limits of forecast on plot

Refer to examples below for further clarification of these points.

(… those examples are then given in the associated hackmd.io document.)

mpadge · August 24, 2020, 11:14am

We now have preliminary drafts of both general standards for statistical software, and specific extensions into a few initial categories - Regression, Bayesian and Monte Carlo, Exploratory Data Analysis, and Time Series. We would very much appreciate any feedback, comments, suggestions, improvements to any and all of the current standards. Everybody is encouraged to peruse the single “master” document in bookdown form, and to provide feedback in increasing order of formality in one or more of the following ways:

The #stats-peer-review slack channel
The relevant sections of the discussion forum
The github repository for the “master” document, either via issues for general discussion, or pull requests for more concrete improvements.

Note that we anticipate some degree of “shuffling” between general and category-specific standards, much of which will be deferred until we have developed standards for all of our anticipated categories. There is thus one and only one form of comment for which we are currently not seeking feedback, which is comments regarding whether category-specific standard X might be better considered general standard Y - that will be worked out later.

Looking forward to any and all feedback from anybody interested in helping to create our system for peer reviewing statistical software. Without further ado, the relevant standards follow immediately below.

Standards for Time Series Software

Time series software is presumed to perform one or more of the following steps:

Accept and validate input data
Apply data transformation and pre-processing steps
Apply one or more analytic algorithms
Return the result of that algorithmic application
Offer additional functionality such as printing or summarising return results

This document details standards for each of these steps, each prefixed with “TS”.

1. Input data structures and validation

Input validation is an important software task, and an important part of our standards. While there are many ways to approach validation, the class systems of R offer a particularly convenient and effective means. For Time Series Software in particular, a range of class systems have been developed, for which we refer to the section “Time Series Classes” in the CRAN Task View on Time Series Analysis, and the class-conversion package tsbox. Software which uses and relies on defined classes can often validate input through affirming appropriate class(es). Software which does not use or rely on class systems will generally need specific routines to validate input data structures. In particular, because of the long history of time series software in R, and the variety of class systems for representing time series data, new time series packages should accept as many different classes of input as possible by according with the following standards:

TS1.1 Time Series Software should explicitly document the types and classes of input data able to be passed to each function.
TS1.2 Time Series Software should accept input data in as many time series specific classes as possible.
TS1.3 Time Series Software should implement validation routines to confirm that inputs are of acceptable classes (or represented in otherwise appropriate ways for software which does not use class systems).
TS1.4 Time Series Software should implement a single pre-processing routine to validate input data, and to appropriately transform it to a single uniform type to be passed to all subsequent data-processing functions (the tsbox package provides one convenient approach for this).
TS1.5 The pre-processing function described above should maintain all time- or date-based components or attributes of input data.
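
A minimal sketch of the single pre-processing routine described in TS1.3 to TS1.5, using base R only. The function name as_uniform_series() and the choice of a two-column data frame as the internal format are assumptions made for illustration; a real package would likely accept more classes, for example via tsbox.

```r
# Hypothetical helper: validate input and convert it to one internal format,
# a data frame with columns `time` and `value`.
as_uniform_series <- function(x) {
  if (is.ts(x)) {
    if (is.matrix(x)) stop("only univariate 'ts' objects are handled here")
    out <- data.frame(time = as.numeric(time(x)), value = as.numeric(x))
  } else if (is.data.frame(x)) {
    if (!all(c("time", "value") %in% names(x))) {
      stop("data frames must have columns 'time' and 'value'")
    }
    # Date or POSIXct columns keep their class here (TS1.5).
    out <- data.frame(time = x$time, value = x$value)
  } else {
    stop("unsupported input class: ", paste(class(x), collapse = "/"))
  }
  if (!is.numeric(out$value)) stop("'value' must be numeric")
  out
}

str(as_uniform_series(ts(c(5, 7, 6), start = 2001)))
str(as_uniform_series(data.frame(time = as.Date("2020-01-01") + 0:2,
                                 value = c(1.2, 0.8, 1.1))))
```

All downstream functions could then assume this one format, rather than each re-implementing its own class checks.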

For Time Series Software which relies on or implements custom classes or types for representing time-series data, the following standards should be adhered to:

TS1.6 The software should ensure strict ordering of the time, frequency, or equivalent ordering index variable.
TS1.7 Any violations of ordering should be caught in the pre-processing stages of all functions.
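
The ordering check of TS1.6 and TS1.7 could be sketched as follows. The function name check_strict_order() is hypothetical, and treating a tied index value as a violation is one reading of "strict" ordering.

```r
# Hypothetical validator: stop unless the index is strictly increasing.
check_strict_order <- function(index) {
  if (anyNA(index)) stop("index contains missing values")
  steps <- diff(as.numeric(index))
  if (any(steps <= 0)) {
    bad <- which(steps <= 0)[1] + 1
    stop("index is not strictly increasing at position ", bad)
  }
  invisible(TRUE)
}

check_strict_order(as.Date("2020-01-01") + c(0, 1, 2))   # passes silently

# A repeated index value is caught before any analysis is run.
res <- try(check_strict_order(c(1, 2, 2, 3)), silent = TRUE)
inherits(res, "try-error")                                # TRUE
```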

1.1 Time Intervals and Relative Time

While most common packages and classes for time series data assume absolute temporal scales such as those represented in POSIX classes for dates or times, time series may also be quantified on relative scales where the temporal index variable quantifies intervals rather than absolute times or dates. Many analytic routines which accept time series inputs in absolute form are also appropriately applied to analogous data in relative form, and thus many packages should accept time series inputs both in absolute and relative forms. Software which can or should accept time series inputs in relative form should:

TS1.8 Accept inputs defined via the units package for attributing SI units to R vectors.
TS1.9 Where time intervals or periods may be days or months, be explicit about the system used to represent such, particularly regarding whether a calendar system is used, or whether a year is presumed to have 365 days, 365.2422 days, or some other value.
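
To see why TS1.9 matters, the following base R sketch compares a calendar year with a fixed 365-day year, starting from a leap year. The start date is an arbitrary choice for illustration.

```r
start <- as.Date("2020-01-01")   # 2020 is a leap year

# Calendar arithmetic: one calendar year later.
calendar_year <- seq(start, by = "year", length.out = 2)[2]

# Fixed-length convention: a "year" of exactly 365 days.
fixed_365 <- start + 365

calendar_year   # "2021-01-01"
fixed_365       # "2020-12-31"

# Length of this calendar year in days, against a mean year of 365.2422 days.
days_in_calendar_year <- as.numeric(calendar_year - start)
days_in_calendar_year - 365.2422
```

The two conventions already disagree by a day after one step, so software that accepts intervals in months or years needs to state which one it uses.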

2. Pre-processing and Variable Transformation

2.1 Missing Data

One critical pre-processing step for Time Series Software is the appropriate handling of missing data. It is convenient to distinguish between implicit and explicit missing data. For regular time series, explicit missing data may be represented by NA values, while for irregular time series, implicit missing data may be represented by missing rows. The difference is demonstrated in the following table.

Missing Values

Time	value
08:43	0.71
08:44	NA
08:45	0.28
08:47	0.34
08:48	0.07

The value for 08:46 is implicitly missing, while the value for 08:44 is explicitly missing. These two forms of missingness may connote different things, and may require different forms of pre-processing. With this in mind, the following standards apply:

TS2.1 Appropriate checks for missing data, and associated transformation routines, should be performed as part of initial pre-processing prior to passing data to analytic algorithms.
TS2.2 Time Series Software which presumes or requires regular data should only allow explicit missing values, and should issue appropriate diagnostic messages, potentially including errors, in response to any implicit missing values.
TS2.3 Where possible, all functions should provide options for users to specify how to handle missing data, with options minimally including:
- TS2.3a error on missing data.
- TS2.3b warn or ignore missing data, and proceed to analyse irregular data, ensuring that results from function calls with regular yet missing data return identical values to submitting equivalent irregular data with no missing values.
- TS2.3c replace missing data with appropriately imputed values.
TS2.4 Functions should never assume non-missingness, and should never pass data with potential missing values to any base routines with default na.rm = FALSE-type parameters (such as mean(), sd() or var()).
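
The distinction between implicit and explicit missing values, and the options of TS2.3, can be sketched in base R using the table above. The helper names make_explicit() and handle_missing(), the argument na_action, and the use of linear interpolation for imputation are all assumptions made for illustration.

```r
# The table above as a data frame; the date and time zone are arbitrary.
obs <- data.frame(
  time  = as.POSIXct(paste("2020-01-01",
                           c("08:43", "08:44", "08:45", "08:47", "08:48")),
                     tz = "UTC"),
  value = c(0.71, NA, 0.28, 0.34, 0.07)
)

# Hypothetical helper: put the series on a regular grid so that implicitly
# missing rows (here 08:46) become explicit NA values.
make_explicit <- function(series, by = "1 min") {
  grid <- seq(min(series$time), max(series$time), by = by)
  data.frame(time = grid, value = series$value[match(grid, series$time)])
}

# Hypothetical helper offering the three options of TS2.3.
handle_missing <- function(series, na_action = c("error", "warn", "impute")) {
  na_action <- match.arg(na_action)
  miss <- is.na(series$value)
  if (!any(miss)) return(series)
  switch(na_action,
    error = stop(sum(miss), " missing value(s) in input"),
    warn = {
      warning("dropping ", sum(miss), " missing value(s); result is irregular")
      series[!miss, , drop = FALSE]
    },
    impute = {
      # Linear interpolation in time; one of many possible imputation choices.
      series$value <- approx(as.numeric(series$time[!miss]), series$value[!miss],
                             xout = as.numeric(series$time), rule = 2)$y
      series
    }
  )
}

regular <- make_explicit(obs)
format(regular$time[is.na(regular$value)], "%H:%M")   # "08:44" "08:46"
imputed <- handle_missing(regular, "impute")
mean(regular$value)   # NA: the default na.rm = FALSE propagates missing values
mean(imputed$value)
```

With the "warn" option, the regular series with explicit NA values and the original irregular series reduce to the same rows, which is the equivalence TS2.3b asks for.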

2.2 Stationarity

Time Series Software should explicitly document assumptions or requirements made with respect to the stationarity or otherwise of all input data. In particular, any (sub-)functions which assume or rely on stationarity should:

TS2.5 Consider stationarity of all relevant moments - typically first (mean) and second (variance) order, or otherwise document why such consideration may be restricted to lower orders only.
TS2.6 Explicitly document all assumptions and/or requirements of stationarity
TS2.7 Implement appropriate checks for all relevant forms of stationarity, and either:
- TS2.7a issue diagnostic messages or warnings; or
- TS2.7b enable or advise on appropriate transformations to ensure stationarity.

The two options in the last point (TS2.7b) respectively translate to enabling transformations to ensure stationarity by providing appropriate routines, generally triggered by some function parameter, or advising on appropriate transformations, for example by directing users to additional functions able to implement appropriate transformations.
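
One possible shape for such a check, sketched with the Phillips-Perron unit-root test from base R's stats package. The wrapper check_unit_root() and its alpha argument are hypothetical, and a unit-root test only addresses one form of non-stationarity; it says nothing about changes in variance, which TS2.5 would also have a package consider.

```r
# Hypothetical check built on the Phillips-Perron test, PP.test().
# Null hypothesis: the series has a unit root (non-stationary).
check_unit_root <- function(x, alpha = 0.05) {
  p <- PP.test(x)$p.value
  if (p > alpha) {
    # TS2.7a: diagnostic warning; TS2.7b: advice on a transformation.
    warning("unit root not rejected (p = ", signif(p, 2),
            "); consider differencing, for example with diff()")
  }
  invisible(p)
}

set.seed(42)
noise <- rnorm(500)          # stationary white noise
walk  <- cumsum(rnorm(500))  # random walk, non-stationary by construction

p_noise <- check_unit_root(noise)
p_walk  <- suppressWarnings(check_unit_root(walk))
p_diff  <- check_unit_root(diff(walk))   # differencing recovers white noise
c(noise = p_noise, walk = p_walk, differenced = p_diff)
```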

2.3 Covariance Matrices

Where covariance matrices are constructed or otherwise used within or as input to functions, they should:

TS2.8 Incorporate a system to ensure that both row and column orders follow the same ordering as the underlying time series data. This may, for example, be done by including the index attribute of the time series data as an attribute of the covariance matrix.
TS2.9 Where applicable, covariance matrices should also include specification of appropriate units.
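
A sketch of the approach suggested in TS2.8, storing the time index with the covariance matrix so that row and column order can be verified. The AR(1)-style covariance, the parameter values, and the function check_cov_order() are illustrative assumptions.

```r
# Time index and an AR(1)-style covariance between time points
# (phi and sigma2 are arbitrary illustrative values).
index  <- as.Date("2020-01-01") + 0:3
phi    <- 0.6
sigma2 <- 2
lags   <- abs(outer(seq_along(index), seq_along(index), "-"))
covmat <- sigma2 * phi^lags / (1 - phi^2)

# Carry the ordering with the matrix: as dimnames and as an attribute.
dimnames(covmat) <- list(format(index), format(index))
attr(covmat, "index") <- index

# Hypothetical check that rows, columns, and index agree and are ordered.
check_cov_order <- function(m) {
  idx <- attr(m, "index")
  stopifnot(
    !is.null(idx),
    identical(rownames(m), colnames(m)),
    identical(rownames(m), format(idx)),
    all(diff(as.numeric(idx)) > 0)
  )
  invisible(TRUE)
}

check_cov_order(covmat)
```

Note that subsetting a matrix with `[` drops custom attributes such as "index" but keeps dimnames, which is one reason to store the ordering in both places.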

3. Analytic Algorithms

Analytic algorithms are considered here to reflect the core analytic components of Time Series Software. These may be many and varied, and we explicitly consider only a small subset here.

3.1 Forecasting

Statistical software which implements forecasting routines should:

TS3.1 Provide tests to demonstrate at least one case in which errors widen appropriately with forecast horizon.
TS3.2 If possible, provide at least one test which violates TS3.1
TS3.3 Document the general drivers of forecast errors or horizons, as demonstrated via the particular cases of TS3.1 and TS3.2
TS3.4 Either:
- TS3.4a Document, preferably via an example, how to trim forecast values based on a specified error margin or equivalent; or
- TS3.4b Provide an explicit mechanism to trim forecast values to a specified error margin, either via an explicit post-processing function, or via an input parameter to a primary analytic function.
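
The following base R sketch shows what tests for TS3.1 and TS3.2 and a trimming mechanism for TS3.4b might look like. The model choices are illustrative only, and trim_forecast() with its max_se argument is a hypothetical helper rather than part of any existing package.

```r
# Random-walk model, ARIMA(0, 1, 0), fitted to the built-in Nile series.
fit <- arima(Nile, order = c(0, 1, 0))
fc  <- predict(fit, n.ahead = 10)

# TS3.1: standard errors should widen with the horizon for this model.
se <- as.numeric(fc$se)
all(diff(se) > 0)

# TS3.2: for a stationary AR(1) fitted to white noise, standard errors
# level off almost immediately rather than continuing to widen.
set.seed(1)
fit_ar <- arima(rnorm(200), order = c(1, 0, 0))
se_ar  <- as.numeric(predict(fit_ar, n.ahead = 10)$se)
diff(range(se_ar)) / se_ar[1]     # small relative spread

# TS3.4b: hypothetical helper trimming forecasts to a maximum standard error.
trim_forecast <- function(fc, max_se) {
  keep <- as.numeric(fc$se) <= max_se
  data.frame(horizon = which(keep),
             mean = as.numeric(fc$pred)[keep],
             se = as.numeric(fc$se)[keep])
}
trim_forecast(fc, max_se = 2 * se[1])
```

The contrast between the two fits is the kind of "general driver" TS3.3 asks to be documented: forecast errors keep growing for integrated models but converge for stationary ones.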

4. Return Results

For (functions within) Time Series Software which return time series data:

TS4.1 Return values should either:
- TS4.1a Be in the same class as input data, for example by using the tsbox package to re-convert from standard internal format (see TS1.4, above); or
- TS4.1b Be in a unique, preferably class-defined, format.
TS4.2 Any units included as attributes of input data should also be included within return values.
TS4.3 The type and class of all return values should be explicitly documented.

For (functions within) Time Series Software which return data other than direct series:

TS4.4 Return values should explicitly include all appropriate units and/or time scales

4.1 Data Transformation

Time Series Software which internally implements routines for transforming data to achieve stationarity and which returns forecast values should:

TS4.5 Document the effect of any such transformations on forecast data, including potential effects on both first- and second-order estimates.
TS4.6 In decreasing order of preference, either:
- TS4.6a Provide explicit routines or options to back-transform data commensurate with original, non-stationary input data
- TS4.6b Demonstrate how data may be back-transformed to a form commensurate with original, non-stationary input data.
- TS4.6c Document associated limitations on forecast values
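
As an illustration of TS4.5 and TS4.6, the sketch below forecasts on a log scale and back-transforms the results. The model order and the 1.96 multiplier for an approximate 95% interval are illustrative choices, and the mean correction assumes log-normal forecast errors.

```r
# Fit on the log scale with first differencing (illustrative model choice).
fit <- arima(log(AirPassengers), order = c(0, 1, 1))
fc  <- predict(fit, n.ahead = 12)

# Back-transform point forecasts and an approximate 95% interval.
# exp() of the log-scale mean estimates the median on the original scale,
# and the interval becomes asymmetric around it.
back <- data.frame(
  median = exp(as.numeric(fc$pred)),
  lower  = exp(as.numeric(fc$pred - 1.96 * fc$se)),
  upper  = exp(as.numeric(fc$pred + 1.96 * fc$se))
)

# Under log-normality the mean on the original scale exceeds the median.
back$mean <- exp(as.numeric(fc$pred + fc$se^2 / 2))
head(back, 3)
```

This is the kind of effect on first- and second-order estimates that TS4.5 asks to be documented: a naive exp() of the forecast gives a median rather than a mean, and standard errors do not carry over unchanged.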

4.2 Forecasting

Where Time Series Software implements or otherwise enables forecasting abilities, it should return one of the following three kinds of information. These are presented in decreasing order of preference, such that software should strive to return the first kind of object, failing that the second, and only the third as a last resort.

TS4.7 Time Series Software which implements or otherwise enables forecasting should return either:
- TS4.7a A distribution object, for example via one of the many packages described in the CRAN Task View on Probability Distributions (or the new distributional package as used in the fable package for time-series forecasting).
- TS4.7b For each variable to be forecast, predicted values equivalent to first- and second-order moments (for example, mean and standard error values).
- TS4.7c Some more general indication of error involved with forecast estimates.

Beyond these particular standards for return objects, Time Series Software which implements or otherwise enables forecasting should:

TS4.8 Ensure that forecast (modelled) values are clearly distinguished from observed (model or input) values, either (in this case in no order of preference) by
- TS4.8a Returning forecast values alone
- TS4.8b Returning distinct list items for model and forecast values
- TS4.8c Combining model and forecast values into a single return object with an appropriate additional column clearly distinguishing the two kinds of data.
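
A sketch of a return object following TS4.7b and TS4.8c, using base R. The function forecast_rw(), the random-walk model, and the column names are assumptions for illustration.

```r
# Hypothetical forecasting wrapper returning observed and forecast values in
# one data frame, with a `type` column separating them (TS4.8c) and first- and
# second-order estimates for each forecast value (TS4.7b).
forecast_rw <- function(x, h = 5) {
  stopifnot(is.ts(x))
  fc <- predict(arima(x, order = c(0, 1, 0)), n.ahead = h)
  observed <- data.frame(time = as.numeric(time(x)), mean = as.numeric(x),
                         se = NA_real_, type = "observed")
  forecast <- data.frame(time = as.numeric(time(fc$pred)),
                         mean = as.numeric(fc$pred),
                         se = as.numeric(fc$se), type = "forecast")
  rbind(observed, forecast)
}

out <- forecast_rw(Nile, h = 5)
table(out$type)
tail(out, 6)
```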

5. Visualization

Time Series Software should:

TS5.1 Implement default plot methods for any implemented class system.
TS5.2 When representing results in temporal domain(s), ensure that one axis is clearly labelled “time” (or equivalent), with continuous units.
TS5.3 Default to placing the “time” (or equivalent) variable on the horizontal axis.
TS5.4 Ensure that units of the time, frequency, or index variable are printed by default on the axis.
TS5.5 For frequency visualization, abscissa spanning $[-\pi, \pi]$ should be avoided in favour of positive units of $[0, 2\pi]$ or $[0, 0.5]$, in all cases with appropriate additional explanation of units.
TS5.6 Provide options to determine whether plots of data with missing values should generate continuous or broken lines.

For the results of forecast operations, Time Series Software should

TS5.7 By default indicate distributional limits of forecast on plot
TS5.8 By default include model (input) values in plot, as well as forecast (output) values
TS5.9 By default provide clear visual distinction between model (input) values and forecast (output) values.
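
A base graphics sketch of a forecast plot along the lines of TS5.7 to TS5.9. The model, colours, and the approximate 95% band are illustrative choices, not a prescribed default.

```r
# Random-walk forecast of the built-in Nile series (illustrative model).
fit <- arima(Nile, order = c(0, 1, 0))
fc  <- predict(fit, n.ahead = 10)
t_fc  <- as.numeric(time(fc$pred))
lower <- as.numeric(fc$pred - 1.96 * fc$se)
upper <- as.numeric(fc$pred + 1.96 * fc$se)

# Observed values, with time on the horizontal axis and units in the labels.
plot(Nile, xlim = range(time(Nile), t_fc), ylim = range(Nile, lower, upper),
     xlab = "Time (year)", ylab = "Annual flow (10^8 m^3)")
# Shaded band: approximate 95% limits of the forecast distribution (TS5.7).
polygon(c(t_fc, rev(t_fc)), c(lower, rev(upper)), col = "grey85", border = NA)
# Forecast mean in a distinct colour and line type (TS5.9).
lines(t_fc, as.numeric(fc$pred), col = "blue", lty = 2, lwd = 2)
legend("topright", legend = c("observed", "forecast mean", "95% limits"),
       col = c("black", "blue", "grey85"), lty = c(1, 2, 1), lwd = c(1, 2, 8),
       bty = "n")
```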

bonushenricus · June 7, 2021, 3:46pm

Hello
This is my first post.
I am an agricultural technician and I am working on a project on agroforestry and Halyomorpha halys in Italy.
I am not a computer scientist but I am working to create a collaboration with a group of computer scientists that I know who want to set up a cooperative, who are in my area.
Normally I would write to them, but for convenience I try to enter the forum in the meantime.
I have read mpadge’s questions.
I am rather inexperienced with R, I am trying some packages, waiting for the cooperative’s computer scientists to have time to collaborate on the subject.
I try to explain what I would need: a package that manages observations of irregular intervals, and a regularized analysis of them.
I give an example of my work: we have installed traps and given farmers a form in kobotoolbox, but the observations do not have a regular frequency, and each observation is the total of the catches for the period from the previous observation up to the present. The result we would like is simply the number of weekly catches, but taking into account that if an observation is made, for example, on a Thursday and then on the following Tuesday, the Tuesday observation also includes Halyomorpha caught on Friday, Saturday and Sunday, which formally should be counted in the previous week.
Perhaps it is a problem with a trivial solution, but for now I have not figured it out.
I hope I have made a contribution to understanding the issue from the side of common research activities and techniques in agricultural entomology.
Thank you
