Statistical Software: Spatial Analyses

Our project for reviewing statistical software now has another set of completed standards, this time for “Spatial Software”. The standards can be viewed in the main project book. We’re keen to receive any and all feedback on these standards, particularly on the core section of “Algorithmic Standards” (5.9.3). The standards for spatial software differ notably from those for our other categories in that the core algorithmic standards apply primarily to packages which also fit into other categories (regression, unsupervised learning, and machine learning). Specific questions on which we’d like input and discussion include:

  • What might we be missing in these standards?
  • Are there other aspects of spatial statistical algorithms which might be sufficiently general to be expressible and generally applicable as standards?
  • Are there any other category-specific aspects of spatial software which might be appropriately expressed via our standards?

For ease of reference, these core algorithmic standards are:

The following standards will be conditionally applicable to some but not all spatial software. Procedures for standards deemed not applicable are described in the R package of this project.

  • SP3.0 Spatial software which considers spatial neighbours should enable user control over neighbourhood forms and sizes. In particular:
    • SP3.0a Neighbours (able to be expressed) on regular grids should be able to be considered as either rectangular only, or rectangular and diagonal (respectively “rook” and “queen” by analogy to chess).
    • SP3.0b Neighbourhoods in irregular spaces should be minimally able to be controlled via an integer number of neighbours, an area (or equivalent distance defining an area) in which to include neighbours, or otherwise equivalent user-controlled value.
  • SP3.1 Spatial software which considers spatial neighbours should enable neighbour contributions to be weighted by distance (or other weighting variable), and not rely on a uniform-weight rectangular cut-off.
  • SP3.2 Spatial software which relies on sampling from input data (even if only of spatial coordinates) should enable sampling procedures to be based on local spatial densities of those input data.
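
As a rough illustration of SP3.0a and SP3.1, here is a minimal Python sketch of user-selectable “rook” and “queen” neighbourhoods on a regular grid, with neighbour contributions weighted by inverse distance. The function names (`grid_neighbours`, `inverse_distance_weights`) and the `form` and `power` parameters are hypothetical, chosen only for this example, and inverse-distance weighting is just one of many possible weighting schemes.

```python
import math

# Hypothetical helpers for illustration; not taken from any package in this thread.
ROOK = [(-1, 0), (1, 0), (0, -1), (0, 1)]            # rectangular only
QUEEN = ROOK + [(-1, -1), (-1, 1), (1, -1), (1, 1)]  # rectangular and diagonal


def grid_neighbours(row, col, nrow, ncol, form="rook"):
    """Return the (row, col) neighbours of one cell, clipped to the grid."""
    if form not in ("rook", "queen"):
        raise ValueError("form must be 'rook' or 'queen'")
    offsets = ROOK if form == "rook" else QUEEN
    return [(row + dr, col + dc) for dr, dc in offsets
            if 0 <= row + dr < nrow and 0 <= col + dc < ncol]


def inverse_distance_weights(cell, neighbours, power=1.0):
    """Weights proportional to 1 / distance**power, normalised to sum to 1."""
    raw = [1.0 / math.dist(cell, nb) ** power for nb in neighbours]
    total = sum(raw)
    return [w / total for w in raw]


# Centre cell of a 3 x 3 grid: diagonal neighbours receive smaller weights.
nbs = grid_neighbours(1, 1, 3, 3, form="queen")
weights = inverse_distance_weights((1, 1), nbs)
```

Exposing `form` and `power` as arguments is the point here: the user, rather than the software, decides the neighbourhood shape and how quickly neighbour influence decays.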

Algorithms for spatial software are often related to other categories of statistical software, and it is anticipated that spatial software will commonly also be subject to standards from these other categories. Nevertheless, because spatial analyses frequently face unique challenges, some of these category-specific standards also have extension standards when applied to spatial software. The following standards will be applicable for any spatial software which also fits any of the other listed categories of statistical software.

Regression Software

  • SP3.3 Spatial regression software should explicitly quantify and distinguish autocovariant or autoregressive processes from those covariant or regressive processes not directly related to spatial structure alone.

Unsupervised Learning Software

The following standard applies to any spatial unsupervised learning software which uses clustering algorithms.

  • SP3.4 Spatial clustering should not use standard non-spatial clustering algorithms in which spatial proximity is merely represented by an additional weighting factor. Rather, clustering schemes should be derived from explicitly spatial algorithms.

Machine Learning Software

One common application in which machine learning algorithms are applied to spatial software is in analyses of raster images. The first of the following standards applies because the individual cells or pixels of these raster images represent fixed spatial coordinates. (This standard also renders ML2.1 inapplicable.)

  • SP3.5 Spatial machine learning software should ensure that broadcasting procedures for reconciling inputs of different dimensions are not applied.
  • SP3.6 Spatial machine learning software should ensure that test and training data are spatially distinct, and not simply sampled uniformly from a common region.

The latter standard, SP3.6, is commonly met by applying some form of spatial partitioning to data, and using spatially distinct partitions to define test and training data.
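
One simple form of spatial partitioning can be sketched as follows, in Python: points are binned into square blocks, and whole blocks are assigned to either the test or the training set. The function name `spatial_block_split` and its parameters (`block_size`, `test_fraction`, `seed`) are hypothetical and used only for this illustration. This is one possible partitioning scheme among many, not a prescribed method.

```python
import random


def spatial_block_split(coords, block_size, test_fraction=0.25, seed=0):
    """Assign whole square blocks of points to the test or training set.

    coords: list of (x, y) tuples; block_size: block edge length, same units.
    Returns (train_indices, test_indices).
    """
    # Identify the block containing each point by flooring its coordinates.
    block_of = [(int(x // block_size), int(y // block_size)) for x, y in coords]
    blocks = sorted(set(block_of))
    rng = random.Random(seed)
    rng.shuffle(blocks)
    # Hold out at least one block for testing.
    n_test = max(1, round(test_fraction * len(blocks)))
    test_blocks = set(blocks[:n_test])
    train = [i for i, b in enumerate(block_of) if b not in test_blocks]
    test = [i for i, b in enumerate(block_of) if b in test_blocks]
    return train, test


# 10 x 10 lattice of points split into four 5 x 5 blocks, one held out.
coords = [(float(i), float(j)) for i in range(10) for j in range(10)]
train_idx, test_idx = spatial_block_split(coords, block_size=5.0)
```

Points in adjacent blocks can still lie close to one another along a shared edge, so an application might additionally want a buffer between test and training blocks; that is left out of this sketch.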


I think you’re using rectilinear/curvilinear where it doesn’t apply (imo) - these are quite technical terms that refer to independence of axes for array dimension coordinates (very very domain specific), whereas I think you mean “Cartesian” vs. “angular” spaces, generally. The other aspect, whether a given projection that defines a Cartesian space derived from angular coordinates (lonlat) is suitable is pretty key, and I’ll be reviewing the text along those lines.

ok to just dump thoughts here like this? I’ll do PRs once I get warmed up :wink:

here’s the specific reference I use for rectilinear/curvilinear grids: Regular grid - Wikipedia

that’s adopted in stars package conceptually too

Thanks @mdsumner, a rectifying PR would be greatly appreciated! My terminology rests on my likely outdated understanding of usage of these terms from back in my days as a physicist. “Curvilinear” seems to have retained the sense I intend, and gets its own wiki-entry (as “curvilinear coordinates”), while “Rectilinear” gets a disambiguation leading to lots of pages, but no equivalent “rectilinear coordinates” page. So based on that, yes, an edit or two of my terminology would indeed seem in order.

My inclinations were motivated by:

  1. Wanting to use “rectilinear coordinates/systems” to refer to the rectilinear complement of curvilinear in exactly the sense conveyed by the wikipedia entry; and
  2. Wanting to avoid nominative reference to some random d**d w***e m*n singular person as having originated the generic idea of a rectilinear coordinate system.

Which is to say, in making a PR, any efforts to avoid having to use a term which is obligatorily capitalised would be appreciated and serve the ongoing improvement of practices of scientific nomenclature in general.

Some subtleties in this that might or might not be worth expanding on. Minor point, perhaps make this applicable not just to “machine learning” but any methods that are predictive in nature. Main point - I’m not sure that “spatially distinct” is universally appropriate or sufficient. The test set should be in some sense independent from the training set, of course, so that the testing is encountering “new” data. That might be accommodated via spatial distinctness, but more generally is often done by considering sampling units. e.g. I have a bunch of observers who do surveys in the same general area - the “sampling unit” here might be “observer” (or “survey ID”) - so it might be appropriate to stratify train/test by observer or survey, even if they overlap in space. If I stratify by space alone, I will presumably have data from all observers in both the training and testing sets - which might or might not be a good thing, depending on the application details. Or, if the predictive algorithm is essentially one of spatial interpolation, then it might make no sense at all to try and enforce spatial separation of train/test data.

So perhaps a rewording along the lines of “Spatial software that is predictive in nature … should consider whether training and test data should be sampled from spatially distinct regions … noting that the appropriate stratification of training and test data might depend on other factors including the nature of the algorithm and its application …”
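
To make the sampling-unit idea concrete, here is a rough Python sketch in which the train/test split is stratified by observer rather than by location, so that every record from a given observer lands wholly on one side. The name `group_split`, its parameters, and the observer labels are all made up for the example.

```python
import random


def group_split(groups, test_fraction=0.25, seed=0):
    """Split record indices so that each group is wholly in train or in test."""
    units = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(units)
    # Hold out at least one sampling unit for testing.
    n_test = max(1, round(test_fraction * len(units)))
    test_units = set(units[:n_test])
    train = [i for i, g in enumerate(groups) if g not in test_units]
    test = [i for i, g in enumerate(groups) if g in test_units]
    return train, test


# One label per survey record; records may overlap in space.
observers = ["ann", "ann", "bob", "bob", "cat", "cat", "dan", "dan"]
train_idx, test_idx = group_split(observers)
```

The same function would work with survey IDs, or with spatial block IDs, as the grouping key; which one is appropriate depends on the application.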

Update: OK, I hadn’t realized that 3.6 was specifically about raster images. I’ll leave the comment here but you can probably disregard it!

Disregard I shall not Ben, because the point is definitely still valid. We’re attempting to formulate standards as succinctly as possible which always risks loss of nuance of exactly the kind you highlight. And yes, 3.6 is formulated to apply to raster images only, because of exactly the point you make, but nevertheless if you think the wording could be improved anywhere at all, then please PR suggestions! Maybe your concerns indicate a more general applicability of that standard, yet in a slightly reformulated way along the lines you suggest?