Thanks for responses thus far, particularly given that you’ve all joined in for the first time just to respond to this particular discussion. That’s very encouraging - thanks!
@michaelweylandt Thank you for your particularly considered responses, to which a few thoughts:
- Would “Simulation and Markov Chain Monte Carlo” be a better “bucket”? (Than current “Bayesian and Monte Carlo Algorithms”)
I am personally not satisfied with that categorical title, which arose through initially distinguishing those two in the background empirical research, and subsequently combining them due to them being very closely related (in statistical terms). I see the “Bayesian” bit being a key - and appropriate - term here, and yes that would mean packages like INLA would tick this categorical box. I actually have issues with “Monte Carlo”, and would personally prefer a term that connoted what is actually happening. The problem is that that quickly and ambiguously overlaps with general simulation approaches, so I remain unsure of a resolution. Further input greatly appreciated!
- Dimensionality Reduction and Feature Selection → “DR and Feature Engineering”?
Yeah, maybe. PCA, lasso, MDS, and all those would tick this category. You are right that a lot or most work “isn’t selective”, so I’d be tempted to agree with your suggestion there. Feel free to PR that one into our main document if you like.
- Machine Learning: Would supervised / unsupervised make for useful sub-categories?
Our subjective discussions have always assumed exactly that distinction. As main doc explains, the current list is merely reflective of what we attempted to establish an empirical basis for, which is not to say that they are necessarily optimal in any way. A distinction between supervised and unsupervised will likely be necessary somewhere, and I can definitely see a utility within this category, but also a need to note that such a distinction might be useful in other categories as well. Maybe it might be better to leave this as a single category, and have another check box where appropriate defining whether methods are supervised or not? The categories, along with potential additional items like that, are ultimately intended to guide the development of standards and assessment procedures, so the question is really whether standards for and the assessment of supervised versus unsuperivsed ML algorithms differ? (yes!) But: Do standards for, and the assessment of, supervised versus unsupervised algorithms for feature engineering differ? For EDA? For …? Again: Further suggestions appreciated
- Regression and Interpolation: Would classification go in here as well?
Yes, classification would (frequently) go in here, but also likely in Feature Engineering, ML, and other categories. Again, the issue is really about standards and assessment. Whether or not the end point is classification or prediction or interpolation, algorithms based on regression techniques are - accordingly to the hypothesis of this category - sufficiently similar to be subject to comparable standards and assessments. In this sense, while distinguishing linear from non-linear might be useful, I do not expect there to be much difference in terms of standards or assessment, and would be tempted to leave that distinction in mind as one to potentially develop in response to a direct or perceived need actually arising.
- Probability Distributions - Is this just providing
d
, p
, q
… r
functions or is there more?
There is more. This category emerges as quite a distinct cluster in our analyses, and is related to things like maximum likelihood techniques and density estimators. It is also a category for which distinct standards and assessments will likely be able to be developed with reference both to R’s very highly developed representations and techniques for probability densities and distributions, as well as things like the US National Institute of Standards and Technolgy’s collection of reference data sets, described in our document here. It’s more anticipated as a checkbox for whether a package handles or uses probability distributions at all, in which case we definitely anticipate distinct assessments being applicable.
- Wrapper Packages - Does this include … other R packages … as well?
Yes, that is certainly the vision, even if we primarily anticipate it being used to describe packages which offer wrappers around software originally written in other languages.
- EDA - combine with summary statistics? … include graphics in here as well?
Yes on both scores. EDA currently has no text in main doc because we’re unsure what to do with that, or where it might rightly belong. Merging with EDA would likely make sense. As for graphics: We have an “unwritten rule” which we have discussed extensively that we will (initially) exclude packages the primary aim of which is graphical representations, yet obviously many packages implement graphical routines, and particularly those in this category. And yes, distinguishing primary from secondary is likely to be not trivial, but to the extent that we may presume such a distinction, graphical routines will generally be assessed as some kind of “optional extra”.
- Workflow software - How broadly are y’all defining “workflow”?
Currently an admittedly ill-defined and, yes, broad notion along the lines of software that “support[s] tasks around the statistical workflow”. Workflow is indeed anticipated to be largely as general as possible, yet with the abiding restriction that the workflows which are supported must be predominantly and primarily statistical. See the above link for more detail.
Thanks for the great feedback, and please feel free to continue the discussion here.